IBM Storage Scale

5.1.9

Problem Determination Guide

IBM

SC28-3476-02
Note
Before using this information and the product it supports, read the information in “Notices” on page 925.

This edition applies to Version 5 release 1 modification 9 of the following products, and to all subsequent releases and
modifications until otherwise indicated in new editions:
• IBM Storage Scale Data Management Edition ordered through Passport Advantage® (product number 5737-F34)
• IBM Storage Scale Data Access Edition ordered through Passport Advantage (product number 5737-I39)
• IBM Storage Scale Erasure Code Edition ordered through Passport Advantage (product number 5737-J34)
• IBM Storage Scale Data Management Edition ordered through AAS (product numbers 5641-DM1, DM3, DM5)
• IBM Storage Scale Data Access Edition ordered through AAS (product numbers 5641-DA1, DA3, DA5)
• IBM Storage Scale Data Management Edition for IBM® ESS (product number 5765-DME)
• IBM Storage Scale Data Access Edition for IBM ESS (product number 5765-DAE)
• IBM Storage Scale Backup ordered through Passport Advantage® (product number 5900-AXJ)
• IBM Storage Scale Backup ordered through AAS (product numbers 5641-BU1, BU3, BU5)
• IBM Storage Scale Backup for IBM® Storage Scale System (product number 5765-BU1)
Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change.
IBM welcomes your comments; see the topic “How to send your comments” on page xlii. When you send information
to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without
incurring any obligation to you.
© Copyright International Business Machines Corporation 2015, 2024.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
Contents

Tables..................................................................................................................xv

About this information........................................................................................ xxi


Prerequisite and related information.........................................................................................................xli
Conventions used in this information........................................................................................................ xli
How to send your comments.................................................................................................................... xlii

Summary of changes..........................................................................................xliii

Chapter 1. Monitoring system health by using IBM Storage Scale GUI......................1


Monitoring events by using GUI...................................................................................................................2
Event notifications....................................................................................................................................... 4
Configuring email notifications.............................................................................................................. 4
Configuring SNMP manager....................................................................................................................5
Monitoring tips by using GUI....................................................................................................................... 7
Monitoring thresholds by using GUI............................................................................................................ 7

Chapter 2. Monitoring system health by using the mmhealth command................. 13


Monitoring the health of a node................................................................................................................ 13
Running a user-defined script when an event is raised...................................................................... 17
Event type and monitoring status for system health................................................................................ 18
Creating, raising, and finding custom defined events......................................................................... 19
Threshold monitoring for system health................................................................................................... 22
Threshold monitoring prerequisites.................................................................................................... 22
Predefined and user-defined thresholds.............................................................................................22
Active threshold monitor role.............................................................................................................. 24
System health monitoring use cases........................................................................................................ 25
Threshold monitoring use cases.......................................................................................................... 29
Configuring webhook by using the mmhealth command......................................................................... 44
Webhook JSON data.............................................................................................................................47
Additional checks on file system availability for CES exported data........................................................49
Proactive system health alerts.................................................................................................................. 50

Chapter 3. Dynamic page pool monitoring............................................................ 53

Chapter 4. Performance monitoring......................................................................55


Network performance monitoring............................................................................................................. 55
Monitoring networks by using GUI.......................................................................................................57
Monitoring I/O performance with the mmpmon command..................................................................... 59
Overview of mmpmon.......................................................................................................................... 59
Specifying input to the mmpmon command....................................................................................... 60
Display I/O statistics per mounted file system................................................................................... 62
Display I/O statistics for the entire node.............................................................................................65
Understanding the node list facility..................................................................................................... 67
Reset statistics to zero......................................................................................................................... 75
Understanding the request histogram facility..................................................................................... 76
Understanding the Remote Procedure Call (RPC) facility................................................................... 88
Displaying mmpmon version................................................................................................................93
Example mmpmon scenarios and how to analyze and interpret their results................................... 94

Other information about mmpmon output........................................................................................103
Using the performance monitoring tool.................................................................................................. 105
Configuring the performance monitoring tool................................................................................... 106
Starting and stopping the performance monitoring tool...................................................................153
Restarting the performance monitoring tool.....................................................................................154
Configuring the metrics to collect performance data....................................................................... 154
Removing non-detectable resource identifiers from the performance monitoring tool database..154
Measurements....................................................................................................................................156
Viewing and analyzing the performance data.........................................................................................158
Performance monitoring using IBM Storage Scale GUI.................................................................... 158
Viewing performance data with mmperfmon.................................................................................... 170
Using IBM Storage Scale performance monitoring bridge with Grafana..........................................176

Chapter 5. Monitoring GPUDirect storage............................................................177

Chapter 6. Monitoring events through callbacks..................................................179

Chapter 7. Monitoring capacity through GUI....................................................... 181

Chapter 8. Monitoring AFM and AFM DR..............................................................185


Monitoring fileset states for AFM............................................................................................................ 185
Monitoring fileset states for AFM DR.......................................................................................................189
Monitoring health and events..................................................................................................................194
Monitoring with mmhealth................................................................................................................ 194
Monitoring callback events for AFM and AFM DR............................................................................. 194
Monitoring performance.......................................................................................................................... 195
Monitoring using mmpmon ..................................................................................................................196
Monitoring using mmperfmon............................................................................................................ 196
Monitoring prefetch................................................................................................................................. 197
Monitoring status using mmdiag............................................................................................................. 197
Policies used for monitoring AFM and AFM DR.......................................................................................199
Monitoring AFM and AFM DR using GUI..................................................................................................200

Chapter 9. Monitoring AFM to cloud object storage..............................................205


Monitoring fileset states for AFM to cloud object storage......................................................................205
Monitoring health and events..................................................................................................................207
Monitoring performance.......................................................................................................................... 208
Monitoring using mmpmon...................................................................................................................208
Monitoring using mmperfmon............................................................................................................ 209
Monitoring AFM to cloud object storage download and upload.............................................................209

Chapter 10. GPFS SNMP support........................................................................211


Installing Net-SNMP................................................................................................................................ 211
Configuring Net-SNMP.............................................................................................................................212
Configuring management applications................................................................................................... 213
Installing MIB files on the collector node and management node........................................................ 214
Collector node administration.................................................................................................................215
Starting and stopping the SNMP subagent............................................................................................. 216
The management and monitoring subagent...........................................................................................216
SNMP object IDs................................................................................................................................ 218
MIB objects........................................................................................................................................ 218
Cluster status information................................................................................................................. 219
Cluster configuration information......................................................................................................219
Node status information.................................................................................................................... 220
Node configuration information.........................................................................................................220
File system status information.......................................................................................................... 221

File system performance information............................................................................................... 222
Storage pool information................................................................................................................... 223
Disk status information...................................................................................................................... 223
Disk configuration information.......................................................................................................... 224
Disk performance information........................................................................................................... 224
Net-SNMP traps..................................................................................................................................225

Chapter 11. Monitoring the IBM Storage Scale system by using call home............227
Uploading custom files using call home................................................................................................. 227

Chapter 12. Monitoring remote cluster through GUI............................................ 229

Chapter 13. Monitoring file audit logging............................................................ 233


Monitoring file audit logging states......................................................................................................... 233
Monitoring the file audit logging fileset for events................................................................................. 233
Monitoring file audit logging using mmhealth commands.....................................................................234
Monitoring file audit logging using the GUI.............................................................................................235
Monitoring file audit logging using audit log parser................................................................................235
Example of parsing file audit logs with python....................................................................................... 235
Monitoring file audit logging with rsyslog and SELinux.......................................................................... 236

Chapter 14. Monitoring clustered watch folder................................................... 237


Monitoring clustered watch folder states............................................................................................... 237
Monitoring clustered watch folder with the mmhealth command........................................................ 237
Monitoring clustered watch folder with the mmwatch status command.............................................. 238
Monitoring clustered watch folder using GUI......................................................................................... 239

Chapter 15. Monitoring local read-only cache..................................................... 241


Monitoring health and events with the mmhealth commands..............................................................241
Monitoring LROC status using mmdiag command..................................................................................242
Monitoring file contents in LROC with mmcachectl command.............................................................. 243

Chapter 16. Best practices for troubleshooting................................................... 245


How to get started with troubleshooting................................................................................................ 245
Back up your data.................................................................................................................................... 245
Resolve events in a timely manner......................................................................................................... 246
Keep your software up to date................................................................................................................ 246
Subscribe to the support notification......................................................................................................246
Know your IBM warranty and maintenance agreement details............................................................. 247
Know how to report a problem................................................................................................................247
Other problem determination hints and tips.......................................................................................... 247
Which physical disk is associated with a logical volume in AIX systems?....................................... 248
Which nodes in my cluster are quorum nodes?................................................................................ 248
What is stored in the /tmp/mmfs directory and why does it sometimes disappear?...................... 249
Why does my system load increase significantly during the night?................................................. 249
What do I do if I receive message 6027-648?..................................................................................249
Why can't I see my newly mounted Windows file system?.............................................................. 249
Why is the file system mounted on the wrong drive letter?..............................................................250
Why does the offline mmfsck command fail with "Error creating internal storage"?...................... 250
Why do I get timeout executing function error message?................................................................ 250
Questions related to active file management................................................................................... 250

Chapter 17. Understanding the system limitations..............................................253

Chapter 18. Collecting details of the issues.........................................................255


Collecting details of issues by using logs, dumps, and traces............................................................... 255

Time stamp in GPFS log entries.........................................................................................................255
Logs.....................................................................................................................................................256
Setting up core dumps on a client RHEL or SLES system................................................................. 281
Configuration changes required on protocol nodes to collect core dump data............................... 282
Setting up an Ubuntu system to capture crash files......................................................................... 283
Trace facility....................................................................................................................................... 283
Collecting diagnostic data through GUI.................................................................................................. 294
CLI commands for collecting issue details............................................................................................. 295
Using the gpfs.snap command.......................................................................................................295
mmdumpperfdata command.............................................................................................................310
mmfsadm command.......................................................................................................................... 312
Commands for GPFS cluster state information.................................................................................312
GPFS file system and disk information commands...........................................................................317
Collecting details of the issues from performance monitoring tools..................................................... 330
Other problem determination tools........................................................................................................ 331

Chapter 19. Managing deadlocks........................................................................333


Debug data for deadlocks........................................................................................................................333
Automated deadlock detection...............................................................................................................334
Automated deadlock data collection...................................................................................................... 335
Automated deadlock breakup.................................................................................................................336
Deadlock breakup on demand................................................................................................................ 337

Chapter 20. Installation and configuration issues............................................... 339


Resolving most frequent problems related to installation, deployment, and upgrade.........................339
Finding deployment related error messages more easily and using them for failure analysis....... 339
Problems due to missing prerequisites............................................................................................. 339
Problems due to mixed operating system levels in the cluster........................................................ 342
Problems due to using the installation toolkit for functions or configurations not supported........ 343
Understanding supported upgrade functions with installation toolkit.............................................346
Installation toolkit setup command fails after upgrade to Ubuntu 22.04............................................. 347
Installation toolkit fails with Python not found error............................................................................. 347
Installation toolkit fails on Ubuntu 20.04.4 nodes with Ansible related error...................................... 347
Installation toolkit Ansible package troubleshooting if it fails for already installed ansible for Red Hat Enterprise Linux® 9.0 and >= Red Hat Enterprise Linux 8.6................................... 348
Installation toolkit fails if running yum commands result in warning or error.......................................348
Installation toolkit operation fails with PKey parsing or OpenSSH keys related errors........................ 348
Installation toolkit setup fails with an ssh-agent related error..............................................................349
systemctl commands time out during installation, deployment, or upgrade with the installation toolkit....................................................................... 349
Installation toolkit setup on Ubuntu fails due to dpkg database lock issue..........................................349
Installation toolkit config populate operation fails to detect object endpoint...................................... 350
Post installation and configuration problems......................................................................................... 350
Cluster is crashed after re-installation....................................................................................................351
Node cannot be added to the GPFS cluster............................................................................................351
Problems with the /etc/hosts file............................................................................................................ 352
Linux configuration considerations......................................................................................................... 352
Python conflicts while deploying object packages using installation toolkit.........................................352
Problems with running commands on other nodes................................................................................353
Authorization problems..................................................................................................................... 353
Connectivity problems....................................................................................................................... 354
GPFS error messages for rsh problems.............................................................................................354
Cluster configuration data file issues......................................................................................................354
GPFS cluster configuration data file issues....................................................................................... 354
GPFS error messages for cluster configuration data file problems..................................................355
Recovery from loss of GPFS cluster configuration data file..............................................................355
Automatic backup of the GPFS cluster data......................................................................................356

Installation of gpfs.gpfsbin reports an error........................................................................................... 356
GPFS application calls............................................................................................................................. 356
Error numbers specific to GPFS applications calls........................................................................... 357
GPFS modules cannot be loaded on Linux............................................................................................. 357
GPFS daemon issues............................................................................................................................... 358
GPFS daemon does not come up.......................................................................................................358
GPFS daemon went down..................................................................................................................361
GPFS commands are unsuccessful......................................................................................................... 362
GPFS error messages for unsuccessful GPFS commands................................................................ 364
Quorum loss.............................................................................................................................................364
CES configuration issues......................................................................................................................... 365
Application program errors..................................................................................................................... 365
GPFS error messages for application program errors.......................................................................366
Windows issues....................................................................................................................................... 366
Home and .ssh directory ownership and permissions...................................................................... 366
Problems running as Administrator...................................................................................................367
GPFS Windows and SMB2 protocol (CIFS serving)...........................................................................367

Chapter 21. Upgrade issues............................................................................... 369


Installation toolkit setup command fails on RHEL 7.x nodes with setuptools package-related error. 369
Upgrade precheck by using the installation toolkit might fail on protocol nodes in an ESS environment..........................................................................370
Home cluster unable to unload mmfs modules for upgrades................................................................370
File conflict issue while upgrading SLES on IBM Storage Scale nodes..................................................370
NSD nodes cannot connect to storage after upgrading from SLES 15 SP1 to SP2................................370
After upgrading IBM Storage Scale code, trying to mark events as read returns a server error message...............................................................................371
RDMA adapters not supporting ATOMIC operations.............................................................................. 371

Chapter 22. CCR issues...................................................................................... 373

Chapter 23. Network issues............................................................................... 375


IBM Storage Scale failures due to a network failure.............................................................................. 375
OpenSSH connection delays................................................................................................................... 375
Analyze network problems with the mmnetverify command................................................................ 375

Chapter 24. File system issues........................................................................... 377


File system fails to mount....................................................................................................................... 377
GPFS error messages for file system mount problems.....................................................................379
Error numbers specific to GPFS application calls when a file system mount is not successful...... 380
Mount failure due to client nodes joining before NSD servers are online........................................ 380
File system fails to unmount................................................................................................................... 381
Remote node expelled after remote file system successfully mounted................................................382
File system forced unmount....................................................................................................................382
Additional failure group considerations............................................................................................ 383
GPFS error messages for file system forced unmount problems..................................................... 384
Error numbers specific to GPFS application calls when a file system has been forced to unmount....................................................................... 384
Automount file system does not mount..................................................................................................385
Steps to follow if automount fails to mount on Linux....................................................................... 385
Steps to follow if automount fails to mount on AIX.......................................................................... 386
Remote file system does not mount....................................................................................................... 387
Remote file system I/O fails with the “Function not implemented” error message when UID mapping is enabled.....................................................................387
Remote file system does not mount due to differing GPFS cluster security configurations........... 388
Cannot resolve contact node address............................................................................................... 388

The remote cluster name does not match the cluster name supplied by the mmremotecluster command..............................................................................389
Contact nodes down or GPFS down on contact nodes..................................................................... 389
GPFS is not running on the local node...............................................................................................390
The NSD disk does not have an NSD server specified, and the mounting cluster does not have direct access to the disks................................................................. 390
The cipherList option has not been set properly...............................................................................390
Remote mounts fail with the "permission denied" error message...................................................391
Unable to determine whether a file system is mounted........................................................................ 391
GPFS error messages for file system mount status.......................................................................... 391
Multiple file system manager failures..................................................................................................... 391
GPFS error messages for multiple file system manager failures......................................................392
Error numbers specific to GPFS application calls when file system manager appointment fails... 392
Discrepancy between GPFS configuration data and the on-disk data for a file system........................392
Errors associated with storage pools, filesets and policies................................................................... 393
A NO_SPACE error occurs when a file system is known to have adequate free space.................... 393
Negative values occur in the 'predicted pool utilizations', when some files are 'ill-placed'............ 395
Policies - usage errors........................................................................................................................395
Errors encountered with policies.......................................................................................................396
Filesets - usage errors....................................................................................................................... 397
Errors encountered with filesets....................................................................................................... 398
Storage pools - usage errors..............................................................................................................398
Errors encountered with storage pools............................................................................................. 399
Snapshot problems..................................................................................................................................400
Problems with locating a snapshot....................................................................................................400
Problems not directly related to snapshots...................................................................................... 400
Snapshot usage errors....................................................................................................................... 400
Snapshot status errors.......................................................................................................................401
Snapshot directory name conflicts....................................................................................................402
Errors encountered when restoring a snapshot................................................................................ 402
Failures using the mmbackup command................................................................................................ 403
GPFS error messages for mmbackup errors..................................................................................... 403
IBM Storage Protect error messages................................................................................................ 403
Data integrity........................................................................................................................................... 404
Error numbers specific to GPFS application calls when data integrity may be corrupted...............404
Messages requeuing in AFM....................................................................................................................404
NFSv4 ACL problems............................................................................................................................... 405

Chapter 25. Disk issues......................................................................................407


NSD and underlying disk subsystem failures..........................................................................................407
Error encountered while creating and using NSD disks.................................................................... 407
Displaying NSD information............................................................................................................... 408
Disk device name is an existing NSD name....................................................................................... 410
GPFS has declared NSDs as down.....................................................................................................410
Unable to access disks.......................................................................................................................411
Guarding against disk failures........................................................................................................... 412
Disk connectivity failure and recovery...............................................................................................412
Partial disk failure.............................................................................................................................. 413
GPFS has declared NSDs built on top of AIX logical volumes as down................................................. 413
Verify whether the logical volumes are properly defined................................................................. 413
Check the volume group on each node............................................................................................. 414
Volume group varyon problems.........................................................................................................414
Disk accessing commands fail to complete due to problems with some non-IBM disks..................... 415
Disk media failure.................................................................................................................................... 415
Replica mismatches........................................................................................................................... 416
Replicated metadata and data...........................................................................................................422
Replicated metadata only.................................................................................................................. 423

Strict replication.................................................................................................................................423
No replication..................................................................................................................................... 423
GPFS error messages for disk media failures................................................................................... 424
Error numbers specific to GPFS application calls when disk failure occurs.................................... 424
Persistent Reserve errors........................................................................................................................ 425
Understanding Persistent Reserve.................................................................................................... 425
Checking Persistent Reserve............................................................................................................. 426
Clearing a leftover Persistent Reserve reservation........................................................................... 426
Manually enabling or disabling Persistent Reserve.......................................................................... 428
GPFS is not using the underlying multipath device................................................................................ 428
Kernel panics with the message "GPFS deadman switch timer has expired and there are still outstanding I/O requests"..............................................................429

Chapter 26. GPUDirect Storage troubleshooting................................................. 431

Chapter 27. Security issues................................................................................435


Encryption issues.....................................................................................................................................435
Unable to add encryption policies .................................................................................................... 435
Receiving "Permission denied" message.......................................................................................... 435
"Value too large" failure when creating a file.................................................................................... 435
Mount failure for a file system with encryption rules........................................................................435
"Permission denied" failure of key rewrap........................................................................................ 436
Authentication issues.............................................................................................................................. 436
File protocol authentication setup issues......................................................................................... 436
Protocol authentication issues.......................................................................................................... 436
Authentication error events............................................................................................................... 437
Nameserver issues related to AD authentication............................................................................. 438
Authorization issues................................................................................................................................ 438
The IBM Security Lifecycle Manager prerequisites cannot be installed................................................ 440
IBM Security Lifecycle Manager cannot be installed .............................................................................440

Chapter 28. Protocol issues................................................................................443


NFS issues................................................................................................................................................443
CES NFS failure due to network failure .............................................................................................443
NFS client with stale inode data........................................................................................................ 443
NFS mount issues.............................................................................................................................. 444
NFS error events.................................................................................................................................447
NFS error scenarios............................................................................................................................449
Collecting diagnostic data for NFS.....................................................................................................450
NFS startup warnings.........................................................................................................................451
Customizing failover behavior for unresponsive NFS....................................................................... 452
SMB issues...............................................................................................................................................452
Determining the health of integrated SMB server.............................................................................452
File access failure from an SMB client with sharing conflict.............................................................454
SMB client on Linux fails with an "NT status logon failure".............................................................. 455
SMB client on Linux fails with the NT status password must change error message.......... 456
SMB mount issues..............................................................................................................................456
Net use on Windows fails with "System error 86" ........................................................................... 457
Net use on Windows fails with "System error 59" for some users................................................... 457
Winbindd causes high CPU utilization............................................................................................... 458
SMB error events................................................................................................................................458
SMB access issues ............................................................................................................................ 459
Slow access to SMB caused by contended access to files or directories ........................................460
CTDB issues........................................................................................................................................461
smbd start issue.................................................................................................................................462
Object issues............................................................................................................................................462
Getting started with troubleshooting object issues ......................................................................... 463

Authenticating the object service...................................................................................................... 464
Authenticating or using the object service........................................................................................ 464
Accessing resources.......................................................................................................................... 465
Connecting to the object services..................................................................................................... 465
Creating a path................................................................................................................................... 466
Constraints for creating objects and containers............................................................................... 466
The Bind password is used when the object authentication configuration has expired..................467
The password used for running the keystone command has expired or is incorrect.......................467
The LDAP server is not reachable......................................................................................................468
The TLS certificate has expired..........................................................................................................468
The TLS CACERT certificate has expired........................................................................................... 469
The TLS certificate on the LDAP server has expired......................................................................... 469
The SSL certificate has expired..........................................................................................................470
Users are not listed in the OpenStack user list................................................................................. 470
The error code signature does not match when using the S3 protocol............................................471
The swift-object-info output does not display........................................................................ 471
Swift PUT returns the 202 error and S3 PUT returns the 500 error due to the missing time synchronization................................................................... 472
Unable to generate the accurate container listing by performing a GET operation for unified file and object access container............................................................. 473
Fatal error in object configuration during deployment..................................................................... 473
Object authentication configuration fatal error during deployment.................................................474
Unrecoverable error in object authentication during deployment .................................................. 474

Chapter 29. Disaster recovery issues.................................................................. 477


Disaster recovery setup problems.......................................................................................................... 477
Other problems with disaster recovery...................................................................................................478

Chapter 30. Performance issues.........................................................................479


Issues caused by the low-level system components.............................................................................479
Suboptimal performance due to high utilization of the system level components .........................479
Suboptimal performance due to long IBM Storage Scale waiters....................................................479
Suboptimal performance due to networking issues caused by faulty system components ...........480
Issues caused by the suboptimal setup or configuration of the IBM Storage Scale cluster.................481
Suboptimal performance due to unbalanced architecture and improper system level settings.... 481
Suboptimal performance due to low values assigned to IBM Storage Scale configuration parameters...................................................................... 481
Suboptimal performance due to new nodes with default parameter values added to the cluster........................................................................ 482
Suboptimal performance due to low value assigned to QoSIO operation classes.......................... 483
Suboptimal performance due to improper mapping of the file system NSDs to the NSD servers.. 484
Suboptimal performance due to incompatible file system block allocation type............................485
Issues caused by the unhealthy state of the components used............................................................486
Suboptimal performance due to failover of NSDs to secondary server - NSD server failure.......... 486
Suboptimal performance due to failover of NSDs to secondary server - Disk connectivity failure......................................................................... 487
Suboptimal performance due to file system being fully utilized...................................................... 488
Suboptimal performance due to VERBS RDMA being inactive ........................................................ 490
Issues caused by the use of configurations or commands related to maintenance and operation..... 492
Suboptimal performance due to maintenance commands in progress........................................... 492
Suboptimal performance due to frequent invocation or execution of maintenance commands.... 493
Suboptimal performance when a tracing is active on a cluster........................................................494
Suboptimal performance due to replication settings being set to 2 or 3.........................................494
Suboptimal performance due to updates made on a file system or fileset with snapshot .............495
Delays and deadlocks..............................................................................................................................496
Failures using the mmpmon command...................................................................................................498
Setup problems using mmpmon....................................................................................................... 498
Incorrect output from mmpmon........................................................................................................498
Abnormal termination or hang in mmpmon...................................................................................... 498

Tracing the mmpmon command........................................................................................................499

Chapter 31. GUI and monitoring issues...............................................................501


GUI fails to start.......................................................................................................................................501
GUI fails to restart after upgrade of all other GPFS packages on SLES 15 SP3.................................... 502
GUI fails to start after manual installation or upgrade on Ubuntu nodes..............................................503
GUI login page does not open................................................................................................................. 503
GUI performance monitoring issues....................................................................................................... 504
GUI is showing “Server was unable to process the request” error........................................................506
GUI is displaying outdated information.................................................................................................. 506
Capacity information is not available in GUI pages................................................................................ 509
GUI automatically logs off the users when using Google Chrome or Mozilla Firefox............................510

Chapter 32. AFM issues......................................................................................511

Chapter 33. AFM DR issues................................................................................ 515

Chapter 34. AFM to cloud object storage issues.................................................. 517

Chapter 35. Transparent cloud tiering issues...................................................... 519

Chapter 36. File audit logging issues.................................................................. 523


Failure of mmaudit because of the file system level.............................................................................. 523
JSON reporting issues in file audit logging............................................................................................. 523
File audit logging issues with remotely mounted file systems...............................................................524
Audit fileset creation issues when enabling file audit logging............................................................... 524
Failure to append messages to Buffer Pool............................................................................................ 524

Chapter 37. Cloudkit issues................................................................................525


Cloudkit issues on AWS........................................................................................................................... 525
Cloudkit issues on GCP............................................................................................................................527

Chapter 38. Troubleshooting mmwatch................................................................531

Chapter 39. Maintenance procedures................................................................. 533


Directed maintenance procedures available in the GUI.........................................................................533
Start NSD............................................................................................................................................ 533
Start GPFS daemon............................................................................................................................ 534
Increase fileset space........................................................................................................................ 534
Synchronize node clocks....................................................................................................................534
Start performance monitoring collector service............................................................................... 535
Start performance monitoring sensor service...................................................................................535
Activate AFM performance monitoring sensors................................................................................536
Activate NFS performance monitoring sensors................................................................................ 536
Activate SMB performance monitoring sensors................................................................................536
Configure NFS sensors.......................................................................................................................537
Configure SMB sensors...................................................................................................................... 537
Mount file system if it must be mounted........................................................................................... 538
Start the GUI service on the remote nodes.......................................................................................538
Repair a failed GUI refresh task.........................................................................................................539
Directed maintenance procedures for tip events................................................................................... 539

Chapter 40. Recovery procedures.......................................................................543


Restoring data and system configuration............................................................................................... 543
Automatic recovery..................................................................................................................................543
Upgrade recovery.....................................................................................................................................544

Recovering cluster configuration by using CCR...................................................................................... 545
Recovering from a single quorum or non-quorum node failure........................................................545
Recovering from the loss of a majority of quorum nodes................................................................. 546
Recovering from damage or loss of the CCR on all quorum nodes...................................................549
Recovering from an existing CCR backup.......................................................................................... 551
Repair of cluster configuration information when no CCR backup is available..................................... 552
Repair of cluster configuration information when no CCR backup information is available:
mmsdrrestore command........................................................................................................... 553

Chapter 41. Support for troubleshooting.............................................................555


Contacting IBM support center............................................................................................................... 555
Information to be collected before contacting the IBM Support Center......................................... 555
How to contact the IBM Support Center........................................................................................... 557
Call home notifications to IBM Support.................................................................................................. 558

Chapter 42. References......................................................................................559


Events.......................................................................................................................................................559
AFM events......................................................................................................................................... 559
Authentication events........................................................................................................................ 565
Call Home events............................................................................................................................... 570
CES network events........................................................................................................................... 573
CESIP events...................................................................................................................................... 578
Cluster state events........................................................................................................................... 578
Disk events......................................................................................................................................... 582
Enclosure events................................................................................................................................ 584
Encryption events...............................................................................................................................593
File Audit Logging events................................................................................................................... 597
File system events..............................................................................................................................602
File system manager events.............................................................................................................. 613
GDS events......................................................................................................................................... 615
GPFS events....................................................................................................................................... 615
GUI events..........................................................................................................................................634
Hadoop connector events..................................................................................................................645
HDFS data node events......................................................................................................................645
HDFS name node events....................................................................................................................646
Keystone events................................................................................................................................. 648
Local Cache events.............................................................................................................................650
Network events.................................................................................................................................. 652
NFS events..........................................................................................................................................660
NVMe events...................................................................................................................................... 671
NVMeoF events.................................................................................................................................. 674
Object events..................................................................................................................................... 675
Performance events........................................................................................................................... 685
Server RAID events............................................................................................................................ 687
SMB events.........................................................................................................................................687
Stretch cluster events........................................................................................................................ 691
Transparent cloud tiering events....................................................................................................... 693
Threshold events................................................................................................................................708
Watchfolder events............................................................................................................................ 710
Transparent cloud tiering status description..........................................................................................713
Cloud services audit events.....................................................................................................................726
Messages................................................................................................................................................. 728
Message severity tags........................................................................................................................ 728

Accessibility features for IBM Storage Scale....................................................... 923


Accessibility features.............................................................................................................................. 923
Keyboard navigation................................................................................................................................ 923

IBM and accessibility...............................................................................................................................923

Notices..............................................................................................................925
Trademarks.............................................................................................................................................. 926
Terms and conditions for product documentation................................................................................. 926

Glossary............................................................................................................ 929

Index................................................................................................................ 937

Tables

1. IBM Storage Scale library information units............................................................................................. xxii

2. Conventions................................................................................................................................................. xli

3. System health monitoring options that are available in IBM Storage Scale GUI........................................ 1

4. Notification levels.......................................................................................................................................... 3

5. Notification levels.......................................................................................................................................... 4

6. SNMP objects included in event notifications.............................................................................................. 5

7. SNMP OID ranges.......................................................................................................................................... 6

8. Threshold rule configuration - A sample scenario..................................................................................... 10

9. Input requests to the mmpmon command................................................................................................ 60

10. Keywords and values for the mmpmon fs_io_s response....................................................................... 63

11. Keywords and values for the mmpmon io_s response............................................................................ 65

12. nlist requests for the mmpmon command...............................................................................................67

13. Keywords and values for the mmpmon nlist add response.....................................................................68

14. Keywords and values for the mmpmon nlist del response......................................................................69

15. Keywords and values for the mmpmon nlist new response.................................................................... 70

16. Keywords and values for the mmpmon nlist s response......................................................................... 70

17. Keywords and values for the mmpmon nlist failures...............................................................................74

18. Keywords and values for the mmpmon reset response.......................................................................... 75

19. rhist requests for the mmpmon command...............................................................................................76

20. Keywords and values for the mmpmon rhist nr response....................................................................... 79

21. Keywords and values for the mmpmon rhist off response...................................................................... 81

22. Keywords and values for the mmpmon rhist on response...................................................................... 82

23. Keywords and values for the mmpmon rhist p response........................................................................ 83

24. Keywords and values for the mmpmon rhist reset response.................................................................. 85

25. Keywords and values for the mmpmon rhist s response.........................................................................86

26. rpc_s requests for the mmpmon command............................................................................................. 88

27. Keywords and values for the mmpmon rpc_s response..........................................................................89

28. Keywords and values for the mmpmon rpc_s size response...................................................................91

29. Keywords and values for the mmpmon ver response..............................................................................93

30. Sensor with high impacts........................................................................................................................112

31. Resource types and the sensors responsible for them......................................................................... 155

32. Measurements........................................................................................................................................ 156

33. Performance monitoring options available in IBM Storage Scale GUI..................................................158

34. Sensors available for each resource type.............................................................................................. 163

35. Sensors available to capture capacity details........................................................................................164

36. AFM states and their description............................................................................................................185

37. AFM DR states and their description......................................................................................................190

38. List of events that can be added using mmaddcallback........................................................................195

39. Field description of the example............................................................................................................ 196

40. Attributes with their description............................................................................................................ 199

41. AFM to cloud object storage states and their description..................................................................... 205

42. gpfsClusterStatusTable: Cluster status information.............................................................................. 219

43. gpfsClusterConfigTable: Cluster configuration information.................................................................. 219

44. gpfsNodeStatusTable: Node status information.................................................................................... 220

45. gpfsNodeConfigTable: Node configuration information........................................................................ 220

46. gpfsFileSystemStatusTable: File system status information.................................................................221

47. gpfsFileSystemPerfTable: File system performance information......................................................... 222

48. GPFS storage pool information...............................................................................................................223

49. gpfsDiskStatusTable: Disk status information....................................................................................... 223

50. gpfsDiskConfigTable: Disk configuration information............................................................................224

51. gpfsDiskPerfTable: Disk performance information................................................................................224

52. Net-SNMP traps...................................................................................................................................... 225

53. Remote cluster monitoring options available in GUI............................................................................. 229

54. IBM websites for help, services, and information................................................................................. 247

55. Core object log files in /var/log/swift..................................................................................................... 267

56. Extra object log files in /var/log/swift.................................................................................................... 269

57. General system log files in /var/adm/ras............................................................................................... 269

58. Authentication log files........................................................................................................................... 270

59. AFM logged errors...................................................................................................................................279

60. IBM Security Lifecycle Manager preinstallation checklist..................................................................... 441

61. CES NFS log levels.................................................................................................................................. 450

62. Sensors available for each resource type.............................................................................................. 504

63. GUI refresh tasks.................................................................................................................................... 507

64. Troubleshooting details for capacity data display issues in GUI...........................................................509

65. Common questions in AFM with their resolution...................................................................................511

66. Common questions in AFM DR with their resolution............................................................................. 515

67. Common questions in AFM to cloud object storage with their resolution............................................ 517

68. DMPs....................................................................................................................................................... 533

69. NFS sensor configuration example........................................................................................................ 537

70. SMB sensor configuration example........................................................................................................538

71. Tip events list.......................................................................................................................................... 540

72. Events for the AFM component.............................................................................................................. 559

73. Events for the Auth component..............................................................................................................565

74. Events for the Callhome component...................................................................................................... 570

75. Events for the CES network component.................................................................................................573

76. Events for the CESIP component........................................................................................................... 578

77. Events for the cluster state component................................................................................................. 578

78. Events for the Disk component.............................................................................................................. 582

79. Events for the enclosure component..................................................................................................... 584

80. Events for the Encryption component....................................................................................................593

81. Events for the File Audit Logging component........................................................................................ 597

82. Events for the file system component....................................................................................................602

83. Events for the file system manager component.................................................................................... 613

84. Events for the GDS component.............................................................................................................. 615

85. Events for the GPFS component.............................................................................................................615

86. Events for the GUI component............................................................................................................... 634

87. Events for the Hadoop connector component....................................................................................... 645

88. Events for the HDFS data node component........................................................................................... 645

89. Events for the HDFS name node component......................................................................................... 646

90. Events for the Keystone component...................................................................................................... 648

91. Events for the Local cache component.................................................................................................. 650

92. Events for the network component........................................................................................................ 652

93. Events for the NFS component...............................................................................................................660

94. Events for the NVMe component............................................................................................................671

95. Events for the NVMeoF component........................................................................................................674

96. Events for the object component........................................................................................................... 676

97. Events for the Performance component................................................................................................ 685

98. Events for the Server RAID component................................................................................................. 687

99. Events for the SMB component.............................................................................................................. 687

100. Events for the Stretch cluster component........................................................................................... 691

101. Events for the Transparent cloud tiering component.......................................................................... 693

102. Events for the threshold component....................................................................................................708

103. Events for the Watchfolder component............................................................................................... 710

104. Cloud services status description........................................................................................................ 713

105. Audit events.......................................................................................................................................... 727

106. Message severity tags ordered by priority........................................................................................... 729

About this information
This edition applies to IBM Storage Scale version 5.1.9 for AIX®, Linux®, and Windows.
IBM Storage Scale is a file management infrastructure, based on IBM General Parallel File System (GPFS)
technology, which provides unmatched performance and reliability with scalable access to critical file
data.
To find out which version of IBM Storage Scale is running on a particular AIX node, enter:

lslpp -l gpfs\*

To find out which version of IBM Storage Scale is running on a particular Linux node, enter:

rpm -qa | grep gpfs (for SLES and Red Hat Enterprise Linux)

dpkg -l | grep gpfs (for Ubuntu Linux)

To find out which version of IBM Storage Scale is running on a particular Windows node, open Programs
and Features in the control panel. The IBM Storage Scale installed program name includes the version
number.

Which IBM Storage Scale information unit provides the information you need?
The IBM Storage Scale library consists of the information units listed in Table 1 on page xxii.
To use these information units effectively, you must be familiar with IBM Storage Scale and the AIX,
Linux, or Windows operating system, or all of them, depending on which operating systems are in use at
your installation. Where necessary, these information units provide some background information relating
to AIX, Linux, or Windows. However, more commonly they refer to the appropriate operating system
documentation.
Note: Throughout this documentation, the term "Linux" refers to all supported distributions of Linux,
unless otherwise specified.



Table 1. IBM Storage Scale library information units
Information unit: IBM Storage Scale: Concepts, Planning, and Installation Guide

Intended users: System administrators, analysts, installers, planners, and programmers of IBM Storage Scale clusters who are very experienced with the operating systems on which each IBM Storage Scale cluster is based.

Type of information: This guide provides the following information:

Product overview
• Overview of IBM Storage Scale
• GPFS architecture
• Protocols support overview:
Integration of protocol access
methods with GPFS
• Active File Management
• AFM-based Asynchronous
Disaster Recovery (AFM DR)
• Introduction to AFM to cloud
object storage
• Introduction to system health and
troubleshooting
• Introduction to performance
monitoring
• Data protection and disaster
recovery in IBM Storage Scale
• Introduction to IBM Storage Scale
GUI
• IBM Storage Scale management
API
• Introduction to Cloud services
• Introduction to file audit logging
• Introduction to clustered watch
folder
• Understanding call home
• IBM Storage Scale in an
OpenStack cloud deployment
• IBM Storage Scale product
editions
• IBM Storage Scale license
designation
• Capacity-based licensing
• Dynamic pagepool



Planning
• Planning for GPFS
• Planning for protocols
• Planning for cloud services
• Planning for IBM Storage Scale on
Public Clouds
• Planning for AFM
• Planning for AFM DR
• Planning for AFM to cloud object
storage
• Planning for performance
monitoring tool
• Planning for UEFI secure boot

IBM Storage Scale: • Firewall recommendations


Concepts, Planning, and
• Considerations for GPFS
Installation Guide
applications
• Security-Enhanced Linux support
• Space requirements for call home
data upload



Installing
• Steps for establishing and starting your IBM Storage Scale cluster
• Installing IBM Storage Scale on Linux nodes and deploying protocols
• Installing IBM Storage Scale on
public cloud by using cloudkit
• Installing IBM Storage Scale on
AIX nodes
• Installing IBM Storage Scale on
Windows nodes
• Installing Cloud services on IBM
Storage Scale nodes
• Installing and configuring IBM
Storage Scale management API
• Installing GPUDirect Storage for
IBM Storage Scale
• Installation of Active File
Management (AFM)
• Installing AFM Disaster Recovery
• Installing call home
• Installing file audit logging
• Installing clustered watch folder
• Installing the signed kernel
modules for UEFI secure boot
• Steps to permanently uninstall
IBM Storage Scale
Upgrading
• IBM Storage Scale supported
upgrade paths
• Online upgrade support for
protocols and performance
monitoring
• Upgrading IBM Storage Scale
nodes



• Upgrading IBM Storage Scale non-protocol Linux nodes
• Upgrading IBM Storage Scale protocol nodes
• Upgrading IBM Storage Scale on cloud
• Upgrading GPUDirect Storage
• Upgrading AFM and AFM DR
• Upgrading object packages
• Upgrading SMB packages
• Upgrading NFS packages
• Upgrading call home
• Upgrading the performance
monitoring tool
• Upgrading signed kernel modules
for UEFI secure boot
• Manually upgrading pmswift
• Manually upgrading the IBM
Storage Scale management GUI
• Upgrading Cloud services
• Upgrading to IBM Cloud Object
Storage software level 3.7.2 and
above
• Upgrade paths and commands for
file audit logging and clustered
watch folder
• Upgrading IBM Storage Scale
components with the installation
toolkit
• Protocol authentication
configuration changes during
upgrade
• Changing the IBM Storage Scale
product edition
• Completing the upgrade to a new
level of IBM Storage Scale
• Reverting to the previous level of
IBM Storage Scale



• Coexistence considerations
• Compatibility considerations
• Considerations for IBM Storage
Protect for Space Management
• Applying maintenance to your
IBM Storage Scale system
• Guidance for upgrading the
operating system on IBM Storage
Scale nodes
• Considerations for upgrading
from an operating system not
supported in IBM Storage Scale
5.1.x.x
• Servicing IBM Storage Scale
protocol nodes
• Offline upgrade with complete
cluster shutdown



Information unit: IBM Storage Scale: Administration Guide

Intended users: System administrators or programmers of IBM Storage Scale systems.

Type of information: This guide provides the following information:

Configuring
• Configuring the GPFS cluster
• Configuring GPUDirect Storage for
IBM Storage Scale
• Configuring the CES and protocol
configuration
• Configuring and tuning your
system for GPFS
• Parameters for performance
tuning and optimization
• Ensuring high availability of the
GUI service
• Configuring and tuning your
system for Cloud services
• Configuring IBM Power Systems
for IBM Storage Scale
• Configuring file audit logging
• Configuring clustered watch
folder
• Configuring the cloudkit
• Configuring Active File
Management
• Configuring AFM-based DR
• Configuring AFM to cloud object
storage
• Tuning for Kernel NFS backend on
AFM and AFM DR
• Configuring call home
• Integrating IBM Storage Scale
Cinder driver with Red Hat
OpenStack Platform 16.1
• Configuring Multi-Rail over TCP
(MROT)
• Dynamic pagepool configuration



Administering
• Performing GPFS administration tasks
• Performing parallel copy with mmxcp command
• Protecting file data: IBM Storage
Scale safeguarded copy
• Verifying network operation with
the mmnetverify command
• Managing file systems
• File system format changes
between versions of IBM Storage
Scale
• Managing disks



• Managing protocol services
• Managing protocol user authentication
• Managing protocol data exports
• Managing object storage
• Managing GPFS quotas
• Managing GUI users
• Managing GPFS access control
lists
• Native NFS and GPFS
• Accessing a remote GPFS file
system
• Information lifecycle
management for IBM Storage
Scale
• Creating and maintaining
snapshots of file systems
• Creating and managing file clones
• Scale Out Backup and Restore
(SOBAR)
• Data Mirroring and Replication
• Implementing a clustered NFS
environment on Linux
• Implementing Cluster Export
Services
• Identity management on
Windows / RFC 2307 Attributes
• Protocols cluster disaster
recovery
• File Placement Optimizer
• Encryption
• Managing certificates to secure
communications between GUI
web server and web browsers
• Securing protocol data
• Cloud services: Transparent cloud
tiering and Cloud data sharing
• Managing file audit logging
• RDMA tuning
• Configuring Mellanox Memory
Translation Table (MTT) for GPFS
RDMA VERBS Operation
• Administering cloudkit
• Administering AFM
• Administering AFM DR



• Administering AFM to cloud object storage
• Highly available write cache (HAWC)
• Local read-only cache
• Miscellaneous advanced
administration topics
• GUI limitations



Information unit: IBM Storage Scale: Problem Determination Guide

Intended users: System administrators of GPFS systems who are experienced with the subsystems used to manage disks and who are familiar with the concepts presented in the IBM Storage Scale: Concepts, Planning, and Installation Guide.

Type of information: This guide provides the following information:

Monitoring
• Monitoring system health by using IBM Storage Scale GUI
• Monitoring system health by using the mmhealth command
• Dynamic pagepool monitoring
• Performance monitoring
• Monitoring GPUDirect storage
• Monitoring events through
callbacks
• Monitoring capacity through GUI
• Monitoring AFM and AFM DR
• Monitoring AFM to cloud object
storage
• GPFS SNMP support
• Monitoring the IBM Storage Scale
system by using call home
• Monitoring remote cluster through
GUI
• Monitoring file audit logging
• Monitoring clustered watch folder
• Monitoring local read-only cache
Troubleshooting
• Best practices for troubleshooting
• Understanding the system
limitations
• Collecting details of the issues
• Managing deadlocks
• Installation and configuration
issues
• Upgrade issues
• CCR issues
• Network issues
• File system issues
• Disk issues
• GPUDirect Storage
troubleshooting
• Security issues
• Protocol issues
• Disaster recovery issues
• Performance issues



• GUI and monitoring issues
• AFM issues
• AFM DR issues
• AFM to cloud object storage
issues
• Transparent cloud tiering issues
• File audit logging issues
• Cloudkit issues
• Troubleshooting mmwatch
• Maintenance procedures
• Recovery procedures
• Support for troubleshooting
• References



Information unit: IBM Storage Scale: Command and Programming Reference Guide

Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Type of information: This guide provides the following information:

Command reference
• cloudkit command
• gpfs.snap command
• mmaddcallback command
• mmadddisk command
• mmaddnode command
• mmadquery command
• mmafmconfig command
• mmafmcosaccess command
• mmafmcosconfig command
• mmafmcosctl command
• mmafmcoskeys command
• mmafmctl command
• mmafmlocal command
• mmapplypolicy command
• mmaudit command
• mmauth command
• mmbackup command
• mmbackupconfig command
• mmbuildgpl command
• mmcachectl command
• mmcallhome command
• mmces command
• mmchattr command
• mmchcluster command
• mmchconfig command
• mmchdisk command
• mmcheckquota command
• mmchfileset command
• mmchfs command
• mmchlicense command
• mmchmgr command
• mmchnode command
• mmchnodeclass command
• mmchnsd command
• mmchpolicy command
• mmchpool command
• mmchqos command
• mmclidecode command



• mmclone command
• mmcloudgateway command
• mmcrcluster command
• mmcrfileset command
• mmcrfs command
• mmcrnodeclass command

• mmcrnsd command
• mmcrsnapshot command
• mmdefedquota command
• mmdefquotaoff command
• mmdefquotaon command
• mmdefragfs command
• mmdelacl command
• mmdelcallback command
• mmdeldisk command
• mmdelfileset command
• mmdelfs command
• mmdelnode command
• mmdelnodeclass command
• mmdelnsd command
• mmdelsnapshot command
• mmdf command
• mmdiag command
• mmdsh command
• mmeditacl command
• mmedquota command
• mmexportfs command
• mmfsck command
• mmfsckx command
• mmfsctl command
• mmgetacl command
• mmgetstate command
• mmhadoopctl command
• mmhdfs command
• mmhealth command
• mmimgbackup command
• mmimgrestore command
• mmimportfs command
• mmkeyserv command



• mmlinkfileset command
• mmlsattr command
• mmlscallback command
• mmlscluster command
• mmlsconfig command
• mmlsdisk command

• mmlsfileset command
• mmlsfs command
• mmlslicense command
• mmlsmgr command
• mmlsmount command
• mmlsnodeclass command
• mmlsnsd command
• mmlspolicy command
• mmlspool command
• mmlsqos command
• mmlsquota command
• mmlssnapshot command
• mmmigratefs command
• mmmount command
• mmnetverify command
• mmnfs command
• mmnsddiscover command
• mmobj command
• mmperfmon command
• mmpmon command
• mmprotocoltrace command
• mmpsnap command
• mmputacl command
• mmqos command
• mmquotaoff command
• mmquotaon command
• mmreclaimspace command
• mmremotecluster command
• mmremotefs command
• mmrepquota command
• mmrestoreconfig command
• mmrestorefs command
• mmrestrictedctl command
• mmrestripefile command



• mmrestripefs command
• mmrpldisk command
• mmsdrrestore command
• mmsetquota command
• mmshutdown command
• mmsmb command

• mmsnapdir command
• mmstartup command
• mmstartpolicy command
• mmtracectl command
• mmumount command
• mmunlinkfileset command
• mmuserauth command
• mmwatch command
• mmwinservctl command
• mmxcp command
• spectrumscale command
Programming reference
• IBM Storage Scale Data
Management API for GPFS
information
• GPFS programming interfaces
• GPFS user exits
• IBM Storage Scale management
API endpoints
• Considerations for GPFS
applications



Information unit: IBM Storage Scale: Big Data and Analytics Guide

Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Type of information: This guide provides the following information:

Summary of changes

Big data and analytics support

Hadoop Scale Storage Architecture
• Elastic Storage Server
• Erasure Code Edition
• Share Storage (SAN-based
storage)
• File Placement Optimizer (FPO)
• Deployment model
• Additional supported storage
features
IBM Spectrum® Scale support for
Hadoop
• HDFS transparency overview
• Supported IBM Storage Scale
storage modes
• Hadoop cluster planning
• CES HDFS
• Non-CES HDFS
• Security
• Advanced features
• Hadoop distribution support
• Limitations and differences from
native HDFS
• Problem determination
IBM Storage Scale Hadoop
performance tuning guide
• Overview
• Performance overview
• Hadoop Performance Planning
over IBM Storage Scale
• Performance guide



Cloudera Data Platform (CDP) Private Cloud Base
• Overview
• Planning
• Installing
• Configuring
• Administering
• Monitoring
• Upgrading
• Limitations
• Problem determination

Cloudera HDP 3.X
• Planning
• Installation
• Upgrading and uninstallation
• Configuration
• Administration
• Limitations
• Problem determination
Open Source Apache Hadoop
• Open Source Apache Hadoop
without CES HDFS
• Open Source Apache Hadoop with
CES HDFS



Information unit: IBM Storage Scale Erasure Code Edition Guide

Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Type of information: IBM Storage Scale Erasure Code Edition
• Summary of changes
• Introduction to IBM Storage Scale Erasure Code Edition
• Planning for IBM Storage Scale Erasure Code Edition
• Installing IBM Storage Scale
Erasure Code Edition
• Uninstalling IBM Storage Scale
Erasure Code Edition
• Creating an IBM Storage Scale
Erasure Code Edition storage
environment
• Using IBM Storage Scale Erasure
Code Edition for data mirroring
and replication
• Deploying IBM Storage Scale
Erasure Code Edition on VMware
infrastructure
• Upgrading IBM Storage Scale
Erasure Code Edition
• Incorporating IBM Storage Scale
Erasure Code Edition in an Elastic
Storage Server (ESS) cluster
• Incorporating IBM Elastic Storage
Server (ESS) building block in an
IBM Storage Scale Erasure Code
Edition cluster
• Administering IBM Storage Scale
Erasure Code Edition
• Troubleshooting
• IBM Storage Scale RAID
Administration



Information unit: IBM Storage Scale Container Native Storage Access

Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Type of information: This guide provides the following information:
• Overview
• Planning
• Installation prerequisites
• Installing the IBM Storage Scale container native operator and cluster
• Upgrading
• Configuring IBM Storage Scale
Container Storage Interface (CSI)
driver
• Using IBM Storage Scale GUI
• Maintenance of a deployed cluster
• Cleaning up the container native
cluster
• Monitoring
• Troubleshooting
• References

Information unit: IBM Storage Scale Data Access Service

Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Type of information: This guide provides the following information:
• Overview
• Architecture
• Security
• Planning
• Installing and configuring
• Upgrading
• Administering
• Monitoring
• Collecting data for support
• Troubleshooting
• The mmdas command
• REST APIs



Information unit: IBM Storage Scale Container Storage Interface Driver Guide

Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Type of information: This guide provides the following information:
• Summary of changes
• Introduction
• Planning
• Installation
• Upgrading
• Configurations
• Using IBM Storage Scale
Container Storage Interface Driver
• Managing IBM Storage Scale
when used with IBM Storage
Scale Container Storage Interface
driver
• Cleanup
• Limitations
• Troubleshooting

Prerequisite and related information


For updates to this information, see IBM Storage Scale in IBM Documentation.
For the latest support information, see the IBM Storage Scale FAQ in IBM Documentation.

Conventions used in this information


Table 2 on page xli describes the typographic conventions used in this information. UNIX file name
conventions are used throughout this information.
Note: Users of IBM Storage Scale for Windows must be aware that on Windows, UNIX-style
file names need to be converted appropriately. For example, the GPFS cluster configuration data
is stored in the /var/mmfs/gen/mmsdrfs file. On Windows, the UNIX namespace starts under
the %SystemDrive%\cygwin64 directory, so the GPFS cluster configuration data is stored in the
C:\cygwin64\var\mmfs\gen\mmsdrfs file.

Table 2. Conventions
Convention Usage
bold Bold words or characters represent system elements that you must use literally,
such as commands, flags, values, and selected menu options.
Depending on the context, bold typeface sometimes represents path names,
directories, or file names.

bold underlined Bold underlined keywords are defaults. These take effect if you do not specify a different keyword.



constant width Examples and information that the system displays appear in constant-width
typeface.
Depending on the context, constant-width typeface sometimes represents path
names, directories, or file names.

italic Italic words or characters represent variable values that you must supply.
Italics are also used for information unit titles, for the first use of a glossary term,
and for general emphasis in text.

<key> Angle brackets (less-than and greater-than) enclose the name of a key on the
keyboard. For example, <Enter> refers to the key on your terminal or workstation
that is labeled with the word Enter.
\ In command examples, a backslash indicates that the command or coding example
continues on the next line. For example:

mkcondition -r IBM.FileSystem -e "PercentTotUsed > 90" \
-E "PercentTotUsed < 85" -m p "FileSystem space used"

{item} Braces enclose a list from which you must choose an item in format and syntax
descriptions.
[item] Brackets enclose optional items in format and syntax descriptions.
<Ctrl-x> The notation <Ctrl-x> indicates a control character sequence. For example,
<Ctrl-c> means that you hold down the control key while pressing <c>.
item... Ellipses indicate that you can repeat the preceding item one or more times.
| In synopsis statements, vertical lines separate a list of choices. In other words, a
vertical line means Or.
In the left margin of the document, vertical lines indicate technical changes to the
information.

Note: For CLI options that accept a list of option values, delimit the values with a comma and no space between them. As an example, to display the state on three nodes use mmgetstate -N NodeA,NodeB,NodeC. Exceptions to this syntax are listed specifically within the command.

How to send your comments


Your feedback is important in helping us to produce accurate, high-quality information. If you have any
comments about this information or any other IBM Storage Scale documentation, send your comments to
the following e-mail address:
[email protected]
Include the publication title and order number, and, if applicable, the specific location of the information
about which you have comments (for example, a page number or a table number).
To contact the IBM Storage Scale development organization, send your comments to the following e-mail
address:
[email protected]



Summary of changes
This topic summarizes changes to the IBM Storage Scale licensed program and the IBM Storage Scale
library. Within each information unit in the library, a vertical line (|) to the left of text and illustrations
indicates technical changes or additions that are made to the previous edition of the information.

Summary of changes
for IBM Storage Scale 5.1.9
as updated, February 2024

This release of the IBM Storage Scale licensed program and the IBM Storage Scale library includes the
following improvements. All improvements are available after an upgrade, unless otherwise specified.
• Commands, data types, and programming APIs
• Messages
• Stabilized, deprecated, and discontinued features
AFM and AFM DR-related changes
• AFM DR is supported in a remote destination routing (RDR) environment.
• Added support for the getOutbandList option for out-of-band metadata population for a GPFS backend. For more information, see the mmafmctl command, in the IBM Storage Scale: Command and Programming Reference Guide.
• An AFM online dependent fileset can be created and linked in the AFM DR secondary fileset without stopping the fileset by using the afmOnlineDepFset parameter (a usage sketch follows this list). For more information, see the mmchconfig command, in the IBM Storage Scale: Command and Programming Reference Guide and the Online creation and linking of a dependent fileset in AFM DR section in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
• Added sample tools for the AFM external caching to S3 servers in the following sample directory:

/usr/lpp/mmfs/samples/pcache/
drwxr-xr-x 3 root root 129 Oct 8 11:45 afm-s3-tests
drwxr-xr-x 2 root root 86 Oct 8 11:45 mmafmtransfer-s3-tests

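As a point of reference, the following is a minimal sketch of how the afmOnlineDepFset parameter described above might be enabled and exercised. The yes value, the -i (immediate) flag, and the file system, fileset, and junction path names (fs1, drSecondary, dep1) are illustrative assumptions rather than values taken from this guide; verify the exact scope and accepted values in the mmchconfig, mmcrfileset, and mmlinkfileset command descriptions.

# Enable online creation and linking of dependent filesets in AFM DR
# (parameter value and the -i flag are assumptions; confirm with the mmchconfig documentation)
mmchconfig afmOnlineDepFset=yes -i

# Create a dependent fileset and link it under the AFM DR secondary fileset
# without stopping the fileset (fs1, dep1, and the junction path are hypothetical)
mmcrfileset fs1 dep1
mmlinkfileset fs1 dep1 -J /gpfs/fs1/drSecondary/dep1
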
AFM to Cloud Object Storage


• Added the support for inline migration of a TCT-enabled fileset to an AFM Manual Mode fileset.
Added a new option, checkTCT, to migrate filesets from TCT to AFM to cloud object storage
manual mode fileset. For more information, see mmafmcosctl command, in the IBM Storage
Scale: Command and Programming Reference Guide and the Migration of a transparent cloud tiering-
enabled IBM Storage Scale fileset or file system to an AFM to cloud object storage fileset in the
manual update mode section in IBM Storage Scale: Administration Guide.
• Added the native support of the Microsoft Azure Blob API by using the --azure option of AFM to cloud
object storage. Now, AFM can synchronize data to Azure Blob object storage directly without setting
up an S3 gateway. For more information, see the mmafmcosconfig command in the IBM Storage Scale:
Command and Programming Reference Guide, and the Microsoft Azure Blob support for the AFM to cloud
object storage section in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
• Added the support for CEPH S3 v6 as AFM to cloud object storage backend. From IBM Storage Scale
5.1.9, AFM can synchronize cache data to CEPH S3 cloud object storage.
Big data and analytics changes
For information on changes in IBM Storage Scale Big Data and Analytics support and HDFS protocol,
see Big Data and Analytics - summary of changes.

IBM Storage Scale Container Storage Interface driver changes
For information on changes in the IBM Storage Scale Container Storage Interface driver, see IBM
Storage Scale Container Storage Interface driver - Summary of changes.

IBM Storage Scale Container Native Storage Access changes


For information on changes in the IBM Storage Scale Container Native Storage Access, see IBM
Storage Scale Container Native Storage Access - Summary of changes.

IBM Storage Scale Erasure Code Edition changes


For information on changes in the IBM Storage Scale Erasure Code Edition, see IBM Storage Scale
Erasure Code Edition - Summary of changes.

Cloudkit changes
• Cloudkit adds support for Google Cloud Platform (GCP).
• Cloudkit enhancement to support AWS cluster upgrade.
• Cloudkit enhancement to support for scale out on AWS cluster instances.
Discontinuation of the CES Swift Object protocol feature
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was
provided in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM
Storage Scale provides.
• Please contact IBM for further details and migration planning.
File system core improvements
• The dynamic pagepool feature is now available in IBM Storage Scale. The feature adjusts the size of
the pagepool memory dynamically. For more information, see the Dynamic pagepool section in IBM
Storage Scale: Concepts, Planning, and Installation Guide.
• The GPFSBufMgr sensor has been added to the performance monitoring tool. Issue the mmperfmon
config add command to add sensor to IBM Storage Scale 5.1.9. For more information, see
GPFSBufMgr in the GPFS metrics section, in the IBM Storage Scale: Problem Determination Guide.
• Enhanced node expel logic has been added in IBM Storage Scale. The expel logic addresses the
issue of a single node experiencing communication issues resulting in other nodes being expelled
from the cluster.
• The mmxcp command has been updated:
– The enable option:
- a new parameter, --hardlinks, has been added that executes an additional pass through the
source files searching and copying hardlinked files as a single batch.
- two new attributes for the copy-attrs parameter, appendonly and immutable, have been
added which copies the appendonly and immutable attributes, if present.
– The verify option:
- Two new attributes for the check option, appendonly and immutable, have been added that
compare the appendonly and immutable attributes, if present.

For more information, see the mmxcp command in the IBM Storage Scale: Command and
Programming Reference Guide.
• The MMBACKUP_PROGRESS_CONTENT environment variable has a new value that indicates file
size information should be displayed during the backup. For more information, see the mmbackup
command in the IBM Storage Scale: Command and Programming Reference Guide.
• The mmapplypolicy command has a new option --silent-on-delete that ignores certain type
of errors during the DELETE rule execution. For more information, see the mmapplypolicy command
in the IBM Storage Scale: Command and Programming Reference Guide.
• The mmremotecluster command was updated to include the remote cluster id in the output of the
show action. For more information, see the mmremotecluster command in the IBM Storage Scale:
Command and Programming Reference Guide.
• Using the mmkeyserv update command, you can change the encryption key server's hostname
and IPA. For more information, see the mmkeyserv command in the IBM Storage Scale: Command
and Programming Reference Guide.
• IBM Storage Scale supports signed kernel modules for UEFI secure boot:
IBM Storage Scale 5.1.9 supports UEFI secure boot environments with RHEL 9.2 on x86_64. The
gpfs.bin.rpm holding signed kernel modules and the public key can be downloaded from Fix
Central. For more information, see the Signed kernel modules for UEFI secure boot topic in the IBM
Storage Scale: Concepts, Planning, and Installation Guide.
• IBM Storage Scale no longer uses the Linux kernel flag PF_MEMALLOC for its threads. Because of
this flag, the Linux kernel displayed a warning when the XFS file system was used for a local file
system on IBM Storage Scale nodes. After the removal of this flag, these warnings are no longer displayed.
File system protocol changes
• NFS-Ganesha is upgraded to version 4.3.
• The default option for the SMB configuration parameter fileid:algorithm is changed from
fsname to fsname_norootdir.
Installation toolkit changes
• Toolkit support for Remote mount configuration.
• Extended operating system certification and support.
• Toolkit code enhancement to work with latest Ansible library.
• ECE SED installation, upgrade and multi DA support for vdisk creation.
Management API changes
The following endpoints are modified:
• GET filesystems/{filesystemName}/filesets/{filesetName}
For more information, see the topic IBM Storage Scale management API endpoints in the IBM Storage
Scale: Command and Programming Reference Guide.
Native REST API (technology preview)
IBM Storage Scale introduces the Native Rest API feature as a technology preview feature. The
feature adds a new control plane component to the IBM Storage Scale stack for administering clusters
instead of the mm-command layer. The feature also adds a few security enhancements. For more
information, see the following IBM Storage Scale support page: https://www.ibm.com/support/pages/
node/7059676
ARM processor (technology preview)
IBM Storage Scale is supported on the ARM processor as a technology preview (nonproduction
environments), starting with IBM Storage Scale 5.1.9. IBM Storage Scale has been developed for
the ARM processor with an instruction set of at least version 8.2-A. For more information, see the
following IBM Storage Scale support page: https://www.ibm.com/support/pages/node/7066226

Commands, data types, and programming APIs
The following section lists the modifications to the documented commands, structures, and
subroutines:
New commands
There are no new commands.
New structures
There are no new structure changes.
New subroutines
There are no new subroutines.
New user exits
There are no new user exits.
Changed commands
• cloudkit
• gpfs.snap
• mmaddcallback
• mmafmconfig
• mmafmcosconfig
• mmafmcoskeys
• mmapplypolicy
• mmafmctl
• mmbackup
• mmces
• mmchconfig
• mmchfileset
• mmchfs
• mmcrfs
• mmdelacl
• mmdiag
• mmeditacl
• mmedquota
• mmfsckx
• mmgetacl
• mmhdfs
• mmhealth
• mmimportfs
• mmkeyserv
• mmobj
• mmnfs
• mmputacl
• mmremotecluster
• mmrestripefile
• mmsdrrestore
• mmsetquota
• spectrumscale
• mmxcp

Changed structures
There are no changed structures.
Changed subroutines
• gpfs_fcntl
Deleted commands
There are no deleted commands.
Deleted structures
There are no deleted structures.
Deleted subroutines
There are no deleted subroutines.
Messages
The following are the new, changed, and deleted messages:
New messages
6027-1831 and 6027-1832, 6027-2054 through 6027-2063, 6027-3270, 6027-3415 through
6027-3417, 6027-3617, and 6027-3954 through 6027-3956
Changed messages
6027-1242
Deleted messages
There are no deleted messages.

Chapter 1. Monitoring system health by using IBM Storage Scale GUI
The IBM Storage Scale system provides background monitoring capabilities to check the health of a
cluster and each node of the cluster, including all the services that are hosted on a node. You can view
the system health states or corresponding events for the selected health state on the individual pages,
widgets or panels of the IBM Storage Scale GUI. You can also view system health details by issuing the
mmhealth command options like mmhealth cluster show, mmhealth node show, or other similar
options.
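
For reference, the mmhealth views mentioned above can be displayed from any node in the cluster, for example:

# Cluster-wide health overview of all monitored components
mmhealth cluster show

# Health of the services that are hosted on the local node
mmhealth node show
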
The following table lists the system health monitoring options that are available in the IBM Storage Scale
GUI.

Table 3. System health monitoring options that are available in IBM Storage Scale GUI

Home
    Provides overall system health of the IBM Storage Scale system.
System overview widget in the Monitoring > Dashboard page
    Displays the number of events that are reported against each component.
System health events widget in the Monitoring > Dashboard page
    Provides an overview of the events that are reported in the system.
Timeline widget in the Monitoring > Dashboard page
    Displays the events that are reported in a particular timeframe on the selected performance chart.
Filesets with the largest growth rate last week widget in the Monitoring > Dashboard page
    Displays the filesets with the highest growth rate in the last one week.
File system capacity by fileset widget in the Monitoring > Dashboard page
    Displays the capacity reported per fileset in a file system. The per fileset capacity data requires quota enablement at the file system level.
Monitoring > Events
    Lists the events that are reported in the system. You can monitor and troubleshoot errors on your system from the Events page.
Monitoring > Tips
    Lists the tips that are reported in the system and allows user to hide or show tips. The tip events give recommendations to the user to avoid certain issues that might occur in the future.
Monitoring > Thresholds
    Lists the events that are raised when certain thresholds are reached for the data that is collected through performance monitoring sensors. For more information, see “Monitoring thresholds by using GUI” on page 7.
Monitoring > Event Notifications
    Enables you to configure event notifications to notify the users about significant event changes that occur in the system.
Nodes
    Lists the events that are reported at the node level.
Files > File Systems
    Lists the events that are reported at the file system level.
Files > Transparent Cloud Tiering
    Lists the events that are reported for the Transparent Cloud Tiering service. The GUI displays the page only when the transparent cloud tiering feature is enabled in the system.
Files > Filesets
    Lists events that are reported for filesets.
Files > Active File Management
    Displays health status and lists events that are reported for AFM cache relationship, AFM disaster recovery (AFMDR) relationship, and gateway nodes.
Storage > Pools
    Displays health status and lists events that are reported for storage pools.
Storage > NSDs
    Lists the events that are reported at the NSD level.

Note: The alerts and Tips icons on the IBM Storage Scale GUI header display the number of tips and
alerts that are received. They specify the number and age of events that are triggered. The notifications
disappear when the alert or tip is resolved.

Monitoring events by using GUI


You can primarily use the Monitoring > Events page to monitor the events that are reported in the system.
The status bar on the upper right of the GUI header also displays the number of events that are reported
in the system.
The events are raised against the respective component, for example, GPFS, NFS, SMB, and others. Some
of these events might occur multiple times in the system. Such events are grouped under the Event
Groups tab and the number of occurrences of the events are indicated in the Occurrences column. The
Individual Events tab lists all the events irrespective of the multiple occurrences.
A graphical view of events that are reported against each component is available. Clicking the graph
displays only the relevant events in the grid view. Clicking a section on the graphical view applies the
corresponding filter on the search action and fetches only the relevant data in the events table.
You can further filter the events that are listed in the Events page with the help of the following filter
options:
• Current Issues displays all unfixed errors and warnings.
• Current State lists active events that are generated because of state changes.
• Notices displays the events that are not caused by a state change of an entity or component. Notices
never become inactive on their own. Use the Mark Notices as Read action to make them historical
when read.
• All Events displays all the events irrespective of severity and type. It shows both active and historical
events. The historical events are displayed with a grey-colored icon.
The severity icons help to quickly determine whether the event is informational, a warning, or an error.
Click an event and select Properties from the Action menu to see detailed information on the event. The
event table displays the most recent events first.
The following table describes the severity levels of events.

Table 4. Notification levels

Error
    Error notification is sent to indicate a problem that must be corrected as soon as possible.
    This notification indicates a serious problem with the system. For example, the event that is being reported might indicate a loss of redundancy in the system, and it is possible that another failure might result in loss of access to data.
Warning
    A warning notification is sent to indicate a problem or unexpected condition with the system. Always immediately investigate this type of notification to determine the effect that it might have on your operation, and make any necessary corrections.
Information
    An informational notification is sent to indicate the occurrence of an expected event. For example, a NAS service is started. No remedial action is required when these notifications are sent.

Note: A separate event type with severity "Tip" is also available. Tips are the recommendations that are
given to ensure that you avoid certain issues that might occur in the future. The tip events are monitored
separately in the Monitoring > Tips page of the GUI.

Marking events as read


The basic types of events can be categorized in the following manner.
• State change events - The type of event that is generated when an entity changes its state.
• The non-state change events - The type of event that is generated when there is no change in the
entity's state.
The state change events are managed by the system and such events become inactive as soon as the
state changes again. The non-state change events are referred to as notices in the GUI. Notices never
become inactive on their own.
You must use the action Mark Selected Notices as Read or Mark All Notices as Read on the non-state
change events. By using these actions, you can make them historical because the system displays those
events as active events even if the problem or information is not valid anymore.
The result of mark as read operation is stored locally on the GUI node. That is, the changes that are made
to the events are not visible through the other GUI nodes.

Resolving Event
Some issues can be resolved manually. To resolve events created for such issues, select the event and
then click the Resolve Event option that is available under the Actions menu. On selecting the option,
the mmhealth event resolve command is run to resolve the specific event. You can also right-click
an event and select Resolve Event option from the drop-down menu that appears. On completion of the
task, the status appears in the task window. The complete event thread can be viewed under the detailed
view that you can access by using the View Details option.
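
The same resolution can be performed from the CLI; a minimal sketch, where the event name and the identifier are placeholders for the values shown in the event details:

# Resolve a manually resolvable event; the identifier is only needed for events
# that are raised against a specific entity
mmhealth event resolve <event_name> <identifier>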

Running fix procedure


Some issues can be resolved by running a fix procedure. To run a fix procedure, select Run Fix Procedure
option that is available in the Actions menu.

Event notifications
The system can use Simple Network Management Protocol (SNMP) traps and emails to notify you
when significant events are detected. Any combination of these notification methods can be used
simultaneously. Use Monitoring > Event Notifications page in the GUI to configure event notifications.
Notifications are normally sent immediately after an event is raised.
In email notification method, you can also define whether a recipient needs to get a report of events that
are reported in the system. These reports are sent only once in a day. Based on the seriousness of the
issue, each event that is reported in the system gets a severity level that is associated with it.
The following table describes the severity levels of event notifications.

Table 5. Notification levels

Error
    Error notification is sent to indicate a problem that must be corrected as soon as possible.
    This notification indicates a serious problem with the system. For example, the event that is being reported might indicate a loss of redundancy in the system, and it is possible that another failure might result in loss of access to data. The most typical reason for this type of notification is a hardware failure, but some configuration errors or fabric errors are also included in this notification level.
Warning
    A warning notification is sent to indicate a problem or unexpected condition with the system. Always immediately investigate this type of notification to determine the effect that it might have on your operation, and make any necessary corrections.
    A warning notification does not require any replacement parts and therefore does not require IBM Support Center involvement.
Information
    An informational notification is sent to indicate that an expected event has occurred. For example, a NAS service is started. No remedial action is required when these notifications are sent.

Configuring email notifications


The email feature transmits operational and error-related data in the form of an event notification email.
To configure an email server, from the Event Notifications page, select Email Server. Select Edit and
then click Enable email notifications. Enter required details and when you are ready, click OK.
Email notifications can be customized by setting a custom header and footer for the emails and
customizing the subject by selecting and combining from the following variables: &message, &messageId,
&severity, &dateAndTime, &cluster, and &component.
Emails containing the quota reports and other events that are reported in the following functional areas
are sent to the recipients:
• AFM and AFM DR
• Authentication
• CES network
• Transparent Cloud Tiering
• NSD
• File system
• GPFS

• GUI
• Hadoop connector
• iSCSI
• Keystone
• Network
• NFS
• Object
• Performance monitoring
• SMB
• Object authentication
• Node
• CES
You can specify the severity level of events and whether to send a report that contains a summary of the
events received.
To create email recipients, select Email Recipients from the Event Notifications page, and then click
Create Recipient.
Note: You can change the email notification configuration or disable the email service at any time.

Configuring SNMP manager


Simple Network Management Protocol (SNMP) is a standard protocol for managing networks and
exchanging messages. The system can send SNMP messages that notify personnel about an event. You
can use an SNMP manager to view the SNMP messages that the system sends.
With an SNMP manager, such as IBM Systems Director, you can view and act on the messages that the
SNMP agent sends. The SNMP manager can send SNMP notifications, which are also known as traps,
when an event occurs in the system. Select Settings > Event Notifications > SNMP Manager to configure
SNMP managers for event notifications. You can specify up to a maximum of six SNMP managers.
Note: The GUI-based SNMP service supports only SNMP traps for system health events. SNMP queries
are not supported. For information on the SNMP queries, which is independent of the GUI-based SNMP
trap service, see Chapter 10, “GPFS SNMP support,” on page 211.
In the SNMP mode of event notification, one SNMP notification (trap) with the object identifier (OID)
1.3.6.1.4.1.2.6.212.10.0.1 is sent by the GUI for each event. The following table provides the SNMP
objects that are included in the event notifications.

Table 6. SNMP objects included in event notifications

OID                              Description       Examples
1.3.6.1.4.1.2.6.212.10.1.1       Cluster ID        317908494245422510
1.3.6.1.4.1.2.6.212.10.1.2       Entity type       NODE, FILESYSTEM
1.3.6.1.4.1.2.6.212.10.1.3       Entity name       gss-11, fs01
1.3.6.1.4.1.2.6.212.10.1.4       Component         NFS, FILESYSTEM, NSD
1.3.6.1.4.1.2.6.212.10.1.5       Severity          INFO, TIP, WARNING, ERROR
1.3.6.1.4.1.2.6.212.10.1.6       Date and time     17.02.2016 13:27:42.516
1.3.6.1.4.1.2.6.212.10.1.7       Event name        nfs_active
1.3.6.1.4.1.2.6.212.10.1.8       Message           NFS service is now active.
1.3.6.1.4.1.2.6.212.10.1.9       Reporting node    The node where the problem is reported.

SNMP OID ranges


The following table gives the description of the SNMP OID ranges.

Table 7. SNMP OID ranges

OID range                        Description
1.3.6.1.4.1.2.6.212              IBM Storage Scale
1.3.6.1.4.1.2.6.212.10           IBM Storage Scale GUI
1.3.6.1.4.1.2.6.212.10.0.1       IBM Storage Scale GUI event notification (trap)
1.3.6.1.4.1.2.6.212.10.1.x       IBM Storage Scale GUI event notification parameters (objects)

The traps for the core IBM Storage Scale and those trap objects are not included in the SNMP notifications
that are configured through the IBM Storage Scale management GUI. For more information on SNMP
traps from the core IBM Storage Scale, see Chapter 10, “GPFS SNMP support,” on page 211.

Example for SNMP traps


The following example shows the SNMP event notification that is sent when performance monitoring
sensor is shut down on a node:

SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::enterprises.2.6.212.10.0.1


SNMPv2-SMI::enterprises.2.6.212.10.1.1 = STRING: "317908494245422510"
SNMPv2-SMI::enterprises.2.6.212.10.1.2 = STRING: "NODE"
SNMPv2-SMI::enterprises.2.6.212.10.1.3 = STRING: "gss-11"
SNMPv2-SMI::enterprises.2.6.212.10.1.4 = STRING: "PERFMON"
SNMPv2-SMI::enterprises.2.6.212.10.1.5 = STRING: "ERROR"
SNMPv2-SMI::enterprises.2.6.212.10.1.6 = STRING: "18.02.2016 12:46:44.839"
SNMPv2-SMI::enterprises.2.6.212.10.1.7 = STRING: "pmsensors_down"
SNMPv2-SMI::enterprises.2.6.212.10.1.8 = STRING: "pmsensors service should be started and is
stopped"
SNMPv2-SMI::enterprises.2.6.212.10.1.9 = STRING: "gss-11"

The following example shows the SNMP event notification that is sent for an SNMP test message:

SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::enterprises.2.6.212.10.0.1


SNMPv2-SMI::enterprises.2.6.212.10.1.1 = STRING: "317908494245422510"
SNMPv2-SMI::enterprises.2.6.212.10.1.2 = STRING: "CLUSTER"
SNMPv2-SMI::enterprises.2.6.212.10.1.3 = STRING: "UNKNOWN"
SNMPv2-SMI::enterprises.2.6.212.10.1.4 = STRING: "GUI"
SNMPv2-SMI::enterprises.2.6.212.10.1.5 = STRING: "INFO"
SNMPv2-SMI::enterprises.2.6.212.10.1.6 = STRING: "18.02.2016 12:47:10.851"
SNMPv2-SMI::enterprises.2.6.212.10.1.7 = STRING: "snmp_test"
SNMPv2-SMI::enterprises.2.6.212.10.1.8 = STRING: "This is a SNMP test message."
SNMPv2-SMI::enterprises.2.6.212.10.1.9 = STRING: "gss-11"

SNMP MIBs
The SNMP Management Information Base (MIB) is a collection of definitions that define the properties of
the managed objects.
The IBM Storage Scale GUI MIB OID range starts with 1.3.6.1.4.1.2.6.212.10. The OID
range 1.3.6.1.4.1.2.6.212.10.0.1 denotes IBM Storage Scale GUI event notification (trap) and
1.3.6.1.4.1.2.6.212.10.1.x denotes IBM Storage Scale GUI event notification parameters (objects).

While configuring SNMP, use the MIB file that is available at the following location of each GUI
node: /usr/lpp/mmfs/gui/IBM-SPECTRUM-SCALE-GUI-MIB.txt.
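
As an illustration only, a net-snmp based trap receiver could load this MIB so that incoming traps are decoded with symbolic names. The target directory and the MIB module name are assumptions that depend on your SNMP manager, and guinode is a placeholder host name:

# Copy the MIB from a GUI node into the trap receiver's MIB search path
scp guinode:/usr/lpp/mmfs/gui/IBM-SPECTRUM-SCALE-GUI-MIB.txt /usr/share/snmp/mibs/

# Start snmptrapd with the additional MIB so that received traps are translated
# (trap receiver authorization, for example authCommunity in snmptrapd.conf, is configured separately)
snmptrapd -m +IBM-SPECTRUM-SCALE-GUI-MIB -Lo
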
Related concepts
“GPFS SNMP support” on page 211
GPFS supports the use of the SNMP protocol for monitoring the status and configuration of the GPFS
cluster. Using an SNMP application, the system administrator can get a detailed view of the system and be
instantly notified of important events, such as a node or disk failure.

Monitoring tips by using GUI


You can monitor events of type "Tips" from the Monitoring > Tips page of the GUI. The tip events give
recommendations to the user to avoid certain issues that might occur in the future. Some tips are targeted
for optimization, such as tuning the system for better performance. Certain tips help the users to fully
use capabilities, such as tips on how to enable sensors. A tip disappears from the GUI when the problem
behind the tip event is resolved.
Select Properties on the Actions menu to view the details of the tip. After you review the tip, decide
whether it requires attention or can be ignored. Select Hide on the Actions menu to ignore the events
that are not important and select Show to mark the tips that require attention. You cannot hide an already
resolved tip.
You can filter the tip events with the help of the following set of filters:
• Active Tips: This filter shows all active and unresolved tips.
• Hidden Tips: This filter shows the unresolved tips that are marked as hidden by using the Hide option.
Use the Show option in the Actions menu to mark the tips that require attention.
• All Tips: This filter shows all the tips that are reported in the system. It shows active, resolved, and
hidden tips.
• Actions > Filter By Date and Time
• Actions > Show Entries Within Minutes, Hours, or Days
Use the Reset Date filter option to remove the date filter.
The Active Tips filter shows all tips that are currently active and the Hidden Tips filter shows the tips that
are marked as hidden.
Fix procedures are available for certain tip events. To fix such tips, click the Run Fix Procedure option
that is available under the Action column and follow the instructions.

Monitoring thresholds by using GUI


You can configure the IBM Storage Scale to raise events when certain thresholds are reached. Use the
Monitoring > Thresholds page to define or modify thresholds for the data that is collected through the
performance monitoring sensors.
You can set the following two types of threshold levels for data that is collected through performance
monitoring sensors:
Warning level
When the data that is being monitored reaches the warning level, the system raises an event with
severity "Warning". When the observed value exceeds the current threshold level, the system removes
the warning.
Error level
When the data that is being monitored reaches the error level, the system raises an event with
severity "Error". When the observed value exceeds the current threshold level, the system removes
the error state.
Certain types of thresholds are predefined in the system. The following predefined thresholds are
available:

• Inode utilization at the fileset level
• Datapool capacity utilization
• Metadata pool capacity utilization
• Free memory utilization
Apart from the predefined thresholds, you can also create user-defined thresholds for the data that is
collected through the performance monitoring sensors.
You can use the Monitoring > Thresholds page in the GUI and the mmhealth command to manage both
predefined and user-defined thresholds.
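
As a sketch of the CLI side, threshold rules can be listed and created with the mmhealth thresholds commands; the metric name, levels, and rule name below are placeholders rather than recommended values:

# List the currently active threshold rules (predefined and user defined)
mmhealth thresholds list

# Create a user-defined rule for a performance metric
mmhealth thresholds add <metric> --errorlevel <value> --warnlevel <value> --name <rule_name>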

Defining thresholds
Use the Create Thresholds option to define user-defined thresholds or to modify the predefined
thresholds. You can use the Use as Template option that is available in the Actions menu to use an
already defined threshold as the template to create a threshold. You can specify the following details in a
threshold rule:
• Metric category: Lists all performance monitoring sensors that are enabled in the system and
thresholds that are derived by performing certain calculations on certain performance metrics. These
derived thresholds are referred to as measurements. The measurements category provides the flexibility to
edit certain predefined threshold rules. The following measurements are available for selection:
DataPool_capUtil
Datapool capacity utilization, which is calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/
sum(gpfs_pool_total_dataKB)
DiskIoLatency_read
Average time in milliseconds spent for a read operation on the physical disk. Calculated as:
disk_read_time/disk_read_ios
DiskIoLatency_write
Average time in milliseconds spent for a write operation on the physical disk. Calculated as:
disk_write_time/disk_write_ios
Fileset_inode
Inode capacity utilization at the fileset level. This is calculated as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/
sum(gpfs_fset_maxInodes)
FsLatency_diskWaitRd
File system latency for the read operations. Average disk wait time per read operation on the IBM
Storage Scale client.
sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)
FsLatency_diskWaitWr
File system latency for the write operations. Average disk wait time per write operation on the IBM
Storage Scale client.
sum(gpfs_fs_tot_disk_wait_wr)/sum(gpfs_fs_write_ops)
MemoryAvailable_percent
Estimated available memory percentage. Calculated as:
– For the nodes that have less than 40 GB total memory allocation:
(mem_memfree+mem_buffers+mem_cached)/mem_memtotal
– For the nodes that have equal to or greater than 40 GB memory allocation:
(mem_memfree+mem_buffers+mem_cached)/40000000

MetaDataPool_capUtil
Metadata pool capacity utilization. This is calculated as:
(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/
sum(gpfs_pool_total_metaKB)
MemoryAvailable_percent
Memory percentage available is calculated to the total RAM or 40 GB, whichever is lower, at the
node level.
(memtotal < 48000000) * ((memfree + memcached + membuffers)/memtotal * 100)
(memtotal >= 48000000) * ((memfree + memcached + membuffers)/40000000 * 100)
DiskIoLatency_read
I/O read latency at the disk level.
disk_read_time/disk_read_ios
DiskIoLatency_write
I/O write latency at the disk level.
disk_write_time/disk_write_ios
NFSNodeLatency_read
NFS read latency at the node level.
sum(nfs_read_lat)/sum(nfs_read_ops)
NFSNodeLatency_write
NFS writes latency at the node level.
sum(nfs_write_lat)/sum(nfs_write_ops)
SMBNodeLatency_read
SMB read latency at the node level.
avg(op_time)/avg(op_count)
SMBNodeLatency_write
SMB writes latency at the node level.
avg(op_time)/avg(op_count)
• Metric name: The list of performance metrics that are available under the selected performance
monitoring sensor or the measurement.
• Name: User-defined name of the threshold rule.
• Filter by: Defines the filter criteria for the threshold rule.
• Group by: Groups the threshold values by the selected grouping criteria. If you select a value in this
field, you must select an aggregator criteria in the Aggregator field. By default, there is no grouping,
which means that the thresholds are evaluated based on the finest available key.
• Warning level: Defines the threshold level for warning events to be raised for the selected metric. When
the warning level is reached, the system raises an event with severity "Warning". You can customize the
warning message to specify the user action that is required to fix the issue.
• Error level: Defines the threshold level for error events to be raised for the selected metric. When the
error level is reached, the system raises an event with severity "Error". You can customize the error
message to specify the user action that is required to fix the issue.
• Aggregator: When grouping is selected in the Group by field, an aggregator must be chosen to define
the aggregation function. When the Rate aggregator is set, the grouping is automatically set to the finest
available grouping.
• Downsampling: Defines the operation to be performed on the samples over the selected monitoring
interval.
• Sensitivity: Defines the sample interval value. If a sensor is configured with an interval period greater than
5 minutes, then the sensitivity is set to the same value as the sensor's period. The minimum value that
is allowed is 120 seconds. If a sensor is configured with an interval period less than 120 seconds, the
--sensitivity value is set to 120 seconds.
• Hysteresis: Defines the percentage of the observed value that must be under or over the current
threshold level to switch back to the previous state. The default value is 0%. Hysteresis is used to avoid
frequent state changes when the values are close to the threshold. The level needs to be set according
to the volatility of the metric.
• Direction: Defines whether the events and messages are sent when the value that is being monitored
exceeds or goes less than the threshold level.
You can also edit and delete a threshold rule.
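
For example, a rule that is no longer needed can also be removed from the CLI; the rule name is a placeholder:

# Delete a user-defined threshold rule by its name
mmhealth thresholds delete <rule_name>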

Threshold configuration - A scenario


The user wants to configure a threshold rule to monitor the maximum disk capacity usage. The
following table shows the values against each field of the Create Threshold dialog and their respective
functionality.

Table 8. Threshold rule configuration - A sample scenario

Metric Category: GPFSDiskCap
    Specifies that the threshold rule is going to be defined for the metrics that belong to the GPFSDiskCap sensor.
Metric name: Available capacity in full blocks
    The threshold rule is going to be defined to monitor the threshold levels of available capacity.
Name: Total capacity threshold
    By default, the performance monitoring metric name is used as the threshold rule name. Here, the default value is overwritten with "Total capacity threshold".
Filter by: Cluster
    The values are filtered at the cluster level.
Group by: File system
    Groups the selected metric by file system.
Aggregator: Minimum
    When the available capacity reaches the minimum threshold level, the system raises an event. If the following values are selected, then the nature of the threshold rule changes:
    • Sum: When the sum of the metric values exceeds the threshold levels, the system raises the events.
    • Average: When the average value exceeds the threshold levels, the system raises the events.
    • Maximum: When the maximum value exceeds the threshold levels, the system raises the events.
    • Minimum: When the minimum value goes less than the threshold levels, the system raises the events.
    • Rate: When the rate exceeds the threshold value, the system raises the events. Rate is only added for the "finest" group by clause; a rate for a "partial key" is not supported. That is, when Rate is selected, the system automatically selects the best possible values in the grouping field.
Downsampling: None
    Specifies how the tested value is computed from all the available samples in the selected monitoring interval, if the monitoring interval is greater than the sensor period:
    • None: The values are averaged.
    • Sum: The sum of all values is computed.
    • Minimum: The minimum value is selected.
    • Maximum: The maximum value is selected.
Warning level: 10 GiB
    The system raises an event with severity Warning when the available capacity reaches 10 GiB.
Error level: 9 GiB
    The system raises an event with severity Error when the available capacity reaches 9 GiB.
Sensitivity: 24 hours
    The threshold value is monitored once in a day.
Hysteresis: 0
    When the available capacity increases to more than 10 GiB again, the warning state is removed. Similarly, when it increases to more than 9 GiB, the error state is removed.
Direction: Low
    When the value that is being monitored goes less than the threshold limit, the system raises an event.

Chapter 2. Monitoring system health by using the mmhealth command
The mmhealth command displays the results of the background monitoring for the health of a node and
services that are hosted on the node. You can use the mmhealth command to view the health status of a
whole cluster in a single view.
Every service that is hosted on an IBM Storage Scale node has its own health monitoring service. All
the subcomponents like the file system or network interfaces are monitored through the monitoring
service of their main component. Only the subcomponents of CES service such as NFS, SMB, Object, and
authentication have their own health monitors. The mmhealth command gets the health details from
these monitoring services. The role of a node in monitoring determines the components that need to
be monitored. This is an internal node role, and a node can have more than one role. For example, a
CES node can also be a node with file systems and performance monitoring. The role of the node also
determines the monitoring service that is required on a specific node. For example, you do not need a CES
monitoring on a non-CES node. The monitoring services are only started if a specific node role is assigned
to the node. Every monitoring service includes at least one monitor.
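
As a quick reference, the health of an individual service and the recorded events on a node can be queried as follows; the component name is a placeholder:

# Show the health of one service on the local node, including sub-entities and hidden events
mmhealth node show <ComponentName> --verbose

# Show the events that were recorded on the local node
mmhealth node eventlog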

Prerequisites
The following criteria must be met to use the health monitoring function on your GPFS cluster:
• Only Linux and AIX nodes are supported.
• All operating systems that are running IBM Storage Scale 5.1.x, including AIX nodes, must have Python
3.6 or later installed.
• All operating systems that are running IBM Storage Scale 5.0.x, including AIX nodes, must have Python
2.7 installed.
• CCR must be enabled.
• The cluster must be able to use the mmhealth cluster show command.

Known limitations
The mmhealth command has the following limitations:
• Only GPFS monitoring is supported on AIX.
• The mmhealth command does not fully monitor Omni-Path connections.

Related concepts
“Monitoring system health by using IBM Storage Scale GUI” on page 1
The IBM Storage Scale system provides background monitoring capabilities to check the health of a
cluster and each node of the cluster, including all the services that are hosted on a node. You can view
the system health states or corresponding events for the selected health state on the individual pages,
widgets or panels of the IBM Storage Scale GUI. You can also view system health details by issuing the
mmhealth command options like mmhealth cluster show, mmhealth node show, or other similar
options.

Monitoring the health of a node


The following list provides the details of the monitoring services available in the IBM Storage Scale
system:

General
1. CALLHOME

• Node role: The call home nodes or group servers get this node role.
• Task: Monitors the call home feature and sends call home heartbeats.
2. DISK
• Node role: IBM Storage Scale nodes with the node class nsdNodes monitor the DISK service.
• Task: Checks whether the IBM Storage Scale disks are available and running.
3. ENCRYPTION
• Node role: A node that is configured with file system encryption with a Remote Key Manager,
RKM.conf.
• Task: Displays the events that are related to the configuration of the encrypted file systems.
4. FILE SYSTEM
• Node role: This node role is active on all IBM Storage Scale nodes.
• Task: Monitors different aspects of IBM Storage Scale file systems.
5. FILEAUDITLOG
• Node role: All nodes get the FILEAUDITLOG producer node role if fileauditlog is made active for
a file system.
• Task: Monitors the event producer state for nodes that have this role.
6. FILESYSMGR
• Node role: A file system manager node, which can be detected by using the mmlsmgr command.
• Task: Shows the file systems where the current node acts as a manager. If quality of service
monitoring (QoS) is enabled for a file system, it might show more hints.
7. GDS
• Node role: A node that is configured for GDS (verbsGPUDirectStorage=enable) and has the
gdscheck program installed. The full path to this program must be declared in the /var/mmfs/
mmsysmon/mmsysmonitor.conf file.
For more information, see the gpudirect section of the /var/mmfs/mmsysmon/
mmsysmonitor.conf file to modify the gdscheckfile variable.
Declare the path of the gdscheck program:
gdscheckfile = /usr/local/cuda/gds/tools/gdscheck
• Task: Monitors the health state of the GDS configuration based on the gdscheck program output.
For more information on GPUDirect Storage (GDS) support for IBM Storage Scale, see GPUDirect
Storage support for IBM Storage Scale in the IBM Storage Scale: Concepts, Planning, and Installation
Guide.
8. GPFS
• Node role: This node role is always active on all IBM Storage Scale nodes.
• Task: Monitors all GPFS daemon-related functions. For example, mmfsd process and gpfs port
accessibility.
9. GUI
• Node role: Nodes with the node class GUI_MGMT_SERVERS monitor the GUI service.
• Task: Verifies whether the GUI services are functioning properly.
10. HEALTHCHECK
• Node role: The call home nodes or group servers get this node role.
• Task: Monitors health check service alert events and raises dynamic health events for the
HEALTHCHECK component on the first call home server node.

For more information, see Monitoring system health by using the HEALTHCHECK component in the IBM
Storage Scale: Problem Determination Guide.
11. LOCAL CACHE
• Node role: This node role is active when local read-only cache disks are configured.
• Task: Monitors the health state of the local read-only cache devices.
12. NETWORK
• Node role: This node role is active on every IBM Storage Scale node.
• Task: Monitors all IBM Storage Scale relevant IP-based (Ethernet + IPoIB) and InfiniBand RDMA
networks.
13. NVMeoF
• Node role: The node, which is contained in the NVMeoFFT_SERVERS node class, monitors the
NVMeoF service.
• Task: Monitors the NVMeoF feature for the nodes that have this node role.
14. PERFMON
• Node role: Nodes where PerfmonSensors or PerfmonCollector services are running get the PERFMON
node role. PerfmonSensors are determined through the perfmon designation in mmlscluster.
PerfmonCollector are determined through the colCandidates line in the configuration file.
• Task: Monitors whether PerfmonSensors and PerfmonCollector are running as expected.
15. SERVERRAID
• Node role: An Elastic Storage Server (ESS) node, which is configured for IBM Power® RAID (IPR) and
has the /sbin/iprconfig program installed.
• Task: Monitors the health state of the IBM Power RAID based on the /sbin/iprconfig program
output.
16. THRESHOLD
• Node role: Nodes where the performance data collection is configured and enabled. If a node role is
not configured to PERFMON, it cannot have a THRESHOLD role either.
• Task: Monitors whether the node-related thresholds rules evaluation is running as expected, and if
the health status changed as a result of the threshold limits being crossed.
Note: In a mixed environment, where some nodes in the cluster run IBM Storage Scale
versions that are different from the versions that run on other nodes, the threshold
service is not available.
17. WATCHFOLDER
• Node role: All nodes get the WATCHFOLDER producer node role if watchfolder is made active for
a file system.
• Task: Monitors the event producer state for nodes that have this role.

GNR
1. NVMe
• Node role: Node must be either an Elastic Storage Server (ESS) node or an ECE node that is
connected to an NVMe device.
• Task: Monitors the health state of the NVMe devices.

Interface
1. AFM
• Node role: The AFM monitoring service is active if the node is a gateway node.

Note: To know whether the node is a gateway node, run the mmlscluster command.
• Task: Monitors the cache states and different user exit events for all the AFM fileset.
2. CES
• Node role: This node role is active on the CES nodes that are listed by mmlscluster --ces. After
a node obtains this role, all corresponding CES sub services are activated on that node. The CES
service does not have its own monitoring service or events. The status of the CES is an aggregation of
the status of its sub services. The following sub services are monitored:
a. AUTH
– Task: Monitors LDAP, AD, and/or NIS-based authentication services.
b. AUTH_OBJ
– Task: Monitoring the OpenStack identity service functions.
c. BLOCK
– Task: Checks whether the iSCSI daemon is functioning properly.
d. CESNETWORK
– Task: Monitoring CES network-related adapters and IP addresses.
e. HDFS_NAMENODE
– Node role: CES nodes that belong to an HDFS CES group.
– Task: Checks whether the HDFS_NAMENODE process is running correctly and is healthy. It also
monitors if the ACTIVE or STANDBY state of the name node is correct as only one HDFS node in
an HDFS group can be the active node.
f. NFS
– Task: Monitoring NFS-related functions.
g. OBJECT
– Task: Monitors the IBM Storage Scale for object functions. Especially, the status of relevant system services and accessibility to ports are checked.
h. SMB
– Task: Monitoring SMB-related functions like the smbd process, the ports and ctdb processes.
3. CESIP
• Node role: A cluster manager node, which can be detected by using the mmlsmgr -c command. This
node runs a special code module of the monitor, which checks the cluster-wide CES IP distribution.
• Task: Compares the effectively hosted CES IPs with the list of declared CES IPs in the address pool
and reports the result. There are three cases:
– All declared CES IPs are hosted. In this case, the state of the IPs is HEALTHY.
– None of the declared CES IPs are hosted. In this case, the state of the IPs is FAILED.
– Only a subset of the declared CES IPs is hosted. In this case, the state of the IPs is DEGRADED.
Note: A FAILED state does not trigger any failover.
4. CLOUDGATEWAY
• Node role: Identified as a Transparent cloud tiering node. All nodes that are listed in the
mmcloudgateway node list get this node role.
• Task: Check whether the cloud gateway service functions as expected.
5. HADOOPCONNECTOR
• Node role: Nodes where the Hadoop service is configured get the Hadoop connector node role.
• Task: Monitors the Hadoop data node and name node services.

6. HDFS_DATANODE
• Node role: CES HDFS service is enabled on the cluster and the node is configured as an HDFS data
node.
Note: The /usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode is-enabled command is used
to find out whether the node is configured as the data node.
• Task: Monitors the HDFS data node service process.

Stretch cluster monitoring


Stretch cluster monitoring is set up by defining the node classes, which specify one or more IBM Storage
Scale nodes that belong to a site. Each site node class name must start with SCALE_SITE_. The
character(s) after SCALE_SITE_ are used as the site name. For example, SCALE_SITE_A would define
a stretch cluster site named A. The site name is visible in the mmhealth command output when showing
site health events.
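
A minimal sketch of defining two sites, assuming that the mmcrnodeclass command is used to create the site node classes and that the node names are placeholders:

# Define the node classes for the stretch cluster sites A and B
mmcrnodeclass SCALE_SITE_A -N nodeA1,nodeA2,nodeA3
mmcrnodeclass SCALE_SITE_B -N nodeB1,nodeB2,nodeB3

# Verify the user-defined node classes
mmlsnodeclass --user
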
1. STRETCH CLUSTER
• Node role: A cluster manager node, which can be detected by using the mmlsmgr -c command. This
node runs a special code module to monitor the health of a stretch cluster configuration.
• Task: Checks the health of the sites that are defined in a stretch cluster configuration and reports any
discovered health issues, which include:
a. All sites have no issues that would affect the health of the stretch cluster. In this case, the state of
the sites is HEALTHY.
b. Sites are experiencing file system replication issues. In this case, the state of the sites is
DEGRADED.
c. Sites are experiencing process, network, or hardware issues. In this case, the state of the sites is
FAILED.

Note: Users can now create and raise custom events. For more information, see “Creating, raising, and
finding custom defined events” on page 19.
For a list of all the available events, see “Events” on page 559.

Running a user-defined script when an event is raised


The root user has the option to create a callback script that is triggered when an event is raised.
This callback script is called every time for a new event, irrespective of whether it is an information event
or a state-changing event. For state-changing events the script is triggered when the event becomes
active. For more information about the different events, see “Event type and monitoring status for system
health” on page 18.
When the script is created, ensure that:
• The script must be saved in the /var/mmfs/etc/ location, and must be named eventsCallback.
• The script is created by a root user. The script is called only when it is created by the root user. The uid
of the root user is 0.
The script is called with the following space-separated argument list:
version date time timezone event_name component_name identifier severity_abbreviation state_abbreviation message event_arguments
where:
version
The version argument displays the version number of the script. The version is set to 1 by default.
This value is incremented by 1 for every change in the callout's format or functionality. In the sample
output, the version is 1.
date
The date is in the yyyy-mm-dd format. In the sample output, the date is 2018-02-23.

time
The time is in the hour:minutes:seconds.milliseconds format. In the sample output, the time is
00:01:07.499834.
timezone
The timezone is based on the server settings. In the sample output, the timezone is EST.
event_name
The event_name argument displays the name of the event. In the sample output, the event_name is
pmsensors_up.
component_name
The component_name argument displays the name of the reporting component. In the sample
output, the component_name is perfmon.
identifier
The identifier argument displays the entity's name if the require_unique value for the event is set
to True. If there is no identifier, then ‘’ is returned. In the sample output, the identifier is ''.
severity_abbreviation
The severity_abbreviation argument usually displays the first letter of Info, TIP, Warning or Error. In
the sample output, the severity_abbreviation is I.
message
The message is framed with a single quotation mark (' '). Single ticks in the message are encoded to
\\'. The event_arguments are already included in the message, so parsing from the customer is not
needed. In the sample output, the message is pmsensors service as expected, state is
started.
event_arguments
The event_arguments is framed with single quotation mark (' '). Single ticks in the arguments are
encoded to \\'. Arguments are already included in the message. The arguments are comma separated,
and without a space. Arguments are %-encoded, so the following characters are encoded into their
hexadecimal unicode representative starting with a %-character:
\t\n !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~
For example, % is encoded to %25, and # is encoded to %23.
A sample call of the script is displayed:

/var/mmfs/etc/eventsCallback 1 2018-02-23 00:01:07.499834 EST pmsensors_up perfmon '' I H 'pmsensors service as expected, state is started' 'started'

Important: The script is started synchronously within the monitoring cycle, therefore it must be
lightweight and return a value quickly. The recommended runtime is less than 1 second. Long running
scripts are detected, logged and killed. The script has a hard timeout of 60 seconds.
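
For illustration, a minimal callback script could simply append one line per event to a log file. This is a sketch only, not part of the product, and the log file location is an assumption:

#!/bin/bash
# /var/mmfs/etc/eventsCallback - minimal example callback
# Positional arguments (see the list above): 1=version 2=date 3=time 4=timezone
# 5=event_name 6=component_name 7=identifier 8=severity_abbreviation
# 9=state_abbreviation 10=message 11=event_arguments
LOG=/var/log/scale-events.log    # assumed log location

# Keep the script lightweight: append a single line and return immediately
echo "$2 $3 $4 event=$5 component=$6 id=$7 severity=$8 state=$9" >> "$LOG"
exit 0

Remember to make the script executable, for example with chmod 755 /var/mmfs/etc/eventsCallback.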

Event type and monitoring status for system health


An event might trigger a change in the state of a system.
The following three types of events are reported in the system:
• State-changing events: The state-changing events change the state of a component or entity from good
to bad or from bad to good depending on the corresponding state of the event.
Note: An event is raised when the health status of the component goes from good to bad. For example,
an event is raised that changes the status of a component from HEALTHY to DEGRADED. However, if
the state was already DEGRADED based on another active event, there is no change in the status of the
component. Also, when the state of the entity was FAILED, a DEGRADED event would not change the
component's state because a FAILED status is more dominant than the DEGRADED status.
• Tip: The tips are similar to state-changing events, but can be hidden by the user. Like state-changing
events, a tip is removed automatically when the problem is resolved. A tip event always changes the
state to of a component from HEALTHY to TIPS if the event is not hidden.

Note: If the state of a component changes to TIPS, it can be hidden. However, you can still view the
active hidden events by using the mmhealth node show ComponentName --verbose command, if
the cause for the event still exists.
• Information events: The information events are notices that are shown in the event log or in brackets in
the mmhealth node show command. They do not change the state of the component. They disappear
after 24 hours or when they are resolved by the mmhealth event resolve command.
The monitoring interval is 15 - 30 seconds, depending on the component. However, some services are
monitored less often (for example, once per 30 minutes) to save system resources. You can find more
information about the events from the Monitoring > Events page in the IBM Storage Scale GUI or by
issuing the mmhealth event show command.
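
For example, the definition behind any event name can be looked up from the CLI; the event name below is taken from the SNMP trap example in Chapter 1 and is only illustrative:

# Show the stored description, cause, and user action for an event definition
mmhealth event show pmsensors_down
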
The following are the possible status of nodes and services:
• UNKNOWN - Status of the node or the service that is hosted on the node is not known because of
a problem with monitoring. In most cases, this is accompanied by an exception in the /var/adm/ras/
mmsysmonitor.log file where the root cause of the problem is described.
• HEALTHY - The node or the service that is hosted on the node is working as expected. There are no
active error events.
• CHECKING - The monitoring of a service or a component that is hosted on the node is starting at the
moment. This state is a transient state, which changes to another state when the mmsysmon daemon
initialization is completed.
• TIPS - An issue might be reported with the configuration and tuning of the components. This status is
only assigned to a tip event.
• DEGRADED - The node or the service that is hosted on the node is not working as expected. This means
that a problem with the component did not cause a complete component failure.
• FAILED - The node or the service that is hosted on the node failed due to errors or cannot be reached
anymore.
• DEPEND - The node or the services that are hosted on the node failed due to the failure of some
components. For example, an NFS or SMB service shows this status if the authentication failed.
The status is graded as follows: HEALTHY < TIPS < DEGRADED < FAILED. For example, the status of the
service that is hosted on a node becomes FAILED if there is at least one active event in the FAILED
status for that corresponding service. When the status of a service is set, FAILED takes priority over
DEGRADED, which is followed by TIPS and then HEALTHY. That is, if a service has an active event with a
HEALTHY status and another active event with a FAILED status, then the system sets the status of the
service as FAILED.
Some directed maintenance procedures or DMPs are available to solve issues caused by tip events.
For information, see Directed maintenance procedures for tip events in IBM Storage Scale: Problem
Determination Guide.
New encryption events are added that are identified by a unique ID. An event can be raised multiple
times, but it is listed only once for each unique ID. Therefore, multiple events can be displayed at the
same time, but only one for each unique ID, regardless of how many times each event is raised.
These events are cleared by using the mmhealth event resolve <event name> <event id>
command.
For more information, see the Encryption events in IBM Storage Scale: Problem Determination Guide.

Creating, raising, and finding custom defined events


The user can create and raise custom health events in IBM Storage Scale, which can be displayed by
using the mmhealth node eventlog command. You can also view these events in the IBM Storage
Scale GUI, under Monitoring > Events.
To raise a custom event, the event must first be created, and the health daemon must be made aware of
the existence of the custom event. Follow these steps to create and raise a custom event:

1. Create a /var/mmfs/mmsysmon/custom.json file and create a symlink to this file
under /usr/lpp/mmfs/lib/mmsysmon/ (see the example commands at the end of this step).
Keeping the file under /var/mmfs/mmsysmon ensures that an upgrade or an RPM removal does not
delete it. The file must include the custom events in JSON format as shown:

{
"event_name_1":{
"cause":"",
"user_action":"",
"scope":"NODE",
"code":"cu_xyz",
"description":"",
"event_type":"INFO",
"message":"",
"severity": ["INFO" | "WARNING" | "ERROR"]
},
"event_name_2":{
[…]
}

Note: You might need to re-create the symlink after an update.


When you create an event in the custom.json file, ensure that the event name is unique and
specified in lowercase. Only the events that are specified in lowercase are displayed in the GUI.
You can also fill the cause, description, and message fields with descriptive text. The message can
include placeholders in the form of {n}, where n is an integer value. These placeholders are filled with
arguments that are provided at the time of raising the event. A new line (\n) must not be included in
the message.
The code must be a unique alphanumeric value that does not exceed a length of six characters. It is
recommended to use a form like cu_xyz, where xyz must be replaced with a unique number. If there
is a duplication in the codes, the daemon throws an error and does not load the custom.json file. The
GUI ignores the events if the ID is longer than six characters. Using this format ensures that there is no
collision with the system-provided events after an update of the gpfs.base file.
Note: While the severity can be set as INFO, WARNING, or ERROR, event_type must be set to
INFO.
For example:

[root@test-21 ~]# cat /usr/lpp/mmfs/lib/mmsysmon/custom.json


{
"event_1":{
"cause":"cause text",
"user_action":"user_action text",
"scope":"NODE",
"entity_type":"NODE",
"code":"ev_001",
"description":"description text",
"event_type":"INFO",
"message":"message with argument {0} and {1}",
"severity": "INFO"
}
}
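
The file and the symlink from this step can be created with commands similar to the following sketch.
The symlink name custom.json under /usr/lpp/mmfs/lib/mmsysmon/ is an assumption that is based on
the description at the beginning of this step:

mkdir -p /var/mmfs/mmsysmon
vi /var/mmfs/mmsysmon/custom.json     # add the JSON content shown in the example above
ln -s /var/mmfs/mmsysmon/custom.json /usr/lpp/mmfs/lib/mmsysmon/custom.json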

2. Restart the health daemon using the systemctl restart mmsysmon.service command.
The daemon does not load the custom.json file if there are duplicated codes in the events. The
daemon status can be checked using the systemctl status mmsysmon.service command.
3. Run the mmhealth event show <event_name> command, where <event_name> is the name of
the custom event.
The system gives output similar to the following:

[root@test-21 ~]# mmhealth event show event_1


Event Name: event_1

Description: description text
Cause: cause text
User Action: user_action text
Severity: INFO
State: Not state changing

If the custom.json file was loaded successfully, the command returns the event's information. If the
custom.json file was not loaded successfully, an error message is displayed stating that this event is
not known to the system.
4. Repeat steps 1-3 on all nodes.
5. Restart the GUI by using the systemctl restart gpfsgui.service command to make the
GUI aware of the new events.
6. Run the following command to raise the custom event:

mmsysmonc event custom [<event_name> | <event_code>] <arguments>

The system gives output similar to the following:

[root@test-21 ~]# mmsysmonc event custom event_1 "arg1,arg2"


Event event_1 raised
[root@test-21 ~]# mmhealth node eventlog
Node name: test-21d.localnet.com
Timestamp Event Name Severity Details
2018-07-30 10:07:08.404072 CEST eventlog_cleared INFO On the node
test-21d.localnet.com
the eventlog was cleared.
2018-07-30 10:07:20.228457 CEST event_1 INFO message with argument arg1 and
arg2

Note: You can raise a new custom event only after you restart the gpfsgui and mmsysmon daemons.
The <arguments> must be a comma-separated list enclosed in double quotation marks. For
example, “arg1,arg2,…,argN”.
You can use the mmhealth node eventlog command to display the log of when an event was
raised. You can also configure emails to notify users when a custom event is raised. For more
information, see “Event notifications” on page 4.
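
Because the mmsysmonc event custom command can be called from any script, you can integrate
custom events into your own checks. The following is a minimal sketch that raises the event_1 event
from the earlier example when the file system that contains /var exceeds 90% usage; the monitored
path and the 90% limit are illustrative assumptions:

#!/bin/bash
# Hypothetical check: raise the custom event event_1 when /var usage exceeds 90%.
usage=$(df --output=pcent /var | tail -1 | tr -dc '0-9')
if [ "$usage" -gt 90 ]; then
    # The two arguments fill the {0} and {1} placeholders of the event message.
    mmsysmonc event custom event_1 "/var,${usage}%"
fi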

Finding the event in the GUI


To find the custom events in IBM Storage Scale GUI, go to Monitoring > Events page. Change the filter
option of the Events page from Current Issues to Notices. The Components column in the Events page
indicates whether the event is a custom event.

Updating the event log after upgrade


When upgrading to IBM Storage Scale 5.0.5.3 or a later version, the nodes where no sqlite3 package is
installed have their RAS event logs converted to a new database format in order to prevent known issues.
The old RAS event log is emptied automatically. You can verify that the event log has been emptied either
using the mmhealth node eventlog command or in the IBM Storage Scale GUI.
Note: The event logs are updated only the first time IBM Storage Scale is upgraded to version 5.0.5.3 or
higher.

Using the REST API with custom events


Health information can be requested through the IBM Storage Scale management API. This is also true
for the custom events. For more information on how to use REST API with custom events, see the IBM
Storage Scale management API in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
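
For example, a request similar to the following sketch returns the health events, including raised custom
events, of a node. The endpoint path, port, and credentials are assumptions; check the management API
documentation for the exact resource names:

curl -k -u admin:password -X GET 'https://<gui-node>:443/scalemgmt/v2/nodes/<node-name>/health/events'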

Threshold monitoring for system health
Threshold monitoring is a service that helps the user to identify performance issues by defining threshold
rules for the selected performance metrics. For more information on performance metrics, see
“Configuring the performance monitoring tool” on page 106 and “List of performance metrics” on page
117.
You can configure IBM Storage Scale to raise events when certain thresholds are reached. As soon as one
of the metric values exceeds a threshold limit, the system health daemon receives an event notification
from the monitoring process. Then, the health daemon generates a log event and updates the health
status of the corresponding component.
You can set the following two types of threshold levels for the data that is collected through performance
monitoring sensors:
Warning level
When the data that is being monitored reaches the warning level, the system raises an event with
severity Warning. When the observed value returns to a normal level, the system removes
the warning.
Error level
When the data that is being monitored reaches the error level, the system raises an event with
severity Error. When the observed value returns to a normal level, the system removes
the error state.
Related concepts
“Monitoring system health by using IBM Storage Scale GUI” on page 1
The IBM Storage Scale system provides background monitoring capabilities to check the health of a
cluster and each node of the cluster, including all the services that are hosted on a node. You can view
the system health states or corresponding events for the selected health state on the individual pages,
widgets or panels of the IBM Storage Scale GUI. You can also view system health details by issuing the
mmhealth command options like mmhealth cluster show, mmhealth node show, or other similar
options.

Threshold monitoring prerequisites


If you did not use the IBM Storage Scale installation toolkit or disabled the performance monitoring
installation during your system setup, ensure that your system meets the following configuration
requirements:
• PMSensors and PMCollectors must be installed.
• CCR must be enabled on the cluster.
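
For example, commands similar to the following sketch can be used to verify these prerequisites. The
package and service names are assumptions about a typical installation:

rpm -q gpfs.gss.pmsensors gpfs.gss.pmcollector     # sensor and collector packages installed?
systemctl status pmsensors pmcollector             # sensor and collector services running?
mmlscluster | grep -i 'repository type'            # expected to report CCR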

Predefined and user-defined thresholds


Threshold monitoring consists of the following two types of thresholds:
• Predefined thresholds
• User-defined thresholds
You can use the mmhealth thresholds list command to review all the threshold rules defined for
a cluster. The predefined and the user-defined threshold rules can be removed by using the mmhealth
thresholds delete command.
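
For example, a user-defined rule can be removed by specifying its name. Here, myTest_memfree is the
rule that is created in the threshold use cases later in this chapter:

mmhealth thresholds delete myTest_memfree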

Predefined thresholds
In a cluster, the following three types of thresholds are predefined and enabled automatically:
• Thresholds monitoring the file system capacity usage
• Thresholds monitoring the memory usage
• Thresholds monitoring the number of SMB connections

A predefined threshold rule is deactivated in the following cases:
1. If no metric keys exist for the defined threshold rule in the performance monitoring tool metadata.
2. If the sensor corresponding to the defined threshold rule is not enabled.

Thresholds monitoring the file system capacity usage


The capacity available for the file systems depends on the fullness of the file system's fileset-inode
spaces and the capacity usage of each data or metadata pool. Therefore, the predefined capacity
threshold limit for a file system is broken down into the following threshold rules:
• Fileset-inode spaces that use the InodeCapUtil_Rule rule.
• Datapool capacity that uses the DataCapUtil_Rule rule.
• Metadata pool capacity that uses the MetaDataCapUtil_Rule rule.
The violation of any of these rules results in the parent file system receiving a capacity issue notification.
The outcome of the file system capacity rules evaluation is included in the health status report of
the FILESYSTEM component and can be reviewed by using the mmhealth node show filesystem
command. For capacity usage rules, the default warn level is set to 80%, and the error level to 90%.
Since the file system capacity related thresholds are not node-specific, they are displayed on the current
active threshold monitor node. For more information, see Use case 2: Observe the filesystem capacity
usage by using default threshold rules.

Thresholds monitoring the memory usage

MemFree_Rule
The MemFree_Rule is a predefined threshold rule that monitors the free memory usage. The
MemFree_Rule rule observes the memory-free usage on each cluster node and prevents the device
from becoming unresponsive when memory is no longer available.
The memory-free usage rule is evaluated for each node in the cluster. The evaluation status is included in
the node health status of each particular node. For memory usage rule, the warn level is set to 100 MB,
and the error level to 50 MB.
By default, the MemFree_Rule rule evaluates the estimated available memory in relation to the total
memory allocation. For more information, see the MemoryAvailable_percent measurement definition
in the mmhealth command section in IBM Storage Scale: Command and Programming Reference Guide.
For the new MemFree_Rule rule, only a WARNING threshold level is defined. The node is tagged with a
WARNING status if the Memfree_util value falls below 5%.
For the nodes that have greater than or equal to 40 GB of total memory allocation, the available memory
percentage is evaluated against a fixed value of 40 GB. This evaluation prevents the nodes that have more
than 2 GB free memory from sending warning messages.
Note: For IBM Storage Scale 5.0.4, the default MemFree_Rule is replaced automatically. The customer-
created rules remain unchanged.

AFMInQueue_Rule
The AFMInQueue_Rule is a predefined threshold rule that monitors the AFM gateway in-queue memory
usage. The AFMInQueue_Rule value must be set to 40-50% of the available memory on the gateway
node, which is considered to be a dedicated gateway node. If the value of the AFMInQueue_Rule rule is
not defined, then its default value is set to 8GiB.
The AFMInQueue_Rule warning level is set at 80% of the assigned memory, and the error level is set at
90% of the assigned memory. When either of these levels is reached or exceeded, an mmhealth event is
raised. The mmhealth event can be viewed in the IBM Storage Scale GUI or on the CLI by using the
mmhealth command.
If the mmhealth events are raised, then a user can take the following steps to resolve the issue:

• Check and fix any issues with the network connectivity and bandwidth throughput.
• Adjust the AFM afmHardMemThreshold configuration to be within 40 - 50% of the available memory
on the gateway node (see the example after this list).
Note: Ensure that the page pool setting on the AFM gateway is low to prevent a potential out-of-
memory state.
• Check whether the network throughput from all AFM gateway nodes is properly balanced.
• Add more AFM gateway nodes to help handle the workload.
For more information, see General recommendations for AFM gateway node configuration section in IBM
Storage Scale: Concepts, Planning, and Installation Guide.
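
As a sketch of the afmHardMemThreshold adjustment that is mentioned in the list above, a gateway node
with 64 GiB of memory could be set to a value in the 40 - 50% range with a command similar to the
following. The 28G value is an illustrative assumption:

mmchconfig afmHardMemThreshold=28G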

Thresholds monitoring the number of SMB connections


IBM Storage Scale can host a maximum of 3,000 SMB connections per protocol node and not more than
20,000 SMB connections across all protocol nodes. This threshold monitors the following information:
• Number of SMB connections on each protocol node by using the SMBConnPerNode_Rule rule
• Number of SMB connections across all protocol nodes by using the SMBConnTotal_Rule rule
SMBConnPerNode_Rule
The rule compares the count of SMB connections on each protocol node with the allowed maximum
and is evaluated for each protocol node in the cluster. The evaluation status is included in the node
health status of each protocol node entity.
SMBConnTotal_Rule
The rule monitors the sum of all SMB connections in the cluster and ensures that it does not exceed
20,000. The evaluation status is reported to the node that has the ACTIVE THRESHOLD MONITOR role.
For more information about SMB connection limitations, see the Planning for protocols and Planning for
SMB sections in the IBM Storage Scale: Concepts, Planning, and Installation Guide.

User-defined thresholds
You can create individual thresholds for all metrics that are collected through the performance monitoring
sensors. You can use the mmhealth thresholds add command to create a new threshold rule.
If multiple threshold rules have overlapping entities for the same metrics, then only one of the
concurrent rules is made actively eligible. All rules get a priority rank number. The highest possible
rank number is one. This rank is based on a metric's maximum number of filtering levels and the filter
granularity that is specified in the rule. As a result, a rule that monitors a specific entity or a set of entities
becomes high priority. This high-priority rule performs entity thresholds evaluation and status update for
a particular entity or a set of entities. This implies that a less specific rule, like the one that is valid for
all entities, is disabled for this particular entity or set of entities. For example, a threshold rule that is
applicable to a single file system takes precedence over a rule that is applicable to several or all the file
systems. For more information, see “Use case 4: Create threshold rules for specific filesets” on page 36.
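
As a sketch of such a high-priority rule, the following command creates a capacity rule that applies to a
single file system only. The rule name gpfs0_data_cap, the file system name gpfs0, and the 85/95 levels
are illustrative assumptions; the metric and the filter and group keys are taken from the predefined
DataCapUtil_Rule rule:

mmhealth thresholds add DataPool_capUtil --errorlevel 95.0 --warnlevel 85.0 --name gpfs0_data_cap --filterby gpfs_fs_name=gpfs0 --groupby gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name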

Active threshold monitor role


To evaluate the thresholds of each particular rule, the internal performance monitoring process retrieves
the latest metric values. The query is done by a node that is designated as a pmcollector. If the
environment has multiple pmcollectors, then the system designates one pmcollector to address the
performance monitoring queries. This pmcollector node is assigned the ACTIVE THRESHOLD MONITOR
role. All other pmcollector nodes are assigned the STANDBY role.
The ACTIVE THRESHOLD MONITOR role of the node is also used to assign the metric evaluation status
for the rules including aggregated results over multiple entities or having a cluster-wide metric in the rule
definition.
If the ACTIVE THRESHOLD MONITOR node loses connection or becomes unresponsive, one of the
STANDBY pmcollector nodes takes over the responsibilities of the ACTIVE THRESHOLD MONITOR. In such

cases, the status report for the cluster-wide threshold rules is redirected to the new ACTIVE THRESHOLD
MONITOR.
You can use the mmhealth thresholds list command to identify which of the pmcollector nodes is
designated the ACTIVE THRESHOLD MONITOR role at the current time.
The following output shows which pmcollector node is currently designated with the ACTIVE
THRESHOLD MONITOR role.

# mmhealth thresholds list

active_thresholds_monitor: g5160-12d.localnet.com

System health monitoring use cases


The following sections describe the use cases for the mmhealth command.
Use case 1: Checking the health status of the nodes and their corresponding services by using the
following commands:
1. To show the health status of the current node, issue the command:

mmhealth node show

The system displays output similar to this:

Node name: test_node


Node status: HEALTHY
Status Change: 39 min. ago

Component Status Status Change Reasons


-------------------------------------------------------------------
GPFS HEALTHY 39 min. ago -
NETWORK HEALTHY 40 min. ago -
FILESYSTEM HEALTHY 39 min. ago -
DISK HEALTHY 39 min. ago -
CES HEALTHY 39 min. ago -
PERFMON HEALTHY 40 min. ago -

2. To view the health status of a specific node, issue the command:

mmhealth node show -N test_node2

The system displays output similar to this:

Node name: test_node2


Node status: CHECKING
Status Change: Now

Component Status Status Change Reasons


-------------------------------------------------------------------
GPFS CHECKING Now -
NETWORK HEALTHY Now -
FILESYSTEM CHECKING Now -
DISK CHECKING Now -
CES CHECKING Now -
PERFMON HEALTHY Now -

3. To view the health status of all the nodes, issue the command:

mmhealth node show -N all

The system displays output similar to this:

Node name: test_node


Node status: DEGRADED

Component Status Status Change Reasons


-------------------------------------------------------------
GPFS HEALTHY Now -
CES FAILED Now smbd_down

FileSystem HEALTHY Now -

Node name: test_node2


Node status: HEALTHY

Component Status Status Change Reasons


------------------------------------------------------------
GPFS HEALTHY Now -
CES HEALTHY Now -
FileSystem HEALTHY Now -

4. To view the detailed health status of the component and its sub-component, issue the command:

mmhealth node show ces

The system displays output similar to this:

Node name: test_node

Component Status Status Change Reasons


-------------------------------------------------------------------
CES HEALTHY 2 min. ago -
AUTH DISABLED 2 min. ago -
AUTH_OBJ DISABLED 2 min. ago -
BLOCK DISABLED 2 min. ago -
CESNETWORK HEALTHY 2 min. ago -
NFS HEALTHY 2 min. ago -
OBJECT DISABLED 2 min. ago -
SMB HEALTHY 2 min. ago -

5. To view the health status of only unhealthy components, issue the command:

mmhealth node show --unhealthy

The system displays output similar to this:

Node name: test_node


Node status: FAILED
Status Change: 1 min. ago

Component Status Status Change Reasons


-------------------------------------------------------------------
GPFS FAILED 1 min. ago gpfs_down, quorum_down
FILESYSTEM DEPEND 1 min. ago unmounted_fs_check
CES DEPEND 1 min. ago ces_network_ips_down, nfs_in_grace

6. To view the health status of sub-components of a node's component, issue the command:

mmhealth node show --verbose

The system displays output similar to this:


Node name: gssio1-hs.gpfs.net
Node status: HEALTHY

Component Status Reasons


--------------------------------------------------------------------------
GPFS DEGRADED -
NETWORK HEALTHY -
bond0 HEALTHY -
ib0 HEALTHY -
ib1 HEALTHY -
FILESYSTEM DEGRADED stale_mount, stale_mount, stale_mount
Basic1 FAILED stale_mount
Basic2 FAILED stale_mount
Custom1 HEALTHY -
gpfs0 FAILED stale_mount
gpfs1 FAILED stale_mount
DISK DEGRADED disk_down
rg_gssio1_hs_Basic1_data_0 HEALTHY -
rg_gssio1_hs_Basic1_system_0 HEALTHY -
rg_gssio1_hs_Basic2_data_0 HEALTHY -
rg_gssio1_hs_Basic2_system_0 HEALTHY -
rg_gssio1_hs_Custom1_data1_0 HEALTHY -
rg_gssio1_hs_Custom1_system_0 DEGRADED disk_down
rg_gssio1_hs_Data_8M_2p_1_gpfs0 HEALTHY -
rg_gssio1_hs_Data_8M_3p_1_gpfs1 HEALTHY -
rg_gssio1_hs_MetaData_1M_3W_1_gpfs0 HEALTHY -
rg_gssio1_hs_MetaData_1M_4W_1_gpfs1 HEALTHY -
rg_gssio2_hs_Basic1_data_0 HEALTHY -
rg_gssio2_hs_Basic1_system_0 HEALTHY -
rg_gssio2_hs_Basic2_data_0 HEALTHY -
rg_gssio2_hs_Basic2_system_0 HEALTHY -
rg_gssio2_hs_Custom1_data1_0 HEALTHY -

rg_gssio2_hs_Custom1_system_0 HEALTHY -
rg_gssio2_hs_Data_8M_2p_1_gpfs0 HEALTHY -
rg_gssio2_hs_Data_8M_3p_1_gpfs1 HEALTHY -
rg_gssio2_hs_MetaData_1M_3W_1_gpfs0 HEALTHY -
rg_gssio2_hs_MetaData_1M_4W_1_gpfs1 HEALTHY -
NATIVE_RAID DEGRADED gnr_pdisk_replaceable, gnr_rg_failed,
enclosure_needsservice
ARRAY DEGRADED -
rg_gssio2-hs/DA1 HEALTHY -
rg_gssio2-hs/DA2 HEALTHY -
rg_gssio2-hs/NVR HEALTHY -
rg_gssio2-hs/SSD HEALTHY -
ENCLOSURE DEGRADED enclosure_needsservice
SV52122944 DEGRADED enclosure_needsservice
SV53058375 HEALTHY -
PHYSICALDISK DEGRADED gnr_pdisk_replaceable
rg_gssio2-hs/e1d1s01 FAILED gnr_pdisk_replaceable
rg_gssio2-hs/e1d1s07 HEALTHY -
rg_gssio2-hs/e1d1s08 HEALTHY -
rg_gssio2-hs/e1d1s09 HEALTHY -
rg_gssio2-hs/e1d1s10 HEALTHY -
rg_gssio2-hs/e1d1s11 HEALTHY -
rg_gssio2-hs/e1d1s12 HEALTHY -
rg_gssio2-hs/e1d2s07 HEALTHY -
rg_gssio2-hs/e1d2s08 HEALTHY -
rg_gssio2-hs/e1d2s09 HEALTHY -
rg_gssio2-hs/e1d2s10 HEALTHY -
rg_gssio2-hs/e1d2s11 HEALTHY -
rg_gssio2-hs/e1d2s12 HEALTHY -
rg_gssio2-hs/e1d3s07 HEALTHY -
rg_gssio2-hs/e1d3s08 HEALTHY -
rg_gssio2-hs/e1d3s09 HEALTHY -
rg_gssio2-hs/e1d3s10 HEALTHY -
rg_gssio2-hs/e1d3s11 HEALTHY -
rg_gssio2-hs/e1d3s12 HEALTHY -
rg_gssio2-hs/e1d4s07 HEALTHY -
rg_gssio2-hs/e1d4s08 HEALTHY -
rg_gssio2-hs/e1d4s09 HEALTHY -
rg_gssio2-hs/e1d4s10 HEALTHY -
rg_gssio2-hs/e1d4s11 HEALTHY -
rg_gssio2-hs/e1d4s12 HEALTHY -
rg_gssio2-hs/e1d5s07 HEALTHY -
rg_gssio2-hs/e1d5s08 HEALTHY -
rg_gssio2-hs/e1d5s09 HEALTHY -
rg_gssio2-hs/e1d5s10 HEALTHY -
rg_gssio2-hs/e1d5s11 HEALTHY -
rg_gssio2-hs/e2d1s07 HEALTHY -
rg_gssio2-hs/e2d1s08 HEALTHY -
rg_gssio2-hs/e2d1s09 HEALTHY -
rg_gssio2-hs/e2d1s10 HEALTHY -
rg_gssio2-hs/e2d1s11 HEALTHY -
rg_gssio2-hs/e2d1s12 HEALTHY -
rg_gssio2-hs/e2d2s07 HEALTHY -
rg_gssio2-hs/e2d2s08 HEALTHY -
rg_gssio2-hs/e2d2s09 HEALTHY -
rg_gssio2-hs/e2d2s10 HEALTHY -
rg_gssio2-hs/e2d2s11 HEALTHY -
rg_gssio2-hs/e2d2s12 HEALTHY -
rg_gssio2-hs/e2d3s07 HEALTHY -
rg_gssio2-hs/e2d3s08 HEALTHY -
rg_gssio2-hs/e2d3s09 HEALTHY -
rg_gssio2-hs/e2d3s10 HEALTHY -
rg_gssio2-hs/e2d3s11 HEALTHY -
rg_gssio2-hs/e2d3s12 HEALTHY -
rg_gssio2-hs/e2d4s07 HEALTHY -
rg_gssio2-hs/e2d4s08 HEALTHY -
rg_gssio2-hs/e2d4s09 HEALTHY -
rg_gssio2-hs/e2d4s10 HEALTHY -
rg_gssio2-hs/e2d4s11 HEALTHY -
rg_gssio2-hs/e2d4s12 HEALTHY -
rg_gssio2-hs/e2d5s07 HEALTHY -
rg_gssio2-hs/e2d5s08 HEALTHY -
rg_gssio2-hs/e2d5s09 HEALTHY -
rg_gssio2-hs/e2d5s10 HEALTHY -
rg_gssio2-hs/e2d5s11 HEALTHY -
rg_gssio2-hs/e2d5s12ssd HEALTHY -
rg_gssio2-hs/n1s02 HEALTHY -
rg_gssio2-hs/n2s02 HEALTHY -
RECOVERYGROUP DEGRADED gnr_rg_failed
rg_gssio1-hs FAILED gnr_rg_failed
rg_gssio2-hs HEALTHY -
VIRTUALDISK DEGRADED -
rg_gssio2_hs_Basic1_data_0 HEALTHY -
rg_gssio2_hs_Basic1_system_0 HEALTHY -
rg_gssio2_hs_Basic2_data_0 HEALTHY -
rg_gssio2_hs_Basic2_system_0 HEALTHY -
rg_gssio2_hs_Custom1_data1_0 HEALTHY -
rg_gssio2_hs_Custom1_system_0 HEALTHY -
rg_gssio2_hs_Data_8M_2p_1_gpfs0 HEALTHY -
rg_gssio2_hs_Data_8M_3p_1_gpfs1 HEALTHY -
rg_gssio2_hs_MetaData_1M_3W_1_gpfs0 HEALTHY -
rg_gssio2_hs_MetaData_1M_4W_1_gpfs1 HEALTHY -
rg_gssio2_hs_loghome HEALTHY -
rg_gssio2_hs_logtip HEALTHY -
rg_gssio2_hs_logtipbackup HEALTHY -
PERFMON HEALTHY -

7. To view the eventlog history of the node for the last hour, issue the command:

mmhealth node eventlog --hour

The system displays output similar to this:

Node name: test-21.localnet.com
Timestamp Event Name Severity Details
2016-10-28 06:59:34.045980 CEST monitor_started INFO The IBM Storage Scale monitoring
service has been started
2016-10-28 07:01:21.919943 CEST fs_remount_mount INFO The filesystem objfs was mounted internal
2016-10-28 07:01:32.434703 CEST disk_found INFO The disk disk1 was found
2016-10-28 07:01:32.669125 CEST disk_found INFO The disk disk8 was found
2016-10-28 07:01:36.975902 CEST filesystem_found INFO Filesystem objfs was found
2016-10-28 07:01:37.226157 CEST unmounted_fs_check WARNING The filesystem objfs is probably needed,
but not mounted
2016-10-28 07:01:52.113691 CEST mounted_fs_check INFO The filesystem objfs is mounted
2016-10-28 07:01:52.283545 CEST fs_remount_mount INFO The filesystem objfs was mounted normal
2016-10-28 07:02:07.026093 CEST mounted_fs_check INFO The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST ces_network_ips_down WARNING No CES relevant NICs detected
2016-10-28 07:15:07.702351 CEST nodestatechange_info INFO A CES node state change:
Node 1 add startup flag
2016-10-28 07:15:37.322997 CEST nodestatechange_info INFO A CES node state change:
Node 1 remove startup flag
2016-10-28 07:15:43.741149 CEST ces_network_ips_up INFO CES-relevant IPs are served by found NICs
2016-10-28 07:15:44.028031 CEST ces_network_vanished INFO CES NIC eth0 has vanished

8. To view the verbose eventlog history of the node for the last hour, issue the command:

mmhealth node eventlog --hour --verbose

The system displays output similar to this:

Node name: test-21.localnet.com


Timestamp Component Event Name Event ID Severity Details
2016-10-28 06:59:34.045980 CEST gpfs monitor_started 999726 INFO The IBM Storage Scale monitoring service has been started
2016-10-28 07:01:21.919943 CEST filesystem fs_remount_mount 999306 INFO The filesystem objfs was mounted internal
2016-10-28 07:01:32.434703 CEST disk disk_found 999424 INFO The disk disk1 was found
2016-10-28 07:01:32.669125 CEST disk disk_found 999424 INFO The disk disk8 was found
2016-10-28 07:01:36.975902 CEST filesystem filesystem_found 999299 INFO Filesystem objfs was found
2016-10-28 07:01:37.226157 CEST filesystem unmounted_fs_check 999298 WARNING The filesystem objfs is probably needed, but not mounted
2016-10-28 07:01:52.113691 CEST filesystem mounted_fs_check 999301 INFO The filesystem objfs is mounted
2016-10-28 07:01:52.283545 CEST filesystem fs_remount_mount 999306 INFO The filesystem objfs was mounted normal
2016-10-28 07:02:07.026093 CEST filesystem mounted_fs_check 999301 INFO The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST cesnetwork ces_network_ips_down 999426 WARNING No CES relevant NICs detected
2016-10-28 07:15:07.702351 CEST gpfs nodestatechange_info 999220 INFO A CES node state change: Node 1 add startup flag
2016-10-28 07:15:37.322997 CEST gpfs nodestatechange_info 999220 INFO A CES node state change: Node 1 remove startup flag
2016-10-28 07:15:43.741149 CEST cesnetwork ces_network_ips_up 999427 INFO CES-relevant IPs are served by found NICs
2016-10-28 07:15:44.028031 CEST cesnetwork ces_network_vanished 999434 INFO CES NIC eth0 has vanished

9. To view the detailed description of an event, issue the mmhealth event show command. This is an
example for quorum_down event:

mmhealth event show quorum_down

The system displays output similar to this:

Event Name: quorum_down

...

10. To view the health status of the cluster, issue the command:

mmhealth cluster show

The system displays output similar to this:

Component Total Failed Degraded Healthy Other


-----------------------------------------------------------------
NODE 50 1 1 48 -
GPFS 50 1 - 49 -
NETWORK 50 - - 50 -
FILESYSTEM 3 - - 3 -
DISK 50 - - 50 -
CES 5 - 5 - -
CLOUDGATEWAY 2 - - 2 -
PERFMON 48 - 5 43 -

Note: The cluster must have a minimum release level of 4.2.2.0 or higher to use the mmhealth
cluster show command. Also, this command is not supported on the Windows operating system.
11. To view more information about the cluster health status, issue the command:

mmhealth cluster show --verbose

The system displays output similar to this:

Component Total Failed Degraded Healthy Other


-----------------------------------------------------------------
NODE 50 1 1 48 -
GPFS 50 1 - 49 -
NETWORK 50 - - 50 -
FILESYSTEM
FS1 15 - - 15 -
FS2 5 - - 5 -
FS3 20 - - 20 -
DISK 50 - - 50 -
CES 5 - 5 - -
AUTH 5 - - - 5
AUTH_OBJ 5 5 - - -
BLOCK 5 - - - 5
CESNETWORK 5 - - 5 -
NFS 5 - - 5 -
OBJECT 5 - - 5 -
SMB 5 - - 5 -
CLOUDGATEWAY 2 - - 2 -
PERFMON 48 - 5 43 -

12. To view the state of the file system, issue the command:
mmhealth node show filesystem -v
Node name: ibmnode1.ibm.com
Component Status Status Change Reasons
--------------------------------------------------------------------------------------------------------
FILESYSTEM HEALTHY 2019-01-30 14:32:24 fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
gpfs0 SUSPENDED 2019-01-30 14:32:22 fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
objfs HEALTHY 2019-01-30 14:32:22 -

You can see an output similar to the following example:

Event Parameter Severity Active Since Event Message


-------------------------------------------------------------------------------------------------------
fs_maintenance_mode gpfs0 INFO 2019-01-30 14:32:20 Filesystem gpfs0 is set to
maintenance mode.
unmounted_fs_check gpfs0 WARNING 2019-01-30 14:32:21 The filesystem gpfs0 is
probably needed, but not mounted
fs_working_mode objfs INFO 2019-01-30 14:32:21 Filesystem objfs is
not in maintenance mode.
mounted_fs_check objfs INFO 2019-01-30 14:32:21 The filesystem objfs is mounted

Threshold monitoring use cases


The following sections describe the threshold use cases for the mmhealth command:

Use case 1: Create a threshold rule and use the mmhealth command to
observe the change in the HEALTH status
This section describes the threshold use case to create a threshold rule and use the mmhealth
commands to observe the change in the HEALTH status.
1. To monitor the memory_free usage on each node, create a new thresholds rule with the following
settings:

# mmhealth thresholds add mem_memfree --errorlevel 1000000 --warnlevel 1500000 --name myTest_memfree --groupby node

The system displays output similar to the following:

New rule 'myTest_memfree' is created. The monitor process is activated

2. To view the list of all threshold rules defined for the system, run the following command:

mmhealth thresholds list

The system displays output similar to the following:


### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy sensitivity

------------------------------------------------------------------------------------------------------------------------------
myTest_memfree mem_memfree 1000000 1500000 None node 300
InodeCapUtil_Rule Fileset_inode 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,gpfs_fset_name 300
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,gpfs_diskpool_name 300
MemFree_Rule mem_memfree 50000 100000 low node 300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,gpfs_diskpool_name 300

3. To show the THRESHOLD status of the current node, run the following command:

# mmhealth node show THRESHOLD

The system displays output similar to the following:

Component Status Status Change Reasons


-----------------------------------------------------------
THRESHOLD HEALTHY 13 hours ago -
MemFree_Rule HEALTHY 13 hours ago -
myTest_memfree HEALTHY 10 min ago -

4. To view the event log history of the node, run the following command on each node:
# mmhealth node eventlog
2017-03-17 11:52:33.063550 CET thresholds_error ERROR The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-14.novalocal exceeded
threshold error level 1000000 defined in
myTest_memfree.

# mmhealth node eventlog


2017-03-17 11:52:32.594932 CET thresholds_warn WARNING The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-11.novalocal exceeded
threshold warning level 1500000 defined in
myTest_memfree.
2017-03-17 12:00:31.653163 CET thresholds_normal INFO The value of mem_memfree defined in
myTest_memfree
for component myTest_memfree/
gpfsgui-11.novalocal
reached a normal level.

# mmhealth node eventlog


2017-03-17 11:52:35.389392 CET thresholds_error ERROR The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-13.novalocal exceeded
threshold error level 1000000 defined in
myTest_memfree.

5. You can view the actual metric values and compare with the rule boundaries by running the metric
query against the pmcollector node. The following example shows the mem_memfree metric
query command and metric values for each node in the output:

# date; echo "get metrics mem_memfree -x -r last 10 bucket_size 300 " | /opt/IBM/zimon/zc gpfsgui-11

The system displays output similar to the following:


Fri Mar 17 12:09:00 CET 2017
1: gpfsgui-11.novalocal|Memory|mem_memfree
2: gpfsgui-12.novalocal|Memory|mem_memfree
3: gpfsgui-13.novalocal|Memory|mem_memfree
4: gpfsgui-14.novalocal|Memory|mem_memfree
Row Timestamp mem_memfree mem_memfree mem_memfree mem_memfree
1 2017-03-17 11:20:00 1558888 1598442 717029 768610
2 2017-03-17 11:25:00 1555256 1598596 717328 768207
3 2017-03-17 11:30:00 1554707 1597399 715988 767737
4 2017-03-17 11:35:00 1554945 1598114 715664 768056
5 2017-03-17 11:40:00 1553744 1597234 715559 766245
6 2017-03-17 11:45:00 1552876 1596891 715369 767282
7 2017-03-17 11:50:00 1450204 1596364 714640 766594
8 2017-03-17 11:55:00 1389649 1595493 714228 764839
9 2017-03-17 12:00:00 1549598 1594154 713059 765411
10 2017-03-17 12:05:00 1547029 1590308 706375 766655
...

6. To view the THRESHOLD status of all the nodes, run the following command:

# mmhealth cluster show THRESHOLD

The system displays output similar to the following:


Component Node Status Reasons
------------------------------------------------------------------------------------------
THRESHOLD gpfsgui-11.novalocal HEALTHY -
THRESHOLD gpfsgui-13.novalocal FAILED thresholds_error

THRESHOLD gpfsgui-12.novalocal HEALTHY -
THRESHOLD gpfsgui-14.novalocal FAILED thresholds_error

7. To view the details of the raised event, run the following command:

# mmhealth event show thresholds_error

The system displays output similar to this:


Event Name: thresholds_error
Description: The thresholds value reached an error level.
Cause: The thresholds value reached an error level.
User Action: N/A
Severity: ERROR
State: FAILED

8. To get an overview about the node that is reporting an unhealthy status, check the event log for this
node by running the following command:

# mmhealth node eventlog

The system displays output similar to the following:


...
2017-03-17 11:50:23.252419 CET move_cesip_from INFO Address 192.168.0.158 was moved from this node
to node 0
2017-03-17 11:50:23.400872 CET thresholds_warn WARNING The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-13.novalocal exceeded
threshold warning level 1500000 defined in
myTest_memfree.
2017-03-17 11:50:26.090570 CET mounted_fs_check INFO The filesystem fs2 is mounted
2017-03-17 11:50:26.304381 CET mounted_fs_check INFO The filesystem gpfs0 is mounted
2017-03-17 11:50:26.428079 CET fs_remount_mount INFO The filesystem gpfs0 was mounted normal
2017-03-17 11:50:27.449704 CET quorum_up INFO Quorum achieved
2017-03-17 11:50:28.283431 CET mounted_fs_check INFO The filesystem gpfs0 is mounted
2017-03-17 11:52:32.591514 CET mounted_fs_check INFO The filesystem objfs is mounted
2017-03-17 11:52:32.685953 CET fs_remount_mount INFO The filesystem objfs was mounted normal
2017-03-17 11:52:32.870778 CET fs_remount_mount INFO The filesystem fs1 was mounted normal
2017-03-17 11:52:35.752707 CET mounted_fs_check INFO The filesystem fs1 is mounted
2017-03-17 11:52:35.931688 CET mounted_fs_check INFO The filesystem objfs is mounted
2017-03-17 12:00:36.390594 CET service_disabled INFO The service auth is disabled
2017-03-17 12:00:36.673544 CET service_disabled INFO The service block is disabled
2017-03-17 12:00:39.636839 CET postgresql_failed ERROR postgresql-obj process should be started but
is stopped

2017-03-16 12:01:21.389392 CET thresholds_error ERROR The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-13.novalocal exceeded
threshold error level 1000000 defined in
myTest_memfree.

9. To check the last THRESHOLD event update for this node, run the following command:

# mmhealth node show THRESHOLD

The system displays output similar to the following:


Node name: gpfsgui-13.novalocal

Component Status Status Change Reasons


--------------------------------------------------------------------------------------------------------
THRESHOLD FAILED 15 minutes ago thresholds_error(myTest_memfree/gpfsgui-13.novalocal)
myTest_memfree FAILED 15 minutes ago thresholds_error

Event Parameter Severity Active Since Event Message


----------------------------------------------------------------------------------------------------------------------------
thresholds_error myTest_memfree ERROR 15 minutes ago The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-13.novalocal exceeded
threshold error level 1000000 defined in myTest_memfree.

10. To review the status of all services for this node, run the following command:

# mmhealth node show

The system displays output similar to the following:


Node name: gpfsgui-13.novalocal
Node status: TIPS
Status Change: 15 hours ago

Component Status Status Change Reasons


----------------------------------------------------------------------------------------------------------------------
GPFS TIPS 15 hours ago gpfs_maxfilestocache_small, gpfs_maxstatcache_high, gpfs_pagepool_small
NETWORK HEALTHY 15 hours ago -
FILESYSTEM HEALTHY 15 hours ago -
DISK HEALTHY 15 hours ago -
CES TIPS 15 hours ago nfs_sensors_inactive
PERFMON HEALTHY 15 hours ago -

THRESHOLD FAILED 15 minutes ago thresholds_error(myTest_memfree/gpfsgui-13.novalocal)
[root@gpfsgui-13 ~]#

Use case 2: Observe the file system capacity usage by using default threshold
rules
This use case demonstrates the use of the mmhealth thresholds list command for monitoring a file
system capacity event by using default threshold rules.
Because the file system capacity-related thresholds such as DataCapUtil_Rule,
MetaDataCapUtil_Rule, and InodeCapUtil_Rule are not node-specific, these thresholds are
reported on the node that has the active threshold monitor role.
1. Issue the following command to view the node that has the active threshold monitor role
and the predefined threshold rules (DataCapUtil_Rule, MetaDataCapUtil_Rule, and
InodeCapUtil_Rule) that are enabled in a cluster.

mmhealth thresholds list

The preceding command shows output similar to the following:
active_thresholds_monitor: scale-12.vmlocal

### Threshold Rules ###


rule_name metric error warn direction filterBy
groupBy sensitivity
----------------------------------------------------------------------------------------------------------
------------------------------------------
MemFree_Rule MemoryAvailable_percent None 5.0 low
node 300-min
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high
gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name 300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high
gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name 300
InodeCapUtil_Rule Fileset_inode 90.0 80.0 high
gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name 300
SMBConnPerNode_Rule current_connections 3000 None high
node 300
SMBConnTotal_Rule current_connections 20000 None
high 300
AFMInQueue_Rule AFMInQueueMemory_percent 90.0 80.0 high
node 300

2. Use ssh to switch to the node that has the active threshold monitor role:

[root@scale-12 ~]# ssh scale-12.vmlocal

3. Issue the following command to review file system events:

[root@scale-12 ~]# mmhealth node show filesystem -v

The preceding command gives output similar to the following:

Node name: scale-11.vmlocal

Component Status Status Change Reasons & Notices


--------------------------------------------------------------------------
FILESYSTEM HEALTHY 2022-12-07 21:12:03 -
cesSharedRoot HEALTHY 2022-12-07 10:38:55 -
localFS DEGRADED 2022-12-07 21:12:03 pool-metadata_high_warn
remote-fs HEALTHY 2022-12-15 14:23:24 -

Event Parameter Severity Active Since Event Message


----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------
...

inode_normal cesSharedRoot INFO 2022-12-07 10:39:25 The inode usage of


fileset root in file system cesSharedRoot reached a normal level.
inode_normal localFS INFO 2022-12-07 21:34:03 The inode usage of
fileset myFset1 in file system localFS reached a normal level.
inode_normal localFS INFO 2022-12-07 21:34:03 The inode usage of
fileset root in file system localFS reached a normal level.
...

pool-data_normal cesSharedRoot INFO 2022-12-07 10:38:55 The pool data of file


system cesSharedRoot has reached a normal data level.

pool-data_normal cesSharedRoot INFO 2022-12-07 10:38:55 The pool system of
file system cesSharedRoot has reached a normal data level.
pool-data_normal localFS INFO 2022-12-07 21:34:03 The pool system of
file system localFS has reached a normal data level.
pool-metadata_normal cesSharedRoot INFO 2022-12-07 10:38:55 The pool data of file
system cesSharedRoot has reached a normal metadata level.
pool-metadata_normal cesSharedRoot INFO 2022-12-07 10:38:55 The pool system of
file system cesSharedRoot has reached a normal metadata level.
pool-metadata_high_warn localFS WARNING 2022-12-07 21:34:03 The pool system of
file system localFS has reached a warning level for metadata. 80.0

As you can see in the preceding file system example output, everything looks correct except the
"pool-metadata_high_warn" event.
4. Issue the following command to get the "pool-metadata_high_warn" warning details:

[root@scale-12 ~]# mmhealth event show pool-metadata_high_warn

The preceding command shows the warning details similar to the following:

Event Name: pool-metadata_high_warn


Description: The pool has reached a warning level.
Cause: The pool has reached a warning level.
User Action: Add more capacity to pool or move to different pool or delete data
and/or snapshots.
Severity: WARNING
State: DEGRADED

Tip: See File system events to get complete list of all the possible file system events.
5. Compare the metadata capacity values that are reported by the MetaDataCapUtil_Rule rule for the
system pool of the localFS file system with the mmlspool command output.

[root@scale-11 ~]# mmlspool localFS

The preceding command shows the storage pools in the file system at '/gpfs/localFS' similar to the
following:

Name Id BlkSize Data Meta Total Data in (KB) Free Data in (KB) Total Meta in
(KB) Free Meta in (KB)
system 0 4 MB yes yes 16777216 13320192 ( 79%)
16777216 2515582 ( 15%)

In the preceding output, you can see that the system pool has only 15% available space for
metadata.
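
To cross-check this value, you can also review the capacity of the system pool with the mmdf command,
as in the following sketch (localFS is the file system from this example; the -P option limits the output
to a single pool):

[root@scale-11 ~]# mmdf localFS -P system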

Use case 3: Observe the health status changes for a particular component
based on the specified threshold rules
This use case shows the usage of the mmhealth command to observe the health status changes for a
particular node based on the specified threshold rules.
Run the following command to view the threshold rules that are predefined and enabled automatically in
a cluster:

[root@rhel77-11 ~]# mmhealth thresholds list

The system displays output similar to the following:


active_thresholds_monitor: RHEL77-11.novalocal
### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy sensitivity
---------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule Fileset_inode 90.0 80.0 high gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name 300
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name 300
MemFree_Rule MemoryAvailable_percent None 5.0 low node 300-min
SMBConnPerNode_Rule current_connections 3000 None high node 300
SMBConnTotal_Rule current_connections 20000 None high 300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name 300

The default MemFree_Rule rule monitors the estimated available memory in relation to the total memory
allocation on all cluster nodes. A WARNING event is sent for a node if the MemoryAvailable_percent
value falls below 5% for that node. Run the following command to review the details of the rule
settings:
[root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep MemFree_Rule

The system displays output similar to the following:

mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:MemFree_Rule:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:rule_name:MemFree_Rule:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_warn:The estimated available memory is less than 5%, calculated to the total RAM or 40
GB, whichever is lower.:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_error::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:priority:2:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:downsamplOp:min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:bucket_size:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:computation:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:duration:n/a:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:filterBy::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:error:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:warn:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:direction:low:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:hysteresis:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:sensitivity:300-min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:state:active:

Note: The MemFree_Rule rule has the same evaluation priority for all nodes.
Run the following command on a node to view the health state of all the threshold rules that are defined
for that node:

[root@rhel77-11 ~]# mmhealth node show threshold -v

The system displays output similar to the following:


Node name: RHEL77-11.novalocal

Component Status Status Change Reasons


------------------------------------------------------------------------
THRESHOLD HEALTHY 2020-04-27 12:07:07 -
MemFree_Rule HEALTHY 2020-04-27 12:07:22 -
active_thresh_monitor HEALTHY 2020-04-27 12:07:22 -

Event Parameter Severity Active Since Event Message


------------------------------------------------------------------------------------------------------------------------------
thresholds_normal MemFree_Rule INFO 2020-04-27 12:07:22 The value of MemoryAvailable_percent defined in
MemFree_Rule for component
MemFree_Rule/rhel77-11.novalocal
reached a normal level.

In a production environment, in certain cases, the memory availability observation settings need to be
defined for a particular host separately. Follow these steps to set the memory availability for a particular
node:
1. Run the following command to create a new rule, node11_mem_available, to set the
MemoryAvailable_percent threshold value for the node RHEL77-11.novalocal:

[root@rhel77-11 ~]# mmhealth thresholds add MemoryAvailable_percent --filterby node=rhel77-11.novalocal --errorlevel 5.0 --warnlevel 50.0 --name node11_mem_available

The system displays output similar to the following:

New rule 'node11_mem_available' is created.

2. Run the following command to view all the defined rules on a cluster:

[root@rhel77-11 ~]# mmhealth thresholds list

The system displays output similar to the following:


active_thresholds_monitor: RHEL77-11.novalocal
### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy sensitivity
-----------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule Fileset_inode 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_fset_name 300
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_diskpool_name 300
MemFree_Rule MemoryAvailable_percent None 5.0 low node 300-min
SMBConnPerNode_Rule current_connections 3000 None high node 300
node11_mem_available MemoryAvailable_percent 5.0 50.0 None node=rhel77-11.novalocal node 300
SMBConnTotal_Rule current_connections 20000 None high 300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name,

gpfs_fs_name,
gpfs_diskpool_name 300

Note:
The node11_mem_available rule has the priority one for the RHEL77-11.novalocal node:
[root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep node11_mem_available
mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:node11_mem_available:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:rule_name:node11_mem_available:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_warn:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_error:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:priority:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:downsamplOp:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:bucket_size:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:computation:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:duration:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:filterBy:node=rhel77-11.novalocal:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:error:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:warn:50.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:direction:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:hysteresis:0.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:sensitivity:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:state:active:

All the MemFree_Rule events are removed for RHEL77-11.novalocal, since the
node11_mem_available rule has the higher priority for this node:
[root@rhel77-11 ~]# mmhealth node show threshold -v

Node name: RHEL77-11.novalocal

Component Status Status Change Reasons


-------------------------------------------------------------------------------
THRESHOLD HEALTHY 2020-04-27 12:07:07 -
active_thresh_monitor HEALTHY 2020-04-27 12:07:22 -
node11_mem_available HEALTHY 2020-05-13 10:10:16 -

Event Parameter Severity Active Since Event Message


--------------------------------------------------------------------------------------------------------------------------------------
thresholds_normal node11_mem_available INFO 2020-05-13 10:10:16 The value of MemoryAvailable_percent defined
in node11_mem_available for component
node11_mem_available/rhel77-11.novalocal
reached a normal level.
thresholds_removed MemFree_Rule INFO 2020-05-13 10:06:15 The value of MemoryAvailable_percent defined
for the component(s)
MemFree_Rule/rhel77-11.novalocal
defined in MemFree_Rule was removed.

Because the warning boundary of the node11_mem_available rule is higher than that of the
MemFree_Rule rule, the WARNING event might appear sooner than before for this node.
[root@rhel77-11 ~]# mmhealth node show threshold -v

Node name: RHEL77-11.novalocal

Component Status Status Change Reasons


-------------------------------------------------------------------------------------------------------------------
THRESHOLD DEGRADED 2020-05-13 12:50:24 thresholds_warn(node11_mem_available)
active_thresh_monitor HEALTHY 2020-04-27 12:07:22 -
node11_mem_available DEGRADED 2020-05-13 12:50:24 thresholds_warn(node11_mem_available)

Event Parameter Severity Active Since Event Message


------------------------------------------------------------------------------------------------------------------------------------------------
thresholds_warn node11_mem_available WARNING 2020-05-13 12:50:23 The value of MemoryAvailable_percent for the component(s)
node11_mem_available/rhel77-11.novalocal exceeded
threshold warning level 50.0 defined
in node11_mem_available.
thresholds_removed MemFree_Rule INFO 2020-05-13 10:06:15 The value of MemoryAvailable_percent for the component(s)
MemFree_Rule/rhel77-11.novalocal defined
in MemFree_Rule was removed.

3. Run the following command to get the WARNING event details:

[root@rhel77-11 ~]# mmhealth event show thresholds_warn

The system displays output similar to the following:


Event Name: thresholds_warn
Description: The thresholds value reached a warning level.
Cause: The thresholds value reached a warning level.
User Action: Run 'mmhealth thresholds list -v' commmand
and review the user action recommendations
for the corresponding thresholds rule.
Severity: WARNING
State: DEGRADED

You can also review the event history by viewing the whole event log as shown:
[root@rhel77-11 ~]# mmhealth node eventlog
Node name: RHEL77-11.novalocal
Timestamp Event Name Severity Details
2020-04-27 11:59:06.532239 CEST monitor_started INFO The IBM Storage Scale monitoring service has been started
2020-04-27 11:59:07.410614 CEST service_running INFO The service clusterstate is running on node RHEL77-11.novalocal
2020-04-27 11:59:07.784565 CEST service_running INFO The service network is running on node RHEL77-11.novalocal
2020-04-27 11:59:09.965934 CEST gpfs_down ERROR The Storage Scale service process not running on this node.
Normal operation cannot be done
2020-04-27 11:59:10.102891 CEST quorum_down ERROR The node is not able to reach enough quorum nodes/disks to work properly.
2020-04-27 11:59:10.329689 CEST service_running INFO The service gpfs is running on node RHEL77-11.novalocal
2020-04-27 11:59:38.399120 CEST gpfs_up INFO The Storage Scale service process is running
2020-04-27 11:59:38.498718 CEST callhome_not_enabled TIP Callhome is not installed, configured or enabled.
2020-04-27 11:59:38.511969 CEST gpfs_pagepool_small TIP The GPFS pagepool is smaller than or equal to 1G.
2020-04-27 11:59:38.526075 CEST csm_resync_forced INFO All events and state will be transferred to the cluster manager
2020-04-27 12:01:07.486549 CEST quorum_up INFO Quorum achieved
2020-04-27 12:01:41.906686 CEST service_running INFO The service disk is running on node RHEL77-11.novalocal
2020-04-27 12:02:22.319159 CEST fs_remount_mount INFO The filesystem gpfs01 was mounted internal
2020-04-27 12:02:22.322987 CEST disk_found INFO The disk nsd_1 was found
2020-04-27 12:02:22.337810 CEST fs_remount_mount INFO The filesystem gpfs01 was mounted normal
2020-04-27 12:02:22.369814 CEST mounted_fs_check INFO The filesystem gpfs01 is mounted
2020-04-27 12:02:22.443717 CEST service_running INFO The service filesystem is running on node RHEL77-11.novalocal
2020-04-27 12:04:43.842571 CEST service_running INFO The service threshold is running on node RHEL77-11.novalocal
2020-04-27 12:04:55.168176 CEST service_running INFO The service perfmon is running on node RHEL77-11.novalocal
2020-04-27 12:07:07.657284 CEST service_running INFO The service threshold is running on node RHEL77-11.novalocal
2020-04-27 12:07:22.609728 CEST thresh_monitor_set_active INFO The thresholds monitoring process is running in ACTIVE state
on the local node
2020-04-27 12:07:22.626369 CEST thresholds_new_rule INFO Rule MemFree_Rule was added
2020-04-27 12:09:08.275073 CEST local_fs_normal INFO The local file system with the mount point / used for /tmp/mmfs
reached a normal level with more than 1000 MB free space.
2020-04-27 12:14:07.997867 CEST singleton_sensor_on INFO The singleton sensors of pmsensors are turned on
2020-05-08 11:03:45.324399 CEST local_fs_path_not_found INFO The configured dataStructureDump path /tmp/mmfs does not exists.
Skipping monitoring.
2020-05-13 10:06:15.912457 CEST thresholds_removed INFO The value of MemoryAvailable_percent for the component(s)
MemFree_Rule/rhel77-11.novalocal defined in MemFree_Rule was removed.
2020-05-13 10:10:16.173478 CEST thresholds_new_rule INFO Rule node11_mem_available was added
2020-05-13 12:50:23.955531 CEST thresholds_warn WARNING The value of MemoryAvailable_percent for the component(s)
node11_mem_available/rhel77-11.novalocal exceeded threshold
warning level 50.0 defined in node11_mem_available.
2020-05-13 13:14:12.836070 CEST out_of_memory WARNING Detected Out of memory killer conditions in system log

Use case 4: Create threshold rules for specific filesets


This section describes the threshold use case to create threshold rules for specific filesets.
Consider that the initial cluster file system and filesets have the following settings:
[root@gpfsgui-11 ~]# mmlsfileset nfs_shareFS
Filesets in file system 'nfs_shareFS':
Name Status Path
root Linked /gpfs/nfs_shareFS
nfs_shareFILESET Linked /gpfs/nfs_shareFS/nfs_shareFILESET
test_share Linked /gpfs/nfs_shareFS/test_share

Per default, the following rules are defined and enabled in the cluster:
[root@gpfsgui-11 ~]# mmhealth thresholds list
### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy sensitivity
---------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule Fileset_inode 90 80 None gpfs_cluster_name,
gpfs_fs_name,
gpfs_fset_name 300
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_diskpool_name 300
MemFree_Rule mem_memfree 50000 100000 low node 300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_diskpool_name 300

1. Create new inode capacity usage rules for the specific filesets.
a. To create a threshold rule for all filesets in an individual file system, use the following command:

[root@gpfsgui-11 ~]# mmhealth thresholds add Fileset_inode --errorlevel 85.0


--warnlevel 75.0 --direction high --filterby 'gpfs_fs_name=nfs_shareFS'
--groupby gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name
--sensitivity 300 --hysteresis 5.0 --name rule_ForAllFsets_inFS

New rule 'rule_ForAllFsets_inFS' is created. The monitor process is activated

b. To create a threshold rule for individual fileset, use the following command:

[root@gpfsgui-11 ~]# mmhealth thresholds add Fileset_inode --errorlevel 70.0


--warnlevel 60.0 --direction high --filterby 'gpfs_fs_name=nfs_shareFS,
gpfs_fset_name=nfs_shareFILESET'
--groupby gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name --sensitivity 300
--hysteresis 5.0 --name rule_SingleFset_inFS

New rule 'rule_SingleFset_inFS' is created. The monitor process is activated

Note: In this case, for the nfs_shareFILESET fileset, you specify both the file system name and the
fileset name in the filter.
After the new rules are added, the mmhealth thresholds list command gives output similar to the following:

# mmhealth thresholds list


### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy sensitivity
----------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule Fileset_inode 90 80 None gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_fset_name
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_diskpool_name
MemFree_Rule mem_memfree 50000 100000 low node 300
rule_SingleFset_inFS Fileset_inode 70.0 60.0 high gpfs_fs_name
=nfs_shareFS, gpfs_cluster_name, 300
gpfs_fset_name
=nfs_shareFILESET gpfs_fs_name,
gpfs_fset_name
rule_ForAllFsets_inFS Fileset_inode 85.0 75.0 high gpfs_fs_name
=nfs_shareFS gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_fset_name
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_diskpool_name

2. Run the mmhealth thresholds list command to list the individual rules' priorities. In this
example, the rule_SingleFset_inFS rule has the highest priority for the nfs_shareFILESET
fileset. The rule_ForAllFsets_inFS rule has the highest priority for the other filesets that belong
to the nfs_shareFS file system, and the InodeCapUtil_Rule rule is valid for all the remaining
filesets.

[root@gpfsgui-11 ~]# mmhealth thresholds list -v -Y|grep priority


mmsysmonc_thresholdslist::0:1::InodeCapUtil_Rule:priority:4:
mmsysmonc_thresholdslist::0:1::DataCapUtil_Rule:priority:4:
mmsysmonc_thresholdslist::0:1::MemFree_Rule:priority:2:
mmsysmonc_thresholdslist::0:1::rule_SingleFset_inFS:priority:1:
mmsysmonc_thresholdslist::0:1::rule_ForAllFsets_inFS:priority:2:
mmsysmonc_thresholdslist::0:1::MetaDataCapUtil_Rule:priority:4:
[root@gpfsgui-11 ~]#
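
The colon-delimited -Y output can also be parsed programmatically to evaluate the rule priorities. The
following Python sketch is an illustration only; it assumes the field positions that are visible in the sample
output above (the rule name in the sixth colon-separated field, the attribute name in the seventh, and its
value in the eighth). Verify the positions against the -Y output of your release before relying on them.

#!/usr/bin/env python3
# Minimal sketch: extract the per-rule priority from the colon-delimited
# "mmhealth thresholds list -v -Y" output shown above. The field positions are
# assumed from that sample output and might differ in other releases.
import subprocess

def rule_priorities():
    out = subprocess.run(
        ["mmhealth", "thresholds", "list", "-v", "-Y"],
        capture_output=True, text=True, check=True
    ).stdout
    priorities = {}
    for line in out.splitlines():
        fields = line.split(":")
        if len(fields) > 7 and fields[6] == "priority":
            priorities[fields[5]] = int(fields[7])
    return priorities

if __name__ == "__main__":
    # Lower numbers win: a priority 1 rule overrides a priority 2 rule, and so on.
    for rule, prio in sorted(rule_priorities().items(), key=lambda kv: kv[1]):
        print(f"{prio}  {rule}")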

Use case 5: Identify the ACTIVE PERFORMANCE MONITOR node


This section describes the threshold use case to identify the ACTIVE PERFORMANCE MONITOR node.
To see the pmcollector node that is granted the ACTIVE PERFORMANCE MONITOR role, use the following
command:

[root@gpfsgui-21 ~]# mmhealth thresholds list

The system displays output similar to the following:

active_thresholds_monitor: gpfsgui-22.novalocal
### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy
sensitivity
--------------------------------------------------------------------------------------------------------------
---
InodeCapUtil_Rule Fileset_inode 90.0 80.0 high gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_fset_name
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_diskpool_name
MemFree_Rule mem_memfree 50000 100000 low node
300
SMBConnPerNode_Rule connect_count 3000 None high node
300
SMBConnTotal_Rule connect_count 20000 None high
300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name, 300
gpfs_fs_name,
gpfs_diskpool_name

The information about the ACTIVE PERFORMANCE MONITOR node is also included in the THRESHOLD
service health state.

[root@gpfsgui-21 ~]# mmhealth node show threshold -N all

The health status of the active_threshold_monitor for the nodes that have the ACTIVE
PERFORMANCE MONITOR role is shown as a subprocess of the THRESHOLD service.

Node name: gpfsgui-21.novalocal

Component Status Status Change Reasons


--------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 58 min. ago -

There are no active error events for the component THRESHOLD on this node
(gpfsgui-21.novalocal).

Node name: gpfsgui-22.novalocal

Component Status Status Change Reasons


-----------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -
SMBConnTotal_Rule HEALTHY 23 min. ago -
active_thresh_monitor HEALTHY 1 hour ago -

There are no active error events for the component THRESHOLD on this node
(gpfsgui-22.novalocal).

Node name: gpfsgui-23.novalocal

Component Status Status Change Reasons


---------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 58 min. ago -
SMBConnPerNode_Rule HEALTHY 33 min. ago -
SMBConnTotal_Rule HEALTHY 33 min. ago -

There are no active error events for the component THRESHOLD on this node
(gpfsgui-23.novalocal).

Node name: gpfsgui-24.novalocal

Component Status Status Change Reasons


---------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -
SMBConnPerNode_Rule HEALTHY 23 min. ago -

There are no active error events for the component THRESHOLD on this node
(gpfsgui-24.novalocal).

Node name: gpfsgui-25.novalocal

Component Status Status Change Reasons


---------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 58 min. ago -
SMBConnPerNode_Rule HEALTHY 23 min. ago -

If the ACTIVE PERFORMANCE MONITOR node loses the connection or is unresponsive, another
pmcollector node takes over the role of the ACTIVE PERFORMANCE MONITOR node. After a new
pmcollector takes over the ACTIVE PERFORMANCE MONITOR role, the status of all the cluster-wide
thresholds is also reported by the new ACTIVE PERFORMANCE MONITOR node.

root@gpfsgui-22 ~]# systemctl status pmcollector


pmcollector.service - zimon collector daemon
Loaded: loaded (/usr/lib/systemd/system/pmcollector.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Tue 2019-03-05 15:41:52 CET; 28min ago
Process: 1233 ExecStart=/opt/IBM/zimon/sbin/pmcollector -C /opt/IBM/zimon/ZIMonCollector.cfg -R
/var/run/perfmon (code=exited, status=0/SUCCESS)
Main PID: 1233 (code=exited, status=0/SUCCESS)

Mar 05 14:27:24 gpfsgui-22.novalocal systemd[1]: Started zimon collector daemon.


Mar 05 14:27:24 gpfsgui-22.novalocal systemd[1]: Starting zimon collector daemon...
Mar 05 15:41:50 gpfsgui-22.novalocal systemd[1]: Stopping zimon collector daemon...
Mar 05 15:41:52 gpfsgui-22.novalocal systemd[1]: Stopped zimon collector daemon.

[root@gpfsgui-21 ~]# mmhealth thresholds list

active_thresholds_monitor: gpfsgui-21.novalocal
### Threshold Rules ###
rule_name metric error warn direction filterBy groupBy sensitivity
-----------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule Fileset_inode 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_fset_name 300
DataCapUtil_Rule DataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_diskpool_name 300
MemFree_Rule mem_memfree 50000 100000 low node
300
SMBConnPerNode_Rule connect_count 3000 None high node
300
SMBConnTotal_Rule connect_count 20000 None high
300
MetaDataCapUtil_Rule MetaDataPool_capUtil 90.0 80.0 high gpfs_cluster_name,
gpfs_fs_name,
gpfs_diskpool_name 300

[root@gpfsgui-21 ~]# mmhealth node show threshold -N all

Node name: gpfsgui-21.novalocal

Component Status Status Change Reasons


-----------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -
SMBConnTotal_Rule HEALTHY 30 min. ago -
active_thresh_monitor HEALTHY 30 min. ago -

There are no active error events for the component THRESHOLD on this node (gpfsgui-21.novalocal).

Node name: gpfsgui-22.novalocal

Component Status Status Change Reasons


-----------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -

There are no active error events for the component THRESHOLD on this node (gpfsgui-22.novalocal).

Node name: gpfsgui-23.novalocal

Component Status Status Change Reasons


---------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -
SMBConnPerNode_Rule HEALTHY 1 hour ago -

There are no active error events for the component THRESHOLD on this node (gpfsgui-23.novalocal).

Node name: gpfsgui-24.novalocal

Component Status Status Change Reasons


---------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -
SMBConnPerNode_Rule HEALTHY 1 hour ago -

There are no active error events for the component THRESHOLD on this node (gpfsgui-24.novalocal).

Node name: gpfsgui-25.novalocal

Component Status Status Change Reasons


---------------------------------------------------------------
THRESHOLD HEALTHY 1 hour ago -
MemFree_Rule HEALTHY 1 hour ago -
SMBConnPerNode_Rule HEALTHY 1 hour ago -

There are no active error events for the component THRESHOLD on this node (gpfsgui-25.novalocal).

The ACTIVE PERFORMANCE MONITOR switch-over triggers a new event entry in the system health event log:

2019-03-05 15:42:02.844214 CET thresh_monitor_set_active INFO The thresholds monitoring


process is running
in ACTIVE state on the local node
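
To determine the ACTIVE PERFORMANCE MONITOR node programmatically, the active_thresholds_monitor
line of the mmhealth thresholds list output can be read with a short script. The following Python sketch
is an illustration only and assumes the "active_thresholds_monitor: <node>" line format that is shown in
the preceding outputs.

#!/usr/bin/env python3
# Minimal sketch: report which pmcollector node currently holds the
# ACTIVE PERFORMANCE MONITOR role by reading the first line of the
# "mmhealth thresholds list" output shown above.
import subprocess

def active_thresholds_monitor():
    out = subprocess.run(
        ["mmhealth", "thresholds", "list"],
        capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("active_thresholds_monitor:"):
            return line.split(":", 1)[1].strip()
    return None

if __name__ == "__main__":
    print(active_thresholds_monitor())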

Use case 6: Observe the memory usage with MemFree_Rule


This section describes the threshold use case to observe the memory usage with MemFree_Rule.
The default MemFree_Rule observes the estimated available memory in relation to the total memory
allocation on each cluster node. Run the following command to display all the active threshold rules:

[root@fscc-p8-23-c ~]# mmhealth thresholds list

The system displays output similar to the following:

active_thresholds_monitor: fscc-p8-23-c.mainz.de.ibm.com
### Threshold Rules ###
rule_name                 metric                   error  warn   direction  filterBy  groupBy                                            sensitivity
----------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule         Fileset_inode            90.0   80.0   high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
DataCapUtil_Rule          DataPool_capUtil         90.0   80.0   high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
MemFree_Rule              MemoryAvailable_percent  None   5.0    low                  node                                               300-min
diskIOreadresponseTime    DiskIoLatency_read       250    100    None                 node,diskdev_name                                  300
SMBConnPerNode_Rule       connect_count            3000   None   high                 node                                               300
diskIOwriteresponseTime   DiskIoLatency_write      250    100    None                 node,diskdev_name                                  300
SMBConnTotal_Rule         connect_count            20000  None   high                                                                    300
MetaDataCapUtil_Rule      MetaDataPool_capUtil     90.0   80.0   high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300

The MemFree_Rule rule raises a warning if the smallest value within the sensitivity period is lower than the
threshold warn boundary, which is set to 5% by default. All threshold events can
be reviewed in the mmhealth node show threshold command output.

[root@fscc-p8-23-c ~]# mmhealth node show threshold

The system displays output similar to the following:


Node name: fscc-p8-23-c.mainz.de.ibm.com

Component Status Status Change Reasons


----------------------------------------------------------------------------------------------
THRESHOLD DEGRADED 5 min. ago thresholds_warn(MemFree_Rule)
MemFree_Rule DEGRADED Now thresholds_warn(MemFree_Rule)
active_thresh_monitor HEALTHY 10 min. ago -
diskIOreadresponseTime HEALTHY 10 min. ago -
diskIOwriteresponseTime HEALTHY Now -

Event Parameter Severity Active Since Event Message


-------------------------------------------------------------------------------------------------------------------
thresholds_warn MemFree_Rule WARNING Now The value of MemoryAvailable_percent
for the component(s) MemFree_Rule/fscc-p8-23-c
exceeded threshold warning level 5.0
defined in MemFree_Rule.

All threshold events that are raised until now can also be reviewed by running the following command:

[root@fscc-p8-23-c ~]# mmhealth node eventlog | grep thresholds

The system displays output similar to the following:

...
2019-08-30 12:40:49.102217 CEST thresh_monitor_set_active INFO The thresholds monitoring process is
running in ACTIVE state on the local node
2019-08-30 12:41:04.092083 CEST thresholds_new_rule INFO Rule diskIOreadresponseTime was added
2019-08-30 12:41:04.127695 CEST thresholds_new_rule INFO Rule SMBConnTotal_Rule was added
2019-08-30 12:41:04.147223 CEST thresholds_new_rule INFO Rule diskIOwriteresponseTime was added
2019-08-30 12:41:19.117875 CEST thresholds_new_rule INFO Rule MemFree_Rule was added
2019-08-30 13:16:04.804887 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sda1
reached a normal level.
2019-08-30 13:16:04.831206 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sda2
reached a normal level.
2019-08-30 13:21:05.203115 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sdc
reached a normal level.
2019-08-30 13:21:05.227137 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sdd
reached a normal level.
2019-08-30 13:21:05.242787 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sde
reached a normal level.
2019-08-30 13:41:06.809589 CEST thresholds_removed INFO The value of DiskIoLatency_read for the component(s)
diskIOreadresponseTime/fscc-p8-23-c/sda1
defined in diskIOreadresponseTime was removed.
2019-08-30 13:41:06.902566 CEST thresholds_removed INFO The value of DiskIoLatency_read for the component(s)
diskIOreadresponseTime/fscc-p8-23-c/sda2
defined in diskIOreadresponseTime was removed.
2019-08-30 15:24:43.224013 CEST thresholds_warn WARNING The value of MemoryAvailable_percent for the
component(s)
MemFree_Rule/fscc-p8-23-c exceeded threshold warning
level
6.0 defined in MemFree_Rule.
2019-08-30 15:24:58.243273 CEST thresholds_normal INFO The value of DiskIoLatency_write defined in
diskIOwriteresponseTime for component
diskIOwriteresponseTime/fscc-p8-23-c/sda3
reached a normal level.
2019-08-30 15:24:58.289469 CEST thresholds_normal INFO The value of DiskIoLatency_write defined in
diskIOwriteresponseTime for component
diskIOwriteresponseTime/fscc-p8-23-c/sda
reached a normal level.
2019-08-30 15:29:43.648830 CEST thresholds_normal INFO The value of MemoryAvailable_percent defined
in MemFree_Rule for component MemFree_Rule/fscc-
p8-23-c
reached a normal level.
...

You can view the mmsysmonitor log located in the /var/adm/ras directory for more specific details
about the events that are raised.

[root@fscc-p8-23-c ~]# cat /var/adm/ras/mmsysmonitor.fscc-p8-23-c.log |grep


MemoryAvailable_percent
2019-08-30_14:39:06.681+0200: [I] ET_threshold Event=thresholds_normal
identifier=MemFree_Rule/fscc-p8-23-c
arg0= arg1=MemoryAvailable_percent arg2=MemFree_Rule

2019-08-30_15:24:43.252+0200: [I] ET_threshold Event=thresholds_warn identifier=MemFree_Rule/
fscc-p8-23-c
arg0=6.0 arg1=MemoryAvailable_percent arg2=MemFree_Rule

If the mmsysmonitor log level is set to DEBUG, and the buffering of the debug messages is turned off,
the log file includes all the messages about the threshold rule evaluation process.

[root@fscc-p8-23-c ~]# cat /var/adm/ras/mmsysmonitor.fscc-p8-23-c.log |grep MemoryAvailable_percent


2019-08-30_15:24:42.656+0200: [D] Thread-2096 doFilter _ColumnInfo(name='MemoryAvailable_percent', semType=15,
keys=(fscc-p8-23-b|Memory|mem_memtotal, fscc-p8-23-b|Memory|mem_memfree, fscc-p8-23-b|Memory|mem_buffers,
fscc-p8-23-b|Memory|mem_cached, fscc-p8-23-b|Memory|mem_memtotal, fscc-p8-23-b|Memory|mem_memtotal,
fscc-p8-23-b|Memory|mem_memfree, fscc-p8-23-b|Memory|mem_buffers, fscc-p8-23-b|Memory|mem_cached),
column=0) value 29.5395588982 ls=0 cs=0 - ThresholdStateFilter.doFilter:164
2019-08-30_15:24:42.658+0200: [D] Thread-2096 doFilter _ColumnInfo(name='MemoryAvailable_percent', semType=15,
keys=(fscc-p8-23-c|Memory|mem_memtotal, fscc-p8-23-c|Memory|mem_memfree, fscc-p8-23-c|Memory|mem_buffers,
fscc-p8-23-c|Memory|mem_cached, fscc-p8-23-c|Memory|mem_memtotal, fscc-p8-23-c|Memory|mem_memtotal,
fscc-p8-23-c|Memory|mem_memfree, fscc-p8-23-c|Memory|mem_buffers, fscc-p8-23-c|Memory|mem_cached),
column=1) value 4,36088625518 ls=0 cs=0 - ThresholdStateFilter.doFilter:164
2019-08-30_15:24:42.660+0200: [D] Thread-2096 doFilter _ColumnInfo(name='MemoryAvailable_percent', semType=15,
keys=(fscc-p8-23-d|Memory|mem_memtotal, fscc-p8-23-d|Memory|mem_memfree, fscc-p8-23-d|Memory|mem_buffers,
fscc-p8-23-d|Memory|mem_cached, fscc-p8-23-d|Memory|mem_memtotal, fscc-p8-23-d|Memory|mem_memtotal,
fscc-p8-23-d|Memory|mem_memfree, fscc-p8-23-d|Memory|mem_buffers, fscc-p8-23-d|Memory|mem_cached),
column=2) value 29.6084011311 ls=0 cs=0 - ThresholdStateFilter.doFilter:164
2019-08-30_15:24:42.661+0200: [D] Thread-2096 doFilter _ColumnInfo(name='MemoryAvailable_percent', semType=15,
keys=(fscc-p8-23-e|Memory|mem_memtotal, fscc-p8-23-e|Memory|mem_memfree, fscc-p8-23-e|Memory|mem_buffers,
fscc-p8-23-e|Memory|mem_cached, fscc-p8-23-e|Memory|mem_memtotal, fscc-p8-23-e|Memory|mem_memtotal,
fscc-p8-23-e|Memory|mem_memfree, fscc-p8-23-e|Memory|mem_buffers, fscc-p8-23-e|Memory|mem_cached),
column=3) value 10.5771109742 ls=0 cs=0 - ThresholdStateFilter.doFilter:164
2019-08-30_15:24:42.663+0200: [D] Thread-2096 doFilter _ColumnInfo(name='MemoryAvailable_percent', semType=15,
keys=(fscc-p8-23-f|Memory|mem_memtotal, fscc-p8-23-f|Memory|mem_memfree, fscc-p8-23-f|Memory|mem_buffers,
fscc-p8-23-f|Memory|mem_cached, fscc-p8-23-f|Memory|mem_memtotal, fscc-p8-23-f|Memory|mem_memtotal,
fscc-p8-23-f|Memory|mem_memfree, fscc-p8-23-f|Memory|mem_buffers, fscc-p8-23-f|Memory|mem_cached),
column=4) value 11.7536187253 ls=0 cs=0 - ThresholdStateFilter.doFilter:164

The estimation of the available memory on the node is based on the free memory, buffer, and cached
memory values, which are returned by the performance monitoring tool and derived from
the /proc/meminfo file. The following queries show how the available memory
percentage value depends on the sample interval. The larger the bucket_size, the more the metric
values are smoothed.

[root@fscc-p8-23-c ~]# date; echo "get metrics mem_memfree,mem_buffers,mem_cached,


mem_memtotal from node=fscc-p8-23-c last 10 bucket_size 300 " | /opt/IBM/zimon/zc localhost
Fri 30 Aug 15:26:31 CEST 2019
1: fscc-p8-23-c|Memory|mem_memfree
2: fscc-p8-23-c|Memory|mem_buffers
3: fscc-p8-23-c|Memory|mem_cached
4: fscc-p8-23-c|Memory|mem_memtotal
Row Timestamp mem_memfree mem_buffers mem_cached mem_memtotal
1 2019-08-30 14:40:00 4582652 0 225716 7819328
2 2019-08-30 14:45:00 4003383 0 307715 7819328
3 2019-08-30 14:50:00 3341518 0 344366 7819328
4 2019-08-30 14:55:00 3145560 0 359768 7819328
5 2019-08-30 15:00:00 1968256 0 378541 7819328
6 2019-08-30 15:05:00 710601 0 290097 7819328
7 2019-08-30 15:10:00 418108 0 157229 7819328
8 2019-08-30 15:15:00 365706 0 129333 7819328
9 2019-08-30 15:20:00 293367 0 146004 7819328 --> 5.6%
10 2019-08-30 15:25:00 1637727 3976 1665790 7819328

[root@fscc-p8-23-c ~]# date; echo "get metrics mem_memfree,mem_buffers,mem_cached,


mem_memtotal from node=fscc-p8-23-c last 10 bucket_size 60 " | /opt/IBM/zimon/zc localhost
Fri 30 Aug 15:26:07 CEST 2019
1: fscc-p8-23-c|Memory|mem_memfree
2: fscc-p8-23-c|Memory|mem_buffers
3: fscc-p8-23-c|Memory|mem_cached
4: fscc-p8-23-c|Memory|mem_memtotal
Row Timestamp mem_memfree mem_buffers mem_cached mem_memtotal
1 2019-08-30 15:17:00 310581 0 161529 7819328
2 2019-08-30 15:18:00 283733 0 116588 7819328
3 2019-08-30 15:19:00 265449 0 112012 7819328
4 2019-08-30 15:20:00 263733 0 102635 7819328 --> 4.7%
5 2019-08-30 15:21:00 251716 0 71268 7819328
6 2019-08-30 15:22:00 3222924 0 61258 7819328
7 2019-08-30 15:23:00 2576786 0 1164106 7819328
8 2019-08-30 15:24:00 1693056 6842 2877966 7819328
9 2019-08-30 15:25:00 2171244 7872 2341481 7819328
10 2019-08-30 15:26:00 2100834 7872 2358798 7819328
[root@fscc-p8-23-c ~]# date; echo "get metrics mem_memfree,mem_buffers,mem_cached,
mem_memtotal from node=fscc-p8-23-c last 600 bucket_size 1 " | /opt/IBM/zimon/zc localhost
Fri 30 Aug 15:27:02 CEST 2019
1: fscc-p8-23-c|Memory|mem_memfree

2: fscc-p8-23-c|Memory|mem_buffers
3: fscc-p8-23-c|Memory|mem_cached
4: fscc-p8-23-c|Memory|mem_memtotal
Row Timestamp mem_memfree mem_buffers mem_cached mem_memtotal
...
365 2019-08-30 15:20:17 600064 0 116864 7819328
366 2019-08-30 15:20:18 557760 0 117568 7819328
367 2019-08-30 15:20:19 550336 0 117888 7819328
368 2019-08-30 15:20:20 533312 0 117312 7819328
369 2019-08-30 15:20:21 508096 0 117632 7819328
370 2019-08-30 15:20:22 450816 0 116736 7819328
371 2019-08-30 15:20:23 414272 0 118976 7819328
372 2019-08-30 15:20:24 400384 0 119680 7819328
373 2019-08-30 15:20:25 372096 0 119680 7819328
374 2019-08-30 15:20:26 344448 0 123392 7819328
375 2019-08-30 15:20:27 289664 0 131840 7819328
376 2019-08-30 15:20:28 254848 0 131840 7819328
377 2019-08-30 15:20:29 252416 0 127680 7819328 --> 4.7%
378 2019-08-30 15:20:30 244224 0 125504 7819328 --> 4.8%
379 2019-08-30 15:20:31 271872 0 105920 7819328
380 2019-08-30 15:20:32 270400 0 70592 7819328 ---> 4.3%
381 2019-08-30 15:20:33 229312 0 111360 7819328
382 2019-08-30 15:20:34 null null null null
383 2019-08-30 15:20:35 null null null null
...

527 2019-08-30 15:22:59 3186880 0 1902976 7819328


528 2019-08-30 15:23:00 3045760 0 1957952 7819328
529 2019-08-30 15:23:01 2939456 0 2012224 7819328
530 2019-08-30 15:23:02 2858560 0 2038272 7819328
531 2019-08-30 15:23:03 2835840 0 2038144 7819328
532 2019-08-30 15:23:04 2774656 0 2059968 7819328
533 2019-08-30 15:23:05 2718720 0 2089920 7819328
534 2019-08-30 15:23:06 2681280 0 2095168 7819328
535 2019-08-30 15:23:07 2650688 0 2136896 7819328
536 2019-08-30 15:23:08 2531200 1216 2224832 7819328
537 2019-08-30 15:23:09 2407744 7872 2298880 7819328
538 2019-08-30 15:23:10 2326976 7872 2341440 7819328
539 2019-08-30 15:23:11 2258240 7872 2410624 7819328

The suffix -min in the rule sensitivity parameter prevents the averaging of the metric values. Of all the
data points that are returned by a metrics sensor for a specified sensitivity interval, the smallest value is
used in the threshold evaluation process. Use the following command to get all parameter settings of the
default MemFree_Rule:

[root@fscc-p8-23-c ~]# mmhealth thresholds list -v

The system displays output similar to the following:

### MemFree_Rule details ###

attribute          value
------------------------------------------------------------------------------
rule_name          MemFree_Rule
frequency          300
tags               thresholds
user_action_warn   The estimated available memory is less than 5%,
                   calculated to the total RAM or 40 GB, whichever is lower.
                   The system performance and stability might be affected.
                   For more information see the IBM Storage Scale
                   performance tuning guidelines.
user_action_error  None
priority           2
downsamplOp        None
type               measurement
metric             MemoryAvailable_percent
metricOp           noOperation
bucket_size        300
computation        None
duration           None
filterBy
groupBy            node
error              None
warn               6.0
direction          low
hysteresis         0.0
sensitivity        300-min

Use case 7: Observe the running state of the defined threshold rules


This section describes the threshold use case to observe the running state of the defined threshold rules.
1. To view the list of all the threshold rules that are defined for the system and their running state, run the
following command:

mmhealth thresholds list -v -Y |grep state

The system displays output similar to the following:

[root@gpfsgui-11 ~]# mmhealth thresholds list -v -Y |grep state


mmhealth_thresholds:THRESHOLD_RULE:0:1::InodeCapUtil_Rule:state:active:
mmhealth_thresholds:THRESHOLD_RULE:0:1::DataCapUtil_Rule:state:active:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:state:active:
mmhealth_thresholds:THRESHOLD_RULE:0:1::SMBConnPerNode_Rule:state:inactive:
mmhealth_thresholds:THRESHOLD_RULE:0:1::SMBConnTotal_Rule:state:inactive:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MetaDataCapUtil_Rule:state:active:

Note: A threshold rule can be in one of the following states:


Active
A rule is running.
Inactive
A rule is not running. Either no keys for the defined metric measurement are found in the
performance monitoring tool metadata or the corresponding sensor is not enabled. As soon as
the metric keys and metric values are detected for a rule, the state of the rule switches to active.
Unknown
The state of a rule is not determined because of an issue with querying internal data.
2. To check the state of the evaluation result for all the active or running threshold rules on each node,
run the following command:

mmhealth node show threshold -N all

Note: No state is reported for inactive threshold rules.

Configuring webhook by using the mmhealth command


Configure a web server that accepts webhook POST requests from the IBM Storage Scale mmhealth
command. The procedure covers the HTTP server authentication process, how the server validates the
UUID value, and how it acts on the incoming events.
A webhook is a basic API that allows one-way sharing of data between a client and an HTTP server. In
most cases, the data is sent when an event or a trigger mechanism occurs. When a webhook is configured
in IBM Storage Scale by using the mmhealth command, health events that occur during system health
monitoring trigger the one-way transfer of data to the HTTP server that is defined in the webhook.

Setting up an HTTP server


Consider an HTTP server that responds only on a specific webhook API. When the mmhealth command
posts data to this webhook API, the HTTP server prints a simple summary of the received health events.
The posted data can be examined for specific health events and, based on those events, further actions
can be taken. For example, if a disk_down health event is encountered, the HTTP server can trigger an
email or alarm mechanism so that the service team can take immediate action; a sketch of this pattern
follows Example 3.
You can use programming languages, such as Rust, Go, or Python to configure a webhook server.
Example 1: Setting up a webhook server by using Rust

#![allow(non_camel_case_types)]
#![allow(non_snake_case)]

use actix_web::{web, App, HttpResponse, HttpServer};


use serde::Deserialize;
use std::collections::HashMap;
use std::env;

#[derive(Deserialize)]
struct HealthEvent {
cause: String,
code: String,
component: String,
container_restart: bool,
container_unready: bool,
description: String,
entity_name: String,
entity_type: String,
event: String,
event_type: String,
ftdc_scope: String,
identifier: String,
internalComponent: String,
is_resolvable: bool,
message: String,
node: String,
priority: u64,
remedy: Option<String>,
requireUnique: bool,
scope: String,
severity: String,
state: String,
time: String,
TZONE: String,
user_action: String,
}

#[derive(Deserialize)]
struct PostMsg {
version: String,
reportingController: String,
reportingInstance: String,
events: Vec<HealthEvent>,
}

fn post_handler(events: web::Json<PostMsg>) -> HttpResponse {


let mut counts: HashMap<String, u64> = HashMap::new();
for e in &events.events {
*counts.entry(e.severity.clone()).or_insert(0) += 1;
}
println!("{}: {:?}", events.reportingInstance, counts);
HttpResponse::Ok().content_type("text/html").body("OK")
}

pub fn main() {
let args: Vec<String> = env::args().collect();
if args.len() != 2 {
println!("Usage: {} IPAddr:Port", args[0]);
return;
}

println!("Listening on {}", args[1]);


HttpServer::new(|| App::new().route("/webhook", web::post().to(post_handler)))
.bind(&args[1])
.expect("error binding server to IP address:port")
.run()
.expect("error running server");
}

Example 2: Setting up a webhook server by using Go

package main

import (
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"time"
)

type healthEvent struct {


Cause string `json:"cause"`
Code string `json:"code"`
Component string `json:"component"`
ContainerRestart bool `json:"container_restart"`
ContainerUnready bool `json:"container_unready"`
Description string `json:"description"`
EntityName string `json:"entity_name"`
EntityType string `json:"entity_type"`
Event string `json:"event"`
EventType string `json:"event_type"`
FTDCScope string `json:"ftdc_scope"`
Identifier string `json:"identifier"`
InternalComponent string `json:"internalComponent"`
IsResolvable bool `json:"is_resolvable"`
Message string `json:"message"`
Node string `json:"node"`
Priority int `json:"priority"`
Remedy string `json:"remedy"`
RequireUnique bool `json:"requireUnique"`
Scope string `json:"scope"`
Severity string `json:"severity"`
State string `json:"state"`
Time time.Time `json:"time"`
TZone string `json:"TZONE"`
UserAction string `json:"user_action"`
}

type postMsg struct {
Version string `json:"version"`
ReportingController string `json:"reportingController"`
ReportingInstance string `json:"reportingInstance"`
Events []healthEvent `json:"events"`
}

func webhook(w http.ResponseWriter, r *http.Request) {


decoder := json.NewDecoder(r.Body)
var data postMsg
if err := decoder.Decode(&data); err != nil {
fmt.Println(err)
return
}
counts := make(map[string]int)
for _, event := range data.Events {
_, matched := counts[event.Severity]
switch matched {
case true:
counts[event.Severity] += 1
case false:
counts[event.Severity] = 1
}
}
fmt.Println(data.ReportingInstance, ":", counts)
}

func main() {
http.HandleFunc("/webhook", webhook)

if len(os.Args) != 2 {
log.Fatal(fmt.Errorf("usage: %s IPAddr:Port", os.Args[0]))
}
fmt.Printf("Starting server listening on %s ...\n", os.Args[1])
if err := http.ListenAndServe(os.Args[1], nil); err != nil {
log.Fatal(err)
}
}

The following example shows how to start the Go program that is compiled to a binary named webhook.
You must provide the IP address and port number that the HTTP server can use.

[httpd-dev]# ./webhook 192.0.2.48:9000


Starting server listening on 192.0.2.48:9000 …

Example 3: Setting up a webhook server by using Python

#!/usr/bin/env python3
import argparse
from collections import Counter
import cherrypy
import json

class DataView(object):
exposed = True
@cherrypy.tools.accept(media='application/json')
def POST(self):
rawData = cherrypy.request.body.read(int(cherrypy.request.headers['Content-
Length']))
b = json.loads(rawData)
eventCounts = Counter([e['severity'] for e in b['events']])
print(f"{b['reportingInstance']}: {json.dumps(eventCounts)}")
if dump_json: print(json.dumps(b, indent=4))
return "OK"

if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--json', action='store_true')
parser.add_argument('IP', type=str)
parser.add_argument('PORT', type=int)
args = parser.parse_args()
conf = {
'/': {'request.dispatch': cherrypy.dispatch.MethodDispatcher(),}
}
dump_json = args.json
cherrypy.server.socket_host = args.IP
cherrypy.server.socket_port = args.PORT
cherrypy.quickstart(DataView(), '/webhook/', conf)
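
As a variation of Example 3, the following Python sketch shows the event-filtering step that was
mentioned at the beginning of this topic: it scans a decoded webhook payload for the disk_down event
and hands it to a notification hook. The notify() function is a placeholder for your own email or alarm
mechanism and is not part of IBM Storage Scale; the process_events() call can be added to the POST
handler of Example 3 after json.loads().

#!/usr/bin/env python3
# Sketch of the event-filtering step described above: scan a decoded webhook
# payload for the disk_down event (or any ERROR severity event) and hand it
# to a notification hook. notify() is a placeholder, not an IBM Storage Scale API.
import json

def notify(event):
    # Replace this with your email, pager, or ticketing integration.
    print(f"ALERT: {event['event']} on node {event['node']}: {event['message']}")

def process_events(payload):
    for event in payload.get("events", []):
        if event.get("event") == "disk_down" or event.get("severity") == "ERROR":
            notify(event)

if __name__ == "__main__":
    # Minimal payload for a dry run; real payloads follow the JSON structure
    # that is described in "Webhook JSON data" below.
    sample = json.loads('{"events": [{"event": "disk_down", "severity": "ERROR", '
                        '"node": "3", "message": "Disk nsd_1 is down"}]}')
    process_events(sample)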

Configuring a webhook
In an IBM Storage Scale cluster, you can configure the mmhealth framework to interact with the webhook
server by using the mmhealth config webhook add command. For more information, see mmhealth
command section in IBM Storage Scale: Command and Programming Reference Guide.
For example, run the mmhealth command by using the IP address 192.0.2.48 and port 9000 that was
used when starting up the webhook server.

[root@frodo-master ~]# mmhealth config webhook add http://192.0.2.48:9000/webhook


Successfully connected to http://192.0.2.48:9000/webhook

To list all webhooks that are currently configured, use the command:

[root@frodo-master ~]# mmhealth config webhook list http://192.0.2.48:9000/webhook

You can add the -Y option for extended information and to make the output easier to process
by other programs and scripts:

[root@frodo-master ~]# mmhealth config webhook list -Y


mmhealth:webhook:HEADER:version:reserved:reserved:url:uuid:status:
mmhealth:webhook:0:1:::http%3A//192.0.2.48%3A9000/webhook:af0f2f36-771e-4e3d-930d-40bc30ab41f9:enabled:

The -Y output shows the UUID value that is associated with this webhook. The UUID value is set in the
HTTP POST header when the mmhealth command posts to the webhook.
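
The UUID can be used by the webhook server to validate incoming posts. The following Python sketch
outlines such a check as an addition to the cherrypy server from Example 3. The header name "Uuid"
is a placeholder because the exact header name that mmhealth sets is not documented here; inspect
cherrypy.request.headers on a real POST to determine it, and compare the received value with the UUID
that is reported by mmhealth config webhook list -Y.

#!/usr/bin/env python3
# Sketch of the UUID check described above, written as an addition to the
# cherrypy server from Example 3. The header name "Uuid" is a placeholder;
# inspect cherrypy.request.headers on a real POST to find the actual name.
import cherrypy

EXPECTED_UUID = "af0f2f36-771e-4e3d-930d-40bc30ab41f9"  # from the -Y listing above

def validate_uuid():
    received = cherrypy.request.headers.get("Uuid")  # placeholder header name
    if received != EXPECTED_UUID:
        # Reject posts that do not carry the UUID of the configured webhook.
        raise cherrypy.HTTPError(403, "unknown webhook UUID")

# Call validate_uuid() at the top of DataView.POST() in Example 3.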
When the webhook is configured in the mmhealth framework, the webhook server starts to receive health
events.
Note: Events are sent only when a health event is triggered in IBM Storage Scale.

[httpd-dev]# ./webhook 192.0.2.48:9000


Starting server listening on 192.0.2.48:9000 ...
frodo-master.fyre.ibm.com : map[INFO:29]
frodo-master.fyre.ibm.com : map[INFO:8]
frodo-master.fyre.ibm.com : map[INFO:3]
frodo-master.fyre.ibm.com : map[INFO:9]
frodo-master.fyre.ibm.com : map[INFO:1]
frodo-master.fyre.ibm.com : map[INFO:4]
frodo-master.fyre.ibm.com : map[INFO:2]
frodo-master.fyre.ibm.com : map[INFO:2]

Important: If the mmhealth webhook framework encounters issues with the configured webhook URL, the
URL is disabled over time, by default after 24 hours. The -Y output shows the enabled or disabled status
of each webhook URL. If a webhook URL gets disabled, rerun the mmhealth config webhook add
command to add the URL again.

Webhook JSON data


The sample code in the preceding examples creates a webhook server that summarizes the
information that is posted by the mmhealth command. The structure of the posted JSON data helps
you to parse the data to get the information that is needed to take further actions.
The posted data is in the JSON format and contains information on the reporting node name that sends
health events with event-specific details.
The JSON data for the POST event is defined as:

version: string
reportingController: string
reportingInstance: string
events: array

Each entry in the events array is defined as follows:

cause: string
code: string
component: string
container_restart: boolean
container_unready: boolean
description: string
entity_name: string
entity_type: string
event: string
event_type: string
ftdc_scope: string
identifier: string
internalComponent: string
is_resolvable: boolean
message: string
node: string
priority: integer
remedy: string
requireUnique: boolean
scope: string
severity: string
state: string
time: DateTime
TZONE: string
user_action: string
full_identifier: string
From IBM Storage Scale 5.1.8, the webhook format version 2 is used with a new field,
full_identifier, to correlate events. For example, two different events, disk_failed and disk_ok,
might occur for the same disk in a file system. These events can be correlated by using the
full_identifier field to determine the latest disk state, such as recovered.
• For node local events, such as gpfs_down, gpfs_up, the full_identifier field has the following
properties:
– clusterid
– nodeid
– component
– internal_component
– identifier
• For cluster-wide events, such as pool_data_high, gpfs_up, the full_identifier field has the
following properties:
– clusterid
– component
– internal_component
– identifier
The following example shows what the JSON data might contain when the mmhealth command posts to
the configured webhook.

{
"version": "2",
"reportingController": "spectrum-scale",
"reportingInstance": "gpfs-14.localnet.com",
"events": [
{
"cause": "A file system was unmounted.",
"code": "999305",
"component": "filesystem",
"container_restart": false,
"container_unready": false,
"description": "A file system was unmounted.",
"entity_name": "t123fs",
"entity_type": "FILESYSTEM",
"event": "fs_unmount_info",
"event_type": "INFO_EXTERNAL",
"ftdc_scope": "",
"identifier": "t123fs",
"internalComponent": "",
"is_resolvable": false,
"message": "The file system t123fs was unmounted normal.",
"node": "6",
"priority": 99,
"remedy": null,
"requireUnique": true,
"scope": "NODE",
"severity": "INFO",
"state": "UNKNOWN",
"time": "2023-05-03T16:44:56+02:00",
"TZONE": "CEST",
"user_action": "N/A",
"full_identifier": "317908494475311923/6/filesystem//t123fs"
}
]
}
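
The following Python sketch illustrates one way to apply the correlation that is described above: it keeps
the most recent event per full_identifier so that, for example, a later disk_ok event supersedes an earlier
disk_failed event for the same disk. The sketch is an illustration only and is not part of the mmhealth
framework.

#!/usr/bin/env python3
# Minimal sketch: keep the most recent event per full_identifier so that a
# recovery event replaces the earlier failure event for the same entity.
# The payload structure is the version 2 JSON format shown above.
latest_state = {}

def correlate(payload):
    for event in payload.get("events", []):
        key = event.get("full_identifier")
        if not key:
            continue
        previous = latest_state.get(key)
        # "time" is an ISO 8601 timestamp, so string comparison orders events
        # correctly as long as the time zone offsets match.
        if previous is None or event["time"] >= previous["time"]:
            latest_state[key] = event

def current_state(full_identifier):
    event = latest_state.get(full_identifier)
    return event["state"] if event else "UNKNOWN"

if __name__ == "__main__":
    correlate({"events": [{"full_identifier": "317908494475311923/6/filesystem//t123fs",
                           "state": "UNKNOWN", "time": "2023-05-03T16:44:56+02:00"}]})
    print(current_state("317908494475311923/6/filesystem//t123fs"))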

Additional checks on file system availability for CES exported data


A CES cluster exports file systems to its clients by using NFS or SMB. These exports might be fully or
partially located on the CES cluster directly, or might be remote-mounted from other storage systems.
If such a mount is not available when the NFS or SMB service starts up, or at run time, the system raises
an error. Events set the NFS or SMB state to a DEGRADED or FAILED state if the necessary file systems
are not available.
The NFS and SMB monitoring checks that the file systems that are required by the declared exports are all
available. If one or more of these file systems are unavailable, they are marked as FAILED in the
mmhealth node show filesystem -v command output. The corresponding NFS or SMB components
are set to a DEGRADED state. For NFS, the nfs_exports_down event is created initially. For
SMB, the smb_exports_down event is created initially.
Alternatively, the CES nodes can be set automatically to a FAILED state instead of a DEGRADED state if
the required remote or local file systems are not mounted. The change in state can be done only by the
Cluster State Manager (CSM). If the CSM detects that some of the CES nodes are in a DEGRADED state,
then it can overrule the DEGRADED state with a FAILED state to trigger a failover of the CES-IPs to a
healthy node.
Note: This overrule is limited to the file-system-related events nfs_exports_down and
smb_exports_down. Other events that cause a DEGRADED state are not handled by this procedure.
For NFS, the nfs_exports_down warning event is countered by a nfs_exported_fs_down error
event from the CSM to mark it as FAILED. Similarly, for SMB, the smb_exports_down warning event is
countered by a smb_exported_fs_down error event to mark it as FAILED.

After the CSM detects that all the CES nodes report an nfs_exports_down or smb_exports_down
status, it clears the nfs_exported_fs_down or smb_exported_fs_down events to allow each node to
rediscover its own state again. This prevents a cluster outage if only one protocol is affected but others
are active. However, such a state might not be stable for a while and must be fixed as soon as possible. If
the file systems are mounted back again, the SMB or NFS service monitors detect this and refresh
their health state information.
This CSM feature can be configured as follows:
1. Make a backup copy of the current /var/mmfs/mmsysmon/mmsysmonitor.conf file.
2. Open the file with a text editor, and search for the [clusterstate] section to set the value of
csmsetmissingexportsfailed to true or false:

[clusterstate]
...

# true = allow CSM to override NFS/SMB missing export events on the CES nodes (set to FAILED)
# false = CSM does not override NFS/SMB missing export events on the CES nodes
csmsetmissingexportsfailed = true

3. Close the editor and restart the system health monitor using the following command:

mmsysmoncontrol restart

4. Run this procedure on all the nodes or copy the modified files to all nodes and restart the system
health monitor on all nodes.
Important: During the restart of a node, some internal checks are done by the system health monitor for
a file system's availability if NFS or SMB is enabled. These checks detect if all the required file systems
for the declared exports are available. There might be cases where file systems are not available or are
unmounted at the time of the check. This might be a timing issue, or because some file systems are not
automatically mounted. In such cases, the NFS service is not started and remains in a STOPPED state
even if all relevant file systems are available at a later point in time.
This feature can be configured as follows:
1. Make a backup copy of the current mmsysmonitor.conf file.
2. Open the file with a text editor, and search for the nfs section to set the value of
preventnfsstartuponmissingfs to true or false:

# NFS settings
#
[nfs]
...
# prevent NFS startup after reboot/mmstartup if not all required filesystems for exports are
available
# true = prevent startup / false = allow startup
preventnfsstartuponmissingfs = true

3. Close the editor and restart the system health monitor using the following command:

mmsysmoncontrol restart

4. Run this procedure on all the nodes or copy the modified files to all nodes and restart the system
health monitor on all nodes.

Proactive system health alerts


The following section describes how the HEALTHCHECK component shows system health events that
result from proactive system health alerts in the IBM Storage Scale system.
The system health component HEALTHCHECK has been introduced to improve the experience of
users with Call Home enabled, by providing more direct support from IBM to those users without any
detours.

The HEALTHCHECK component monitors the scheduled call home uploads, which are sent to the
IBM server and analyzed by the IBM health check service. If there are any security issues
or misconfigurations, the health check service generates alerts. The HEALTHCHECK component
monitors these proactive system health alerts.
Note: For a detailed description of the received health events, check the output of the mmhealth
event show command. All these information updates are retrieved from the IBM support server once a
day.
The following procedure is used to generate system alerts:
1. Perform a scheduled call home upload to the IBM health check service.
2. Analyze the new data.
3. Raise system health alerts.
4. Download data from IBM by using the monitoring cycle of the HEALTHCHECK component.
This procedure can take more than 24 hours to complete. Therefore, if the root cause of an alert is
fixed, the event that is related to that alert is not immediately reflected in the mmhealth command output.
To remove the event immediately, use the following command:

mmhealth event resolve <event_name>

Note: Ensure that the issue that is described by the health event is actually resolved. Otherwise, it
reappears in the mmhealth command output.
If you intentionally do not want to solve the reported issue and no longer want to be warned, use
the following command to disable this particular check:

mmhealth event hide <event_name>

To unhide such a hidden event, issue the following command:

mmhealth event unhide <event_name>

Only the HEALTHCHECK events, which are received from the IBM health check service, can be hidden.
These event names start with hc_.

Chapter 3. Dynamic page pool monitoring
The size and status of the dynamic page pool can be checked by issuing the mmdiag --pagepool
command.
The command output provides the following details:
• The minimum-allowed size of the page pool
• The smallest size of the page pool while the daemon was running: “low watermark”
• The current size of the page pool
• The largest size of the page pool while the daemon was running: “high watermark”
• The maximum allowed size of the page pool
• The detected memory on the node
The IBM Storage Scale performance monitoring also includes the following metric for the current size of
the page pool. For more information, see “GPFS metrics” on page 120.
• pfs_bufm_tot_poolSizeK: Total size of the page pool.
Note: Due to calculations and rounding operations, the pagepool size can be initially slightly less than the
minimum size.
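
The page pool size metric can also be queried through the performance monitoring query interface. The
following Python sketch is an illustration only; it reuses the zc query pattern that is shown in the
MemFree_Rule use case earlier in this guide, and the node name is a placeholder that you must replace
with a node of your cluster.

#!/usr/bin/env python3
# Illustrative sketch: query the page pool size metric through the zc query
# interface of the performance monitoring tool. The query pattern follows the
# "get metrics ... | /opt/IBM/zimon/zc localhost" examples shown earlier, and
# NODE is a placeholder that you must replace with a real node name.
import subprocess

NODE = "node1.example.com"  # placeholder node name
query = f"get metrics pfs_bufm_tot_poolSizeK from node={NODE} last 10 bucket_size 60\n"

result = subprocess.run(
    ["/opt/IBM/zimon/zc", "localhost"],
    input=query, capture_output=True, text=True, check=True
)
print(result.stdout)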

Chapter 4. Performance monitoring
With IBM Storage Scale, system administrators can monitor the performance of GPFS and the
communications protocols that it uses.

Network performance monitoring


Network performance can be monitored either by using Remote Procedure Call (RPC) statistics or by
using the IBM Storage Scale graphical user interface (GUI).

Monitoring networks by using RPC statistics


You can monitor the network performance of IBM Storage Scale nodes and of the communication
protocols that are used for exchanging information between them. One such communication protocol
is RPC, which is used to send request and response messages between IBM Storage Scale nodes over an
Ethernet or an InfiniBand interface.
Each IBM Storage Scale node has a set of seven RPC statistics that are cached per node, and one RPC
statistic that is cached per size of the RPC message to monitor the network performance. The counters
are measured in seconds and milliseconds.
The following statistics are cached per IBM Storage Scale node:
Channel Wait Time
The amount of time the RPC must wait to access a communication channel to the destination IBM
Storage Scale node.
Send Time TCP
The amount of time, which is needed to transfer an RPC message over an Ethernet interface.
Send Time Verbs
The amount of time, which is needed to transfer an RPC message to an InfiniBand interface.
Receive Time TCP
The amount of time to transfer an RPC message from an Ethernet interface to the GPFS daemon.
Latency TCP
The latency of an RPC message when sent and received over an Ethernet interface.
Latency Verbs
The latency of an RPC message when sent and received over an InfiniBand interface.
Latency Mixed
The latency of an RPC message when sent over one type of interface (Ethernet or InfiniBand) and
received over the other (InfiniBand or Ethernet).
Note: If an InfiniBand interface is not configured, then no statistics are cached for Send Time Verbs,
Latency Verbs, and Latency Mixed.
The GPFS daemon considers the RPC latency as a relative measure of GPFS network performance. The
RPC latency is defined as the difference between the round-trip time and the execution time. Here,
the round-trip time is measured as the time from the start of writing an RPC request message over an
interface till an RPC response message is received. Whereas, execution time is measured as the time
an RPC request message is received on a GPFS destination node till an RPC response message is sent.
Therefore, the RPC latency can be defined as the amount of time the RPC is being transmitted and
received over a network.
There is an RPC statistic that is associated with each of a set of size ranges, each with an upper bound
that is a power of 2. The first range is 0-64, then 65-128, then 129-256, and then continuing until the last
range has an upper bound of twice the maxBlockSize. For example, if the maxBlockSize is 1 MB, the
upper bound of the last range is 2,097,152 (2 MB). For each of these ranges, the associated statistic is the
latency of the RPC whose size falls within that range. The size of an RPC is the amount of data that is sent

© Copyright IBM Corp. 2015, 2024 55


plus the amount of data received. However, if one amount is more than 16 times greater than the other,
only the larger amount is used as the size of the RPC.
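
To make the range definition concrete, the following Python sketch computes which size range a given
RPC falls into by using the rules from the previous paragraph. It is an illustration of the bucketing
scheme only and not the GPFS implementation; the maxBlockSize value of 1 MB is the example value
that is used above.

#!/usr/bin/env python3
# Illustrative sketch (not the GPFS implementation) of the size-range scheme
# described above. The RPC size is the sum of the bytes sent and received,
# unless one direction is more than 16 times larger than the other, in which
# case only the larger amount counts. Each statistic bucket covers a
# power-of-two range, starting at 0-64 bytes and capped at twice maxBlockSize.

def rpc_size(bytes_sent, bytes_received):
    larger = max(bytes_sent, bytes_received)
    smaller = min(bytes_sent, bytes_received)
    if smaller and larger > 16 * smaller:
        return larger
    return bytes_sent + bytes_received

def size_range_upper_bound(size, max_block_size=1024 * 1024):
    upper = 64  # the first range is 0-64 bytes
    while size > upper and upper < 2 * max_block_size:
        upper *= 2  # 65-128, 129-256, and so on, up to 2 * maxBlockSize
    return upper  # upper bound of the range that this RPC falls into

# Example: 200 bytes sent and 50000 bytes received fall into the 32769-65536 range.
print(size_range_upper_bound(rpc_size(bytes_sent=200, bytes_received=50000)))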
The final statistic associated with each type of RPC message, on the node where the RPC is received, is
the execution time of the RPC.
The RPC statistics that are used for network performance monitoring are maintained as an aggregation
of values. By default, an aggregation consists of 60 one-second intervals, 60 one-minute intervals, 24
one-hour intervals, and 30 one-day intervals.
Each time interval consists of the following values:
• Sum of values that are accumulated during the interval.
• Count of values that are added to the aggregation total.
• Minimum value that is added to the aggregation total.
• Maximum value that is added to the aggregation total.
After 60 seconds from the time GPFS daemon starts, the oldest 1-second interval is discarded, and a new
1-second interval with latest RPC data is added.
After receiving each RPC response message, the following information is saved in a raw statistics buffer:
• Channel wait time
• Send time
• Receive time
• Latency
• Length of data sent
• Length of data received
• Flags indicating whether the RPC was sent or received over InfiniBand
• Target node identifier
As each RPC completes execution, the execution time for the RPC and the message type of the RPC is
saved in a raw execution buffer. The raw buffers are processed per second, and then, the values are
added to the appropriate aggregated statistic. For each value, the value is added to the statistic's sum, the
count is incremented, and the value is compared to the minimum and maximum, which are adjusted as
needed. Upon completion of this processing, for each statistic the sum, count, minimum, and maximum
values are entered into the next 1-second interval.
Every 60 seconds, the sums, and counts in the 60 1-second intervals are added into a 1-minute sum
and count. The smallest value of the 60 minimum values is determined, and the largest value of the 60
maximum values is determined. This 1-minute sum, count, minimum, and maximum are then entered into
the next 1-minute interval.
An analogous pattern holds for the minute, hour, and day periods. For any one particular interval, the sum
is the sum of all raw values that are processed during that interval, the count is the count of all values
during that interval, the minimum is the minimum of all values during that interval, and the maximum is
the maximum of all values during that interval.
When statistics are displayed for any particular interval, an average is calculated from the sum and count,
then the average, minimum, maximum, and count are displayed. The average, minimum, and maximum
are displayed in units of milliseconds, to three decimal places (1-microsecond granularity).
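
The following Python sketch illustrates the aggregation scheme that is described above: each interval
keeps a sum, count, minimum, and maximum, and finer-grained intervals are rolled up into coarser ones
by adding the sums and counts and by taking the overall minimum and maximum. It is an illustration only
and not the GPFS implementation.

#!/usr/bin/env python3
# Illustrative sketch (not the GPFS implementation) of the interval
# aggregation described above: 60 one-second intervals roll up into one
# one-minute interval, minutes into hours, and hours into days.
from dataclasses import dataclass

@dataclass
class Interval:
    total: float = 0.0
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value):
        self.total += value
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def average(self):
        return self.total / self.count if self.count else 0.0

def roll_up(intervals):
    # Combine finer-grained intervals into one coarser interval by summing the
    # sums and counts and taking the overall minimum and maximum.
    combined = Interval()
    for i in intervals:
        if i.count == 0:
            continue
        combined.total += i.total
        combined.count += i.count
        combined.minimum = min(combined.minimum, i.minimum)
        combined.maximum = max(combined.maximum, i.maximum)
    return combined

# Example: roll 60 one-second intervals, each with one latency sample, into one minute.
seconds = []
for value in range(60):
    interval = Interval()
    interval.add(float(value))
    seconds.append(interval)
minute = roll_up(seconds)
print(minute.count, minute.minimum, minute.maximum, round(minute.average(), 3))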
The RPC buffers and intervals can be controlled by using the following mmchconfig command attributes:
• rpcPerfRawStatBufferSize
• rpcPerfRawExecBufferSize
• rpcPerfNumberSecondIntervals
• rpcPerfNumberMinuteIntervals
• rpcPerfNumberHourIntervals
• rpcPerfNumberDayIntervals
The mmdiag command with the --rpc parameter can be used to query RPC statistics.
For more information, see the topics mmchconfig command, mmnetverify command, and mmdiag
command in the IBM Storage Scale: Administration Guide.
Related concepts
Monitoring I/O performance with the mmpmon command
Use the mmpmon command to monitor the I/O performance of IBM Storage Scale on the node on which it
is run and on other specified nodes.
Using the performance monitoring tool
The performance monitoring tool collects metrics from GPFS and protocols and provides system
performance information. By default, the performance monitoring tool is enabled, and it consists of
Collectors, Sensors, and Proxies.
Viewing and analyzing the performance data
The performance monitoring tool displays the performance metrics that are associated with GPFS and
the associated protocols. It helps you get a graphical representation of the status and trends of the key
performance indicators, and analyze IBM Storage Scale performance problems.

Monitoring networks by using GUI


The Network page provides an easy way to monitor the performance, health status, and configuration
aspects of all available networks and interfaces that are part of the networks.
A dedicated network is used within the cluster for certain operations. For example, the system uses
the administration network when an administration command is issued. It is also used for sharing
administration-related information. This network is used for node-to-node communication within the
cluster. The daemon network is used for sharing file system and other resource data. Remote clusters also establish a communication path through the daemon network. Similarly, dedicated network types such as the CES network and an external network can also be configured in the cluster.
The performance of a network is monitored by measuring the data transfer through the respective interfaces. The following types of network interfaces can be monitored through the GUI:
• IP interfaces on Ethernet and InfiniBand adapters.
• Remote Direct Memory Access (RDMA) interfaces on InfiniBand adapters with Open Fabrics Enterprise
Distribution (OFED) drivers.
The GUI retrieves performance data from the performance monitoring tool. The IP-adapter-based metrics
are taken from the Network sensor and the RDMA metrics are taken from the InfiniBand sensor. If no
performance data appears in the GUI, verify that the monitoring tool is correctly set up and that these two
sensors are enabled.
The Network page also exposes adapters and IPs that are not bound to a service, to provide a full view of
the network activity on a node.
The details of the networks and their components can be obtained both in graphical as well as tabular
formats. The Network page provides the following options to analyze the performance and status of
networks and adapters:
1. A quick view that gives a graphical representation of overall IP throughput, overall RDMA throughput,
IP interfaces by bytes sent and received, and RDMA interfaces by bytes sent and received. You can
access this view by selecting the expand button that is placed next to the title of the page. You can
close this view if not required.
Graphs in the overview are refreshed regularly. The refresh intervals of the top three entities depend
on the displayed timeframe as shown:
• Every minute for the 5-minutes timeframe
• Every 15 minutes for the 1-hour timeframe
• Every 6 hours for the 24 hours timeframe
• Every two days for the 7 days timeframe
• Every seven days for the 30 days timeframe
• Every four months for the 365 days timeframe
If you click a block in the IP interfaces charts, the corresponding details are displayed in the IP
interfaces table. The table is filtered by the IP interfaces that are part of the selected block. You can
remove the filter by clicking the link that appears on the top of the table header row.
2. A table that provides different performance metrics that are available under the following tabs of the
table.
IP Interfaces
Shows all network interfaces that are part of the Ethernet and InfiniBand networks in the cluster.
To view performance details in graphical format or to see the events that are reported against an individual adapter, select the adapter in the table and then select View Details from the Actions menu.
RDMA Interfaces
Shows the details of the InfiniBand RDMA networks that are configured in the cluster. To view
performance details in graphical format or to see the events that are reported against an individual adapter, select the adapter in the table and then select View Details from the Actions menu.
The system displays the RDMA Interfaces tab only if there are RDMA interfaces available.
Networks
Shows all networks in the cluster and provides information on network types, health status, and
number of nodes and adapters that are part of the network.
IP Addresses
Lists all IP addresses that are configured in the cluster.
To find networks or adapters with extreme values, you can sort the values that are displayed in the
tables by different performance metrics. Click the performance metric in the table header to sort
the data based on that metric. You can select the time range that determines the averaging of the
values that are displayed in the table and the time range of the charts in the overview from the time
range selector, which is placed in the upper right corner. The metrics in the table do not update
automatically. Use the refresh button to refresh the table content with more recent data.
3. A detailed view of performance aspects and events reported against each adapter. To access this view,
select the adapter in the table and then select View Details from the Actions menu. The detailed view
is available for both IP and RDMA interfaces.
The Network page depends on the InfiniBand sensor to display RDMA metrics. The InfiniBand sensor
does not run if the IBM Storage Scale system is not installed by using the IBM Storage Scale install toolkit.
In such cases, the RDMA graphs in the Network page show "No Object". You can enable the InfiniBand
sensors from the GUI by performing the following steps:
1. Go to Services > GUI > Performance Monitoring.
2. Select the InfiniBand sensor that is listed under the Sensors tab.
3. Click Edit. The Edit Sensor InfiniBand dialog appears.
4. Select the toggle button that is placed next to the Interval field.
5. Select the frequency at which the sensor data is collected at the collector.
6. Select the scope of the sensor data collection in the Node scope field.
7. Click Save.
You can use the same steps to modify or start any performance sensors.
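If you prefer the command line, performance sensors can also typically be inspected and adjusted with the mmperfmon command. The following sketch assumes that the InfiniBand sensor appears under the name Infiniband in your sensor configuration and that a 10-second collection period is acceptable; verify the exact sensor name with the first command before you change anything:

mmperfmon config show
mmperfmon config update Infiniband.period=10
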

Monitoring I/O performance with the mmpmon command
Use the mmpmon command to monitor the I/O performance of IBM Storage Scale on the node on which it
is run and on other specified nodes.
Before you attempt to use the mmpmon command, it is a good idea to review the topic mmpmon command
in the IBM Storage Scale: Command and Programming Reference Guide.
Important: The performance monitoring that is driven by IBM Storage Scale's internal monitoring tool and the monitoring that users drive by using the mmpmon command might affect each other.
Next, read all of the following relevant topics of the mmpmon command:
• “Overview of mmpmon” on page 59
• “Specifying input to the mmpmon command” on page 60
• “Example mmpmon scenarios and how to analyze and interpret their results” on page 94
• “Other information about mmpmon output” on page 103
Related concepts
Network performance monitoring
Network performance can be monitored either by using Remote Procedure Call (RPC) statistics or by using the IBM Storage Scale graphical user interface (GUI).
Using the performance monitoring tool
The performance monitoring tool collects metrics from GPFS and protocols and provides system
performance information. By default, the performance monitoring tool is enabled, and it consists of
Collectors, Sensors, and Proxies.
Viewing and analyzing the performance data
The performance monitoring tool displays the performance metrics that are associated with GPFS and
the associated protocols. It helps you get a graphical representation of the status and trends of the key
performance indicators, and analyze IBM Storage Scale performance problems.

Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
The collected data can be used for the following purposes:
• Track I/O demand over longer periods of time - weeks or months.
• Record I/O patterns over time (when peak usage occurs, and so forth).
• Determine whether some nodes service more application demand than others.
• Monitor the I/O patterns of a single application, which is spread across multiple nodes.
• Record application I/O request service times.
Figure 1 on page 59 shows the software layers in a typical system with GPFS. The mmpmon command is
built into GPFS.

Figure 1. Node running the mmpmon command

Related concepts
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Specifying input to the mmpmon command


The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
The mmpmon command must be run by using root authority. For command syntax, see the mmpmon
command in the IBM Storage Scale: Administration Guide.
The mmpmon command is controlled by an input file that contains a series of requests, one per line. This
input can be specified with the -i flag, or read from standard input (stdin). Providing input by using
stdin allows the mmpmon command to take keyboard input or output that is piped from a user script or
application.
Leading blanks in the input file are ignored. A line beginning with a pound sign (#) is treated as a
comment. Leading blanks in a line whose first nonblank character is a pound sign (#) are ignored.
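For example, a minimal input file (the file name commandFile is used here only for illustration) might contain a comment and two requests:

# commandFile: collect node-wide and per-file-system statistics
io_s
fs_io_s

The same requests can also be piped through standard input, for example with echo "io_s" | mmpmon -p.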
Table 9 on page 60 describes the mmpmon command input requests.

Table 9. Input requests to the mmpmon command


Request Description
fs_io_s “Display I/O statistics per mounted file system” on page 62
io_s “Display I/O statistics for the entire node” on page 65
nlist add name[ name...] “Add node names to a list of nodes for mmpmon processing” on page 68
nlist del “Delete a node list” on page 69
nlist new name[ name...] “Create a new node list” on page 70
nlist s “Show the contents of the current node list” on page 70
nlist sub name[ name...] “Delete node names from a list of nodes for mmpmon processing” on page
71
once request Indicates that the request is to be performed only once.
reset “Reset statistics to zero” on page 75
rhist nr “Changing the request histogram facility request size and latency ranges” on
page 79
rhist off “Disabling the request histogram facility” on page 81. This is the default.
rhist on “Enabling the request histogram facility” on page 82
rhist p “Displaying the request histogram facility pattern” on page 82
rhist reset “Resetting the request histogram facility data to zero” on page 85
rhist s “Displaying the request histogram facility statistics values” on page 86
rpc_s “Displaying the aggregation of execution time for Remote Procedure Calls
(RPCs)” on page 89
rpc_s size “Displaying the Remote Procedure Call (RPC) execution time according to the
size of messages” on page 91
source filename “Using request source and prefix directive once” on page 98
ver “Displaying mmpmon version” on page 93
vio_s "Displaying vdisk I/O statistics". For more information, see IBM Storage Scale
RAID: Administration.
vio_s_reset "Resetting vdisk I/O statistics". For more information, see IBM Storage Scale
RAID: Administration.

Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Running mmpmon on multiple nodes


Invoke mmpmon node list (nlist) requests on a single node for mmpmon request processing on multiple nodes in a
local cluster.
The mmpmon command may be invoked on one node to submit requests to multiple nodes in a local GPFS
cluster by using the nlist requests. See “Understanding the node list facility” on page 67.

Running mmpmon concurrently from multiple users on the same node


Multiple instances of the mmpmon command can run on the same node so that different performance
analysis applications and scripts can use the same performance data.
Up to five instances of the mmpmon command can be run on a given node concurrently. This capability is intended primarily to allow different user-written performance analysis applications or scripts to work with the performance data. For example, one analysis application might deal with fs_io_s and io_s data, while another deals with rhist data, and a third gathers data from other nodes in the cluster. The
applications might be separately written or separately maintained, or have different sleep and wake-up
schedules.
Be aware that there is only one set of counters for fs_io_s and io_s data, and a separate set for rhist
data. Multiple analysis applications dealing with the same set of data must coordinate any activities that
could reset the counters, or in the case of rhist requests, disable the feature or modify the ranges.
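As a sketch, two independent instances might be started with separate input files (the file names are hypothetical):

mmpmon -p -i /tmp/fsio.cmd > /tmp/fsio.out &
mmpmon -p -i /tmp/rhist.cmd > /tmp/rhist.out &

Each input file contains a different set of requests, so the output streams can be consumed by different analysis scripts, although both instances still share the same underlying counters as described above.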

Display I/O statistics per mounted file system


The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
The fs_io_s (file system I/O statistics) request returns strings that contain I/O statistics taken over all mounted file systems as seen by that node, presented as total values for each file system. The
values are cumulative since the file systems were mounted or since the last reset request, whichever is
most recent. When a file system is unmounted, its statistics are lost.
Read and write statistics are recorded separately. The statistics for a given file system are for the file
system activity on the node running the mmpmon command, not the file system in total (across the cluster).
Table 10 on page 63 describes the keywords for the fs_io_s response, in the order that they appear in
the output. These keywords are used only when the mmpmon command is invoked with the -p flag.

Table 10. Keywords and values for the mmpmon fs_io_s response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_cl_ Name of the cluster that owns the file system.
_fs_ The name of the file system for which data are being presented.
_d_ The number of disks in the file system.
_br_ Total number of bytes read from both disk and cache.
_bw_ Total number of bytes written to both disk and cache.
_oc_ Count of open() call requests serviced by GPFS. This also includes creat() call counts.
_cc_ Number of close() call requests serviced by GPFS.
_rdc_ Number of application read requests serviced by GPFS.
_wc_ Number of application write requests serviced by GPFS.
_dir_ Number of readdir() call requests serviced by GPFS.
_iu_ Number of inode updates to disk.

Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Example of mmpmon fs_io_s request


This is an example of the fs_io_s input request to the mmpmon command and the resulting output that
displays the I/O statistics per mounted file system.
Assume that commandFile contains this line:

fs_io_s

and this command is issued:

mmpmon -p -i commandFile

The output is two lines in total, and similar to this:

_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431 _cl_ myCluster.xxx.com
_fs_ gpfs2 _d_ 2 _br_ 6291456 _bw_ 314572800 _oc_ 10 _cc_ 16 _rdc_ 101 _wc_ 300 _dir_ 7 _iu_ 2
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407455 _cl_ myCluster.xxx.com
_fs_ gpfs1 _d_ 3 _br_ 5431636 _bw_ 173342800 _oc_ 6 _cc_ 8 _rdc_ 54 _wc_ 156 _dir_ 3 _iu_ 6

The output consists of one string per mounted file system. In this example, there are two mounted file
systems, gpfs1 and gpfs2.
If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.1.8 name node1 fs_io_s OK


cluster: myCluster.xxx.com
filesystem: gpfs2
disks: 2
timestamp: 1066660148/407431
bytes read: 6291456
bytes written: 314572800
opens: 10
closes: 16
reads: 101
writes: 300
readdir: 7
inode updates: 2

mmpmon node 199.18.1.8 name node1 fs_io_s OK


cluster: myCluster.xxx.com
filesystem: gpfs1
disks: 3
timestamp: 1066660148/407455
bytes read: 5431636
bytes written: 173342800
opens: 6
closes: 8
reads: 54
writes: 156
readdir: 3
inode updates: 6

When no file systems are mounted, the responses are similar to:

_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 1 _t_ 1066660148 _tu_ 407431 _cl_ - _fs_ -

The _rc_ field is non-zero, and both the _fs_ and _cl_ fields contain a minus sign. If the -p flag is not
specified, then the results are similar to:

mmpmon node 199.18.1.8 name node1 fs_io_s status 1


no file systems mounted

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Display I/O statistics for the entire node


The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
The io_s (I/O statistics) request returns strings that contain I/O statistics taken over all mounted file systems as seen by that node, presented as total values for the entire node. The values are
cumulative since the file systems were mounted or since the last reset, whichever is most recent. When
a file system is unmounted, its statistics are lost and its contribution to the total node statistics vanishes.
Read and write statistics are recorded separately.
Table 11 on page 65 describes the keywords for the io_s response, in the order that they appear in the
output. These keywords are used only when the mmpmon command is invoked with the -p flag.

Table 11. Keywords and values for the mmpmon io_s response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_br_ Total number of bytes that are read from both disk and cache.
_bw_ Total number of bytes that are written to both disk and cache.
_oc_ Count of open() call requests that are serviced by GPFS. The open count also
includes creat() call counts.
_cc_ Number of close() call requests that are serviced by GPFS.
_rdc_ Number of application read requests that are serviced by GPFS.
_wc_ Number of application write requests that are serviced by GPFS.
_dir_ Number of readdir() call requests that are serviced by GPFS.
_iu_ Number of inode updates to disk, which includes inodes flushed to disk because of
access time updates.

Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Example of mmpmon io_s request


This is an example of the io_s input request to the mmpmon command and the resulting output that
displays the I/O statistics for the entire node.
Assume that commandFile contains this line:

io_s

and this command is issued:

mmpmon -p -i commandFile

The output is one line in total, and similar to this:

_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431 _br_ 6291456
_bw_ 314572800 _oc_ 10 _cc_ 16 _rdc_ 101 _wc_ 300 _dir_ 7 _iu_ 2

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 io_s OK


timestamp: 1066660148/407431
bytes read: 6291456
bytes written: 314572800
opens: 10
closes: 16
reads: 101
writes: 300
readdir: 7
inode updates: 2

Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.

Table 12. nlist requests for the mmpmon command


Request Description
nlist add “Add node names to a list of nodes for mmpmon processing” on page 68
name[ name...]
nlist del “Delete a node list” on page 69
nlist new “Create a new node list” on page 70
name[ name...]
nlist s “Show the contents of the current node list” on page 70
nlist sub “Delete node names from a list of nodes for mmpmon processing” on page 71
name[ name...]

When specifying node names, keep these points in mind:


1. A node name of '.' (dot) indicates the current node.
2. A node name of '*' (asterisk) indicates all currently connected local cluster nodes (see the example after this list).
3. The nodes named in the node list must belong to the local cluster. Nodes in remote clusters are not
supported.
4. A node list can contain nodes that are currently down. When an inactive node comes up, mmpmon
attempts to gather data from it.
5. If a node list contains an incorrect or unrecognized node name, then all other entries in the list are
processed. Suitable messages are issued for an incorrect node name.
6. When the mmpmon command gathers responses from the nodes in a node list, the full response from
one node is presented before the next node. Data is not interleaved. There is no guarantee of the order
of node responses.
7. The node that issues the mmpmon command need not appear in the node list. The case of this node
serving only as a collection point for data from other nodes is a valid configuration.
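For example, the following input file, shown only as a sketch, builds a node list from all currently connected local cluster nodes, shows the resulting list, and then gathers per-file-system statistics from those nodes:

nlist new *
nlist s
fs_io_s

The file is processed in the usual way, for example with mmpmon -p -i commandFile.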
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Add node names to a list of nodes for mmpmon processing


The nlist add (node list add) request is used to add node names to a list of nodes for the mmpmon
command to collect their data. The node names are separated by blanks.
Table 13 on page 68 describes the keywords for the nlist add response, in the order that they appear
in the output. These keywords are used only when the mmpmon command is started with the -p flag.

Table 13. Keywords and values for the mmpmon nlist add response
Keyword Description
_n_ IP address of the node that is processing the node list. This is the address by which
GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is add.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_c_ The number of nodes in the user supplied list.
_ni_ Node name input. A user-supplied node name from the offered list of names.
_nx_ Node name conversion. The preferred GPFS name for the node.
_nxip_ Node name converted IP address. The preferred GPFS IP address for the node.
_did_ The number of node names considered valid and processed by the request.
_nlc_ The number of nodes in the node list now (after all processing).

If the nlist add request is issued when no node list exists, it is handled as if it were an nlist new
request.

Example of mmpmon nlist add request


This topic is an example of the nlist add request to add node names to a list of nodes for mmpmon
processing and the output that displays.
A two-node cluster has nodes node1 (199.18.1.2), a non-quorum node, and node2 (199.18.1.5), a
quorum node. A remote cluster has node node3 (199.18.1.8). The mmpmon command is run on node1.
Assume that commandFile contains this line:

nlist add n2 199.18.1.2

and this command is issued:

mmpmon -p -i commandFile

Note in this example that an alias name n2 was used for node2, and an IP address was used for node1.
Notice how the values for _ni_ and _nx_ differ in these cases.
The output is similar to this:

_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _c_ 2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _ni_ n2 _nx_
node2 _nxip_ 199.18.1.5
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _ni_
199.18.1.2 _nx_ node1 _nxip_ 199.18.1.2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _did_ 2 _nlc_
2

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.2 name node1 nlist add


initial status 0
name count 2
timestamp 1121955879/468858
node name n2, OK (name used: node2, IP address 199.18.1.5)
node name 199.18.1.2, OK (name used: node1, IP address 199.18.1.2)
final status 0
node names processed 2
current node list count 2

The requests nlist add and nlist sub behave in a similar way and use the same keyword and
response format.
These requests are rejected if issued while quorum has been lost.

Delete a node list


The nlist del (node list delete) request deletes a node list if one exists. If no node list exists, then the
request succeeds, and no error code is produced.
Table 14 on page 69 describes the keywords for the nlist del response, in the order that they appear
in the output. These keywords are used only when the mmpmon command is invoked with the -p flag.

Table 14. Keywords and values for the mmpmon nlist del response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is del.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

Example of mmpmon nlist del request


This topic is an example of the nlist del request to delete a node list and the output that displays.
Assume that commandFile contains this line:

nlist del

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ del _rc_ 0 _t_ 1121956817 _tu_ 46050

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.2 name node1 nlist del status OK timestamp 1121956908/396381

Create a new node list


The nlist new (node list new) request deletes the current node list if one exists, creates a new, empty
node list, and then attempts to add the specified node names to the node list. The node names are
separated by blanks.
Table 15 on page 70 describes the keywords for the nlist new response, in the order that they appear
in the output. These keywords are used only when the mmpmon command is started with the -p flag.

Table 15. Keywords and values for the mmpmon nlist new response
Keyword Description
_n_ IP address of the node that is responding. This is the address by which GPFS knows
the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is new.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

Show the contents of the current node list


The nlist s (node list show) request displays the current contents of the node list. If no node list exists,
then a count of zero is returned and no error is produced.
Table 16 on page 70 describes the keywords for the nlist s response, in the order that they appear in
the output. These keywords are used only when the mmpmon command is started with the -p flag.

Table 16. Keywords and values for the mmpmon nlist s response
Keyword Description
_n_ IP address of the node that is processing the request. This is the address by which
GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is s.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_c_ Number of nodes in the node list.
_mbr_ GPFS preferred node name for the list member.
_ip_ GPFS preferred IP address for the list member.

Example of mmpmon nlist s request
This topic is an example of the nlist s request to show the contents of the current node list and the
output that displays.
Assume that commandFile contains this line:

nlist s

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121956950 _tu_ 863292 _c_ 2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121956950 _tu_ 863292 _mbr_ node1
_ip_ 199.18.1.2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121956950 _tu_ 863292 _mbr_
node2 _ip_ 199.18.1.5

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.2 name node1 nlist s


status 0
name count 2
timestamp 1121957505/165931
node name node1, IP address 199.18.1.2
node name node2, IP address 199.18.1.5

If there is no node list, the response looks like:

_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121957395 _tu_ 910440 _c_ 0

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.2 name node1 nlist s


status 0
name count 0
timestamp 1121957436/353352
the node list is empty

The nlist s request is rejected if issued while quorum has been lost. Only one response line is
presented.

_failed_ _n_ 199.18.1.8 _nn_ node2 _rc_ 668 _t_ 1121957395 _tu_ 910440

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node2: failure status 668 timestamp 1121957395/910440
lost quorum

Delete node names from a list of nodes for mmpmon processing


The nlist sub (subtract a node from the node list) request removes a node from a list of node names.
These keywords and responses are similar to the nlist add request. The _req_ keyword (action
requested) for nlist sub is sub.
For more information, see the topic “Add node names to a list of nodes for mmpmon processing” on page
68.

Node list examples and error handling


The nlist facility can be used to obtain GPFS performance data from nodes other than the one on which
the mmpmon command is invoked. This information is useful to see the flow of GPFS I/O from one node to
another, and spot potential problems.

A successful fs_io_s request propagated to two nodes
This topic is an example of a successful fs_io_s request to two nodes to display the I/O statistics per
mounted file system and the resulting system output.
This command is issued:

mmpmon -p -i command_file

where command_file has this:

nlist new node1 node2


fs_io_s

The output is similar to this:

_fs_io_s_ _n_ 199.18.1.2 _nn_ node1 _rc_ 0 _t_ 1121974197 _tu_ 278619 _cl_
xxx.localdomain _fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.2 _nn_ node1 _rc_ 0 _t_ 1121974197 _tu_ 278619 _cl_
xxx.localdomain _fs_ gpfs1 _d_ 1 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974167 _tu_ 116443 _cl_
cl1.xxx.com _fs_ fs3 _d_ 3 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 3
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974167 _tu_ 116443 _cl_
cl1.xxx.comm _fs_ fs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974167 _tu_ 116443 _cl_
xxx.localdomain _fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_dir_ 0 _iu_ 0

The responses from a propagated request are the same as if the requests were issued on each node separately.
If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.2 name node1 fs_io_s OK


cluster: xxx.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1121974088/463102
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.2 name node1 fs_io_s OK


cluster: xxx.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1121974088/463102
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.5 name node2 fs_io_s OK


cluster: cl1.xxx.com
filesystem: fs3
disks: 3
timestamp: 1121974058/321741
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 2

mmpmon node 199.18.1.5 name node2 fs_io_s OK
cluster: cl1.xxx.com
filesystem: fs2
disks: 2
timestamp: 1121974058/321741
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.5 name node2 fs_io_s OK


cluster: xxx.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1121974058/321741
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Failure on a node accessed by mmpmon


This is an example of the system output for a failed request to two nodes to display the I/O statistics per
mounted file system.
In this example, the same scenario described in “A successful fs_io_s request propagated to two nodes”
on page 72 is run on node2, but with a failure on node1 (a non-quorum node) because node1 was shut
down:

_failed_ _n_ 199.18.1.5 _nn_ node2 _fn_ 199.18.1.2 _fnn_ node1 _rc_ 233
_t_ 1121974459 _tu_ 602231
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974459 _tu_ 616867 _cl_
cl1.xxx.com _fs_ fs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974459 _tu_ 616867 _cl_
cl1.xxx.com _fs_ fs3 _d_ 3 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974459 _tu_ 616867 _cl_
node1.localdomain _fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.5 name node2:


from node 199.18.1.2 from name node1: failure status 233 timestamp 1121974459/602231
node failed (or never started)
mmpmon node 199.18.1.5 name node2 fs_io_s OK
cluster: cl1.xxx.com
filesystem: fs2
disks: 2
timestamp: 1121974544/222514
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.5 name node2 fs_io_s OK


cluster: cl1.xxx.com
filesystem: fs3
disks: 3
timestamp: 1121974544/222514
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.5 name node2 fs_io_s OK


cluster: xxx.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1121974544/222514
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

Node shutdown and quorum loss


In this example, the quorum node (node2) is shut down, causing quorum loss on node1. Running the
same example on node2, the output is similar to:

_failed_ _n_ 199.18.1.2 _nn_ node1 _rc_ 668 _t_ 1121974459 _tu_ 616867

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.2 name node1: failure status 668 timestamp 1121974459/616867
lost quorum

In this scenario there can be a window where node2 is down and node1 has not yet lost quorum. When
quorum loss occurs, the mmpmon command does not attempt to communicate with any nodes in the node
list. The goal with failure handling is to accurately maintain the node list across node failures, so that
when nodes come back up they again contribute to the aggregated responses.

Node list failure values


Table 17 on page 74 describes the keywords and values produced by the mmpmon command on a node
list failure:

Table 17. Keywords and values for the mmpmon nlist failures
Keyword Description
_n_ IP address of the node processing the node list. This is the address by which GPFS
knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_fn_ IP address of the node that is no longer responding to mmpmon requests.
_fnn_ The name by which GPFS knows the node that is no longer responding to mmpmon
requests.
_rc_ Indicates the status of the operation. See “Return codes from mmpmon” on page 104.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Table 18 on page 75 describes the keywords for the reset response, in the order that they appear in
the output. These keywords are used only when the mmpmon command is started with the -p flag. The
response is a single string.

Table 18. Keywords and values for the mmpmon reset response
Keyword Description
_n_ IP address of the node that is responding. This is the address by which GPFS knows
the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Example of mmpmon reset request


This topic is an example of how to reset file system I/O and I/O statistics to zero.
Assume that commandFile contains this line:

reset

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_reset_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 reset OK

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Understanding the request histogram facility


Use the mmpmon rhist command requests to control the request histogram facility.
The request histogram facility tallies I/O operations using a set of counters. Counters for reads and writes
are kept separately. They are categorized according to a pattern that might be customized by the user. A
default pattern is also provided. The size range and latency range input parameters to the rhist
nr request are used to define the pattern.
Collecting histogram data might cause performance degradation, so the first time that you run the rhist requests, assess whether the degradation is noticeable. Degradation can occur only while the histogram facility is enabled, and it is probably not noticeable while the commands themselves are running; it is more of a long-term consideration because the GPFS daemon runs with histograms enabled.
The histogram lock is used to prevent two rhist requests from being processed simultaneously. If an
rhist request fails with an _rc_ of 16, then the lock is in use. Reissue the request.
The histogram data survives file system mounts and unmounts. To reset this data, use the rhist reset
request.
Table 19 on page 76 describes the rhist requests:

Table 19. rhist requests for the mmpmon command


Request Description
rhist nr “Changing the request histogram facility request size and latency ranges” on page 79
rhist off “Disabling the request histogram facility” on page 81. This is the default.
rhist on “Enabling the request histogram facility” on page 82
rhist p “Displaying the request histogram facility pattern” on page 82
rhist reset “Resetting the request histogram facility data to zero” on page 85
rhist s “Displaying the request histogram facility statistics values” on page 86
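
As an illustrative sketch, the facility can be enabled from a simple input file (the file name histogram.cmd is hypothetical):

# histogram.cmd: enable the request histogram facility with the default pattern
rhist on

After the workload has run for a while, a second input file that contains rhist s followed by rhist reset displays the accumulated counts and then clears them. Both files are processed with mmpmon -p -i in the usual way.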

Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Specifying the size ranges for I/O histograms


The I/O histogram size ranges are used to categorize the I/O according to the size, in bytes, of the I/O
operation.
The size ranges are specified using a string of positive integers separated by semicolons (;). No white
space is allowed within the size range operand. Each number represents the upper bound, in bytes, of
the I/O request size for that range. The numbers must be monotonically increasing. Each number may
be optionally followed by the letters K or k to denote multiplication by 1024, or by the letters M or m to
denote multiplication by 1048576 (1024*1024).
For example, the size range operand:

512;1m;4m

represents these four size ranges

0 to 512 bytes
513 to 1048576 bytes
1048577 to 4194304 bytes
4194305 and greater bytes

In this example, a read of size 3 MB would fall in the third size range, and a write of size 20 MB would fall in
the fourth size range.
A size range operand of = (equal sign) indicates that the current size range is not to be changed. A size
range operand of * (asterisk) indicates that the current size range is to be changed to the default size
range. A maximum of 15 numbers may be specified, which produces 16 total size ranges.
The default request size ranges are:

0 to 255 bytes
256 to 511 bytes
512 to 1023 bytes
1024 to 2047 bytes
2048 to 4095 bytes
4096 to 8191 bytes
8192 to 16383 bytes
16384 to 32767 bytes
32768 to 65535 bytes
65536 to 131071 bytes
131072 to 262143 bytes
262144 to 524287 bytes
524288 to 1048575 bytes
1048576 to 2097151 bytes
2097152 to 4194303 bytes
4194304 and greater bytes

The last size range collects all request sizes greater than or equal to 4 MB. The request size ranges can be
changed by using the rhist nr request.
For more information, see “Processing of rhist nr” on page 79.

Specifying the latency ranges for I/O


The I/O histogram latency ranges are used to categorize the I/O according to the latency time, in
milliseconds, of the I/O operation.
A full set of latency ranges is produced for each size range. The latency ranges are the same for each
size range.
The latency ranges are changed using a string of positive decimal numbers separated by semicolons (;).
No white space is allowed within the latency range operand. Each number represents the upper bound of
the I/O latency time (in milliseconds) for that range. The numbers must be monotonically increasing. If
decimal places are present, they are truncated to tenths.
For example, the latency range operand:

1.3;4.59;10

represents these four latency ranges:

0.0 to 1.3 milliseconds
1.4 to 4.5 milliseconds
4.6 to 10.0 milliseconds
10.1 and greater milliseconds

In this example, a read that completes in 0.85 milliseconds falls into the first latency range. A write that
completes in 4.56 milliseconds falls into the second latency range, due to the truncation.
A latency range operand of = (equal sign) indicates that the current latency range is not to be changed.
A latency range operand of * (asterisk) indicates that the current latency range is to be changed to the
default latency range. If the latency range operand is missing, * (asterisk) is assumed. A maximum of 15
numbers may be specified, which produces 16 total latency ranges.
The latency times are in milliseconds. The default latency ranges are:

0.0 to 1.0 milliseconds
1.1 to 10.0 milliseconds
10.1 to 30.0 milliseconds
30.1 to 100.0 milliseconds
100.1 to 200.0 milliseconds
200.1 to 400.0 milliseconds
400.1 to 800.0 milliseconds
800.1 to 1000.0 milliseconds
1000.1 and greater milliseconds

The last latency range collects all latencies greater than or equal to 1000.1 milliseconds. The latency
ranges can be changed by using the rhist nr request.
For more information, see “Processing of rhist nr” on page 79.
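For example, the following request, shown here only as a sketch of an mmpmon input file line, keeps the current size ranges (=) and resets the latency ranges to the default (*):

rhist nr = *

Because rhist nr implies an rhist reset and the facility is disabled while the new ranges are built, issue rhist on afterward to resume collecting histogram data.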

Changing the request histogram facility request size and latency ranges
The rhist nr (new range) request allows the user to change the size and latency ranges used in the
request histogram facility.
The use of rhist nr implies an rhist reset. Counters for read and write operations are recorded
separately. If there are no mounted file systems at the time rhist nr is issued, the request still runs.
The size range operand appears first, followed by a blank, and then the latency range operand.
Table 20 on page 79 describes the keywords for the rhist nr response, in the order that they appear
in the output. These keywords are used only when mmpmon is invoked with the -p flag.

Table 20. Keywords and values for the mmpmon rhist nr response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is nr.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.

Processing of rhist nr
The rhist nr request changes the request histogram facility request size and latency ranges.
Processing of rhist nr is as follows:
1. The size range and latency range operands are parsed and checked for validity. If they are not valid, an
error is returned and processing terminates.
2. The histogram facility is disabled.
3. The new ranges are created, by defining the following histogram counters:
a. Two sets, one for read and one for write.
b. Within each set, one category for each size range.
c. Within each size range category, one counter for each latency range.
For example, if the user specifies 11 numbers for the size range operand and 2 numbers for the
latency range operand, this produces 12 size ranges, each having 3 latency ranges, because there is
one additional range for the top endpoint. The total number of counters is 72: 36 read counters and
36 write counters.
4. The new ranges are made current.
5. The old ranges are discarded. Any accumulated histogram data is lost.
The histogram facility must be explicitly enabled again using rhist on to begin collecting histogram data
using the new ranges.
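For example, a minimal command file for this purpose (shown here only as an illustration; the ranges are
arbitrary) could set new size ranges, keep the current latency ranges by specifying = for the latency range
operand, and then re-enable collection:

rhist nr 512;1m;4m =
rhist on

Passing such a file to mmpmon with the -i flag, as in the other examples in this chapter, applies the new
ranges and starts gathering histogram data with them.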
The mmpmon command does not have the ability to collect data only for read operations, or only for write
operations. The mmpmon command does not have the ability to specify size or latency ranges that have
different values for read and write operations. The mmpmon command does not have the ability to specify
latency ranges that are unique to a given size range.
For more information, see “Specifying the size ranges for I/O histograms” on page 77 and “Specifying the
latency ranges for I/O” on page 78.

Example of mmpmon rhist nr request


This topic is an example of using rhist nr to change the request histogram facility request size and
latency ranges.
Assume that commandFile contains this line:

rhist nr 512;1m;4m 1.3;4.5;10

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ nr 512;1m;4m 1.3;4.5;10 _rc_ 0 _t_ 1078929833 _tu_
765083

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist nr 512;1m;4m 1.3;4.5;10 OK

In this case, mmpmon has been instructed to keep a total of 32 counters. There are 16 for read and 16 for
write. For the reads, there are four size ranges, each of which has four latency ranges. The same is true for
the writes. They are as follows:

size range 0 to 512 bytes
latency range 0.0 to 1.3 milliseconds
latency range 1.4 to 4.5 milliseconds
latency range 4.6 to 10.0 milliseconds
latency range 10.1 and greater milliseconds
size range 513 to 1048576 bytes
latency range 0.0 to 1.3 milliseconds
latency range 1.4 to 4.5 milliseconds
latency range 4.6 to 10.0 milliseconds
latency range 10.1 and greater milliseconds
size range 1048577 to 4194304 bytes
latency range 0.0 to 1.3 milliseconds
latency range 1.4 to 4.5 milliseconds
latency range 4.6 to 10.0 milliseconds
latency range 10.1 and greater milliseconds
size range 4194305 and greater bytes
latency range 0.0 to 1.3 milliseconds
latency range 1.4 to 4.5 milliseconds
latency range 4.6 to 10.0 milliseconds
latency range 10.1 and greater milliseconds

In this example, a read of size 15 MB that completes in 17.8 milliseconds would fall in the last latency
range listed here. When this read completes, the counter for the last latency range is increased by one.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
An example of an unsuccessful response is:

_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ nr 512;1m;4m 1;4;8;2 _rc_ 22 _t_ 1078929596 _tu_ 161683

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist nr 512;1m;4m 1;4;8;2 status 22 range error

In this case, the last value in the latency range, 2, is out of numerical order.
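Such range errors can be caught before the request is submitted. The following awk sketch is illustrative
only and is not part of mmpmon; it checks that a semicolon-separated operand is monotonically increasing (it
does not understand the K and M suffixes of the size range operand, and the file name checkranges.awk is
arbitrary):

# Illustrative sketch only: verify that a range operand, such as a latency
# range string, is monotonically increasing before passing it to rhist nr.
# Usage (assumed): echo "1;4;8;2" | awk -f checkranges.awk
{
    n = split($1, v, ";")
    ok = 1
    for (i = 2; i <= n; i++) {
        if (v[i] + 0 <= v[i - 1] + 0) {
            printf("value %s at position %d is out of order\n", v[i], i)
            ok = 0
        }
    }
    if (ok) print "operand is monotonically increasing"
}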

Note that the request rhist nr = = does not make any changes. It is ignored.
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Disabling the request histogram facility


The rhist off request disables the request histogram facility. This is the default state.
The data objects remain persistent, and the data they contain is not disturbed. This data is not updated
again until rhist on is issued. rhist off may be combined with rhist on as often as desired. If
there are no mounted file systems at the time rhist off is issued, the facility is still disabled. The
response is a single string.
Table 21 on page 81 describes the keywords for the rhist off response, in the order that they appear
in the output. These keywords are used only when mmpmon is invoked with the -p flag.

Table 21. Keywords and values for the mmpmon rhist off response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is off.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.

Example of mmpmon rhist off request


This topic is an example of the rhist off request to disable the histogram facility and the output that
displays.
Assume that commandFile contains this line:

rhist off

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ off _rc_ 0 _t_ 1066938820 _tu_ 5755

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist off OK

An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.

mmpmon node 199.18.1.8 name node1 rhist off status 16 lock is busy

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Enabling the request histogram facility
The rhist on request enables the request histogram facility.
When rhist on is invoked the first time, this request creates the necessary data objects to support
histogram data gathering. This request may be combined with rhist off (or another rhist on) as
often as desired. If there are no mounted file systems at the time rhist on is issued, the facility is still
enabled. The response is a single string.
Table 22 on page 82 describes the keywords for the rhist on response, in the order that they appear
in the output. These keywords are used only when mmpmon is invoked with the -p flag.

Table 22. Keywords and values for the mmpmon rhist on response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is on.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.

Example of mmpmon rhist on request


This topic is an example of the rhist on request to enable the request histogram facility and the output
that displays.
Assume that commandFile contains this line:

rhist on

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ on _rc_ 0 _t_ 1066936484 _tu_ 179346

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist on OK

An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.

mmpmon node 199.18.1.8 name node1 rhist on status 16 lock is busy

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Displaying the request histogram facility pattern


The rhist p request displays the request histogram facility pattern.
The rhist p request returns the entire enumeration of the request size and latency ranges. The facility
must be enabled for a pattern to be returned. If there are no mounted file systems at the time this request
is issued, then the request still runs and returns data. The pattern is displayed for both read and write.

Table 23 on page 83 describes the keywords for the rhist p response, in the order that they appear in
the output. These keywords are used only when the mmpmon command is invoked with the -p flag.

Table 23. Keywords and values for the mmpmon rhist p response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is p.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_k_ The kind, r or w, (read or write) depending on what the statistics are for.
_R_ Request size range, minimum, and maximum number of bytes.
_L_ Latency range, minimum and maximum, in milliseconds.

The request size ranges are in bytes. The zero-value used for the higher limit of the last size range means
'and higher'. The request size ranges can be changed by using the rhist nr request.
The latency times are in milliseconds. The zero-value used for the higher limit of the last latency range
means 'and higher'. The latency ranges can be changed by using the rhist nr request.
The rhist p request allows an application to query for the entire latency pattern. The application
can then configure itself accordingly. Since latency statistics are reported only for ranges with non-zero
counts, the statistics responses might be sparse. By querying for the pattern, an application can be
certain to learn the complete histogram set. The user may have changed the pattern by using the rhist
nr request. For this reason, an application should query for the pattern and analyze it before requesting
statistics.
If the facility has never been enabled, then the _rc_ field is non-zero. An _rc_ value of 16 indicates that
the histogram operations lock is busy. Retry the request.
If the facility has been previously enabled, then the rhist p request still displays the pattern even when
rhist off is currently in effect.
If there are no mounted file systems at the time rhist p is issued, then the pattern is still displayed.
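The following awk sketch is illustrative only and is not part of mmpmon; it shows one way an application
could list the pattern from the -p form of the rhist p response, such as the output in the example that
follows (the file name rhistpat.awk is arbitrary, and the field positions assume the keyword order shown in
the table above):

# Illustrative sketch only: list the size and latency pattern from the -p
# form of an rhist p response. Each _R_ line carries a size range and each
# _L_ line carries a latency range that belongs to the most recent _R_ line.
# An upper bound of 0 means 'and higher'.
# Usage (assumed): awk -f rhistpat.awk output.file
$1 == "_rhist_" && $7 == "p" { kind = ($NF == "r") ? "read" : "write" }
$1 == "_R_"                  { lo = $2; hi = $3 }
$1 == "_L_"                  { printf("%-5s size %s-%s latency %s-%s ms\n",
                                      kind, lo, hi, $2, $3) }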

Example of mmpmon rhist p request


This topic is an example of the rhist p request to display the request histogram facility pattern and the
output that displays.
Assume that commandFile contains this line:

rhist p

and this command is issued:

mmpmon -p -i commandFile

The response contains all the latency ranges inside each of the request ranges. The data are separate for
read and write:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ r
... data for reads ...
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ w
... data for writes ...
_end_

If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist p OK read
... data for reads ...
mmpmon node 199.18.1.8 name node1 rhist p OK write
... data for writes ...

Here is an example of data for reads:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ r
_R_ 0 255
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
_R_ 256 511
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
_R_ 512 1023
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
...
_R_ 4194304 0
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0

If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist p OK read
size range 0 to 255
latency range 0.0 to 1.0
latency range 1.1 to 10.0
latency range 10.1 to 30.0
latency range 30.1 to 100.0
latency range 100.1 to 200.0
latency range 200.1 to 400.0
latency range 400.1 to 800.0
latency range 800.1 to 1000.0
latency range 1000.1 to 0
size range 256 to 511
latency range 0.0 to 1.0
latency range 1.1 to 10.0
latency range 10.1 to 30.0
latency range 30.1 to 100.0
latency range 100.1 to 200.0
latency range 200.1 to 400.0
latency range 400.1 to 800.0
latency range 800.1 to 1000.0
latency range 1000.1 to 0
size range 512 to 1023
latency range 0.0 to 1.0
latency range 1.1 to 10.0
latency range 10.1 to 30.0
latency range 30.1 to 100.0
latency range 100.1 to 200.0
latency range 200.1 to 400.0
latency range 400.1 to 800.0
latency range 800.1 to 1000.0
latency range 1000.1 to 0
...
size range 4194304 to 0
latency range 0.0 to 1.0
latency range 1.1 to 10.0
latency range 10.1 to 30.0
latency range 30.1 to 100.0
latency range 100.1 to 200.0
latency range 200.1 to 400.0
latency range 400.1 to 800.0
latency range 800.1 to 1000.0
latency range 1000.1 to 0

If the facility has never been enabled, then the _rc_ field is non-zero.

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 1 _t_ 1066939007 _tu_ 386241

If the -p flag is not specified, then the output is similar to this:

mmpmon node 199.18.1.8 name node1 rhist p status 1 not yet enabled

For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.

Resetting the request histogram facility data to zero


The rhist reset request resets the histogram statistics.
Table 24 on page 85 describes the keywords for the rhist reset response, in the order that they
appear in the output. These keywords are used only when the mmpmon command is invoked with the -p
flag. The response is a single string.

Table 24. Keywords and values for the mmpmon rhist reset response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is reset.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.

If the facility has been previously enabled, then the reset request still resets the statistics even when
rhist off is currently in effect. If there are no mounted file systems at the time rhist reset is
issued, then the statistics are still reset.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.

Example of mmpmon rhist reset request


This topic is an example of the rhist reset request to reset the histogram facility data to zero and the
output that displays.
Assume that commandFile contains this line:

rhist reset

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ reset _rc_ 0 _t_ 1066939007 _tu_ 386241

If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist reset OK

If the facility has never been enabled, then the _rc_ value is non-zero:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ reset _rc_ 1 _t_ 1066939143 _tu_ 148443

If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist reset status 1 not yet enabled

For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.

Displaying the request histogram facility statistics values


The rhist s request returns the current values for all latency ranges which have a nonzero count.
Table 25 on page 86 describes the keywords for the rhist s response, in the order that they appear in
the output. These keywords are used only when the mmpmon command is invoked with the -p flag.

Table 25. Keywords and values for the mmpmon rhist s response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is s.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_k_ The kind, r or w, (read or write) depending on what the statistics are for.
_R_ Request size range, minimum and maximum number of bytes.
_NR_ Number of requests that fell in this size range.
_L_ Latency range, minimum and maximum, in milliseconds.
_NL_ Number of requests that fell in this latency range. The sum of all _NL_ values for a
request size range equals the _NR_ value for that size range.

If the facility has been previously enabled, the rhist s request still displays the statistics even if rhist
off is currently in effect. This allows turning the histogram statistics on and off between known points
and reading them later. If there are no mounted file systems at the time rhist s is issued, then the
statistics are still displayed.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
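The relationship between _NR_ and _NL_ can be checked mechanically. The following awk sketch is illustrative
only and is not part of mmpmon; it reads the -p form of an rhist s response, such as the one in the example
that follows, and reports, for each size range, whether _NR_ equals the sum of its _NL_ values (the file name
rhistsum.awk is arbitrary, and the field positions assume the keyword order shown in the table above):

# Illustrative sketch only: for each size range in the -p form of an
# rhist s response, compare the _NR_ count with the sum of the _NL_
# counts of its latency ranges. Only nonzero ranges appear in the output.
# Usage (assumed): awk -f rhistsum.awk output.file
$1 == "_rhist_" && $7 == "s" { report(); kind = ($NF == "r") ? "read" : "write" }
$1 == "_R_"                  { report(); lo = $2; hi = $3; nr = $5; sum = 0 }
$1 == "_L_"                  { sum += $5 }
END                          { report() }
function report() {
    if (nr != "")
        printf("%-5s size %s-%s: _NR_ %d, sum of _NL_ %d%s\n", kind, lo, hi,
               nr, sum, (nr + 0 == sum) ? "" : "  <-- mismatch")
    nr = ""
}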

Example of mmpmon rhist s request
This topic is an example of the rhist s request to display the request histogram facility statistics values
and the output that displays.
Assume that commandFile contains this line:

rhist s

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ s _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ r
_R_ 65536 131071 _NR_ 32640
_L_ 0.0 1.0 _NL_ 25684
_L_ 1.1 10.0 _NL_ 4826
_L_ 10.1 30.0 _NL_ 1666
_L_ 30.1 100.0 _NL_ 464
_R_ 262144 524287 _NR_ 8160
_L_ 0.0 1.0 _NL_ 5218
_L_ 1.1 10.0 _NL_ 871
_L_ 10.1 30.0 _NL_ 1863
_L_ 30.1 100.0 _NL_ 208
_R_ 1048576 2097151 _NR_ 2040
_L_ 1.1 10.0 _NL_ 558
_L_ 10.1 30.0 _NL_ 809
_L_ 30.1 100.0 _NL_ 673
_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ s _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ w
_R_ 131072 262143 _NR_ 12240
_L_ 0.0 1.0 _NL_ 10022
_L_ 1.1 10.0 _NL_ 1227
_L_ 10.1 30.0 _NL_ 783
_L_ 30.1 100.0 _NL_ 208
_R_ 262144 524287 _NR_ 6120
_L_ 0.0 1.0 _NL_ 4419
_L_ 1.1 10.0 _NL_ 791
_L_ 10.1 30.0 _NL_ 733
_L_ 30.1 100.0 _NL_ 177
_R_ 524288 1048575 _NR_ 3060
_L_ 0.0 1.0 _NL_ 1589
_L_ 1.1 10.0 _NL_ 581
_L_ 10.1 30.0 _NL_ 664
_L_ 30.1 100.0 _NL_ 226
_R_ 2097152 4194303 _NR_ 762
_L_ 1.1 2.0 _NL_ 203
_L_ 10.1 30.0 _NL_ 393
_L_ 30.1 100.0 _NL_ 166
_end_

This small example shows that the reports for read and write may not present the same number of ranges
or even the same ranges. Only those ranges with non-zero counters are represented in the response. This
is true for both the request size ranges and the latency ranges within each request size range.
If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.2.5 name node1 rhist s OK timestamp 1066933849/93804 read
size range 65536 to 131071 count 32640
latency range 0.0 to 1.0 count 25684
latency range 1.1 to 10.0 count 4826
latency range 10.1 to 30.0 count 1666
latency range 30.1 to 100.0 count 464
size range 262144 to 524287 count 8160
latency range 0.0 to 1.0 count 5218
latency range 1.1 to 10.0 count 871
latency range 10.1 to 30.0 count 1863
latency range 30.1 to 100.0 count 208
size range 1048576 to 2097151 count 2040
latency range 1.1 to 10.0 count 558
latency range 10.1 to 30.0 count 809
latency range 30.1 to 100.0 count 673
mmpmon node 199.18.2.5 name node1 rhist s OK timestamp 1066933849/93968 write
size range 131072 to 262143 count 12240
latency range 0.0 to 1.0 count 10022
latency range 1.1 to 10.0 count 1227
latency range 10.1 to 30.0 count 783
latency range 30.1 to 100.0 count 208
size range 262144 to 524287 count 6120
latency range 0.0 to 1.0 count 4419
latency range 1.1 to 10.0 count 791
latency range 10.1 to 30.0 count 733
latency range 30.1 to 100.0 count 177
size range 524288 to 1048575 count 3060
latency range 0.0 to 1.0 count 1589
latency range 1.1 to 10.0 count 581
latency range 10.1 to 30.0 count 664
latency range 30.1 to 100.0 count 226
size range 2097152 to 4194303 count 762
latency range 1.1 to 2.0 count 203
latency range 10.1 to 30.0 count 393
latency range 30.1 to 100.0 count 166

If the facility has never been enabled, then the _rc_ value is non-zero:

_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 1 _t_ 1066939143 _tu_ 148443

If the -p flag is not specified, then the output is similar to:

mmpmon node 199.18.1.8 name node1 rhist s status 1 not yet enabled

An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.

Understanding the Remote Procedure Call (RPC) facility


The mmpmon requests that start with rpc_s display an aggregation of the execution time taken by RPCs
for a time unit, for example, the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Table 26 on page 88 describes the rpc_s requests:

Table 26. rpc_s requests for the mmpmon command
Request Description
rpc_s “Displaying the aggregation of execution time for Remote Procedure Calls (RPCs)” on
page 89
rpc_s size “Displaying the Remote Procedure Call (RPC) execution time according to the size of
messages” on page 91

The information displayed with rpc_s is similar to what is displayed with the mmdiag --rpc command.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Displaying the aggregation of execution time for Remote Procedure Calls (RPCs)
The rpc_s request returns the aggregation of execution time for RPCs.
Table 27 on page 89 describes the keywords for the rpc_s response, in the order that they appear in the
output.

Table 27. Keywords and values for the mmpmon rpc_s response
Keyword Description
_req_ Indicates the action requested. The action can be size, node, or message. If
no action is requested, the default is the rpc_s action.
_n_ Indicates the IP address of the node responding. This is the address by which GPFS
knows the node.
_nn_ Indicates the hostname that corresponds to the IP address (the _n_ value).
_rn_ Indicates the IP address of the remote node responding. This is the address by which
GPFS knows the node. The statistics displayed are the averages from _nn_ to this
_rnn_.
_rnn_ Indicates the hostname that corresponds to the remote node IP address (the _rn_
value). The statistics displayed are the averages from _nn_ to this _rnn_.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Indicates the microseconds part of the current time of day.
_rpcObj_ Indicates the beginning of the statistics for _obj_.
_obj_ Indicates the RPC object being displayed.
_nsecs_ Indicates the number of one-second intervals maintained.
_nmins_ Indicates the number of one-minute intervals maintained.
_nhours_ Indicates the number of one-hour intervals maintained.
_ndays_ Indicates the number of one-day intervals maintained.
_stats_ Indicates the beginning of the RPC statistics.
_tmu_ Indicates the time unit (seconds, minutes, hours, or days).
_av_ Indicates the average value of execution time for _cnt_ RPCs during this time unit.
_min_ Indicates the minimum value of execution time for _cnt_ RPCs during this time unit.
_max_ Indicates the maximum value of execution time for _cnt_ RPCs during this time unit.
_cnt_ Indicates the count of RPCs that occurred during this time unit.

The values allowed for _rpcObj_ are the following:


• AG_STAT_CHANNEL_WAIT
• AG_STAT_SEND_TIME_TCP
• AG_STAT_SEND_TIME_VERBS
• AG_STAT_RECEIVE_TIME_TCP
• AG_STAT_RPC_LATENCY_TCP
• AG_STAT_RPC_LATENCY_VERBS
• AG_STAT_RPC_LATENCY_MIXED
• AG_STAT_LAST
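The keyword form of the response can be post-processed with ordinary text tools. The following awk sketch is
illustrative only and is not part of mmpmon; it scans the -p form of an rpc_s response, aggregates across all
responding node pairs (a simplification of this sketch), and reports the largest _max_ value seen in the
one-second intervals of each RPC object (the file name rpcpeak.awk is arbitrary, and the field positions
assume the keyword order shown in the table above):

# Illustrative sketch only: report, per RPC object, the largest _max_
# execution time found among the one-second intervals of a -p rpc_s
# response. Trailing commas in the -p output are stripped first.
# Usage (assumed): awk -f rpcpeak.awk output.file
$1 == "_rpcObj_"               { obj = $3 }
$1 == "_stats_" && $3 == "sec" {
    gsub(/,/, "")                        # strip trailing commas from values
    if ($9 + 0 > peak[obj] + 0) peak[obj] = $9 + 0
}
END {
    for (o in peak)
        printf("%-28s peak per-second _max_: %.3f\n", o, peak[o])
}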

Example of mmpmon rpc_s request


This topic is an example of the rpc_s request to display the aggregation of execution time for remote
procedure calls (RPCs).
Assume that the file commandFile contains the following line:

rpc_s

The following command is issued:

mmpmon -p -i commandFile

The output is similar to the following example:

_response_ begin mmpmon rpc_s
_mmpmon::rpc_s_ _req_ node _n_ 192.168.56.168 _nn_ node3 _rn_ 192.168.56.167 _rnn_ node2 _rc_
0 _t_ 1388417709 _tu_ 641530
_rpcObj_ _obj_ AG_STAT_CHANNEL_WAIT _nsecs_ 60 _nmins_ 60 _nhours_ 24 _ndays_ 30
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
…............
…............
…............
_rpcObj_ _obj_ AG_STAT_SEND_TIME_TCP _nsecs_ 60 _nmins_ 60 _nhours_ 24 _ndays_ 30
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
…..........................
…..........................
…..........................
_response_ end

If the -p flag is not specified, the output is similar to the following example:

Object: AG_STAT_CHANNEL_WAIT
nsecs: 60
nmins: 60
nhours: 24
ndays: 30
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.00

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Displaying the Remote Procedure Call (RPC) execution time according to the
size of messages
The rpc_s size request returns the cached RPC-related size statistics.
Table 28 on page 91 describes the keywords for the rpc_s size response, in the order that they
appear in the output.

Table 28. Keywords and values for the mmpmon rpc_s size response
Keyword Description
_req_ Indicates the action requested. In this case, the value is rpc_s size.
_n_ Indicates the IP address of the node responding. This is the address by which GPFS
knows the node.
_nn_ Indicates the hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Indicates the microseconds part of the current time of day.
_rpcSize_ Indicates the beginning of the statistics for this _size_ group.
_size_ Indicates the size of the messages for which statistics are collected.
_nsecs_ Indicates the number of one-second intervals maintained.
_nmins_ Indicates the number of one-minute intervals maintained.
_nhours_ Indicates the number of one-hour intervals maintained.
_ndays_ Indicates the number of one-day intervals maintained.
_stats_ Indicates the beginning of the RPC-size statistics.
_tmu_ Indicates the time unit.
_av_ Indicates the average value of execution time for _cnt_ RPCs during this time unit.
_min_ Indicates the minimum value of execution time for _cnt_ RPCs during this time unit.
_max_ Indicates the maximum value of execution time for _cnt_ RPCs during this time unit.
_cnt_ Indicates the count of RPCs that occurred during this time unit.

Example of mmpmon rpc_s size request


This topic is an example of the rpc_s size request to display the RPC execution time according to the
size of messages.
Assume that the file commandFile contains the following line:

rpc_s size

The following command is issued:

mmpmon -p -i commandFile

The output is similar to the following example:

_mmpmon::rpc_s_ _req_ size _n_ 192.168.56.167 _nn_ node2 _rc_ 0 _t_ 1388417852 _tu_ 572950
_rpcSize_ _size_ 64 _nsecs_ 60 _nmins_ 60 _nhours_ 24 _ndays_ 30
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
…...................
…...................
…...................
_rpcSize_ _size_ 256 _nsecs_ 60 _nmins_ 60 _nhours_ 24 _ndays_ 30
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
…..................
…..................
_stats_ _tmu_ min _av_ 0.692, _min_ 0.692, _max_ 0.692, _cnt_ 1
_stats_ _tmu_ min _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ min _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ min _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_response_ end

If the -p flag is not specified, the output is similar to the following example:

Bucket size: 64
nsecs: 60
nmins: 60
nhours: 24
ndays: 30
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

TimeUnit: sec
AverageValue: 0.131
MinValue: 0.131
MaxValue: 0.131
Countvalue: 1

TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Displaying mmpmon version


The ver request returns a string containing version information.
Table 29 on page 93 describes the keywords for the ver (version) response, in the order that they
appear in the output. These keywords are used only when mmpmon is invoked with the -p flag.

Table 29. Keywords and values for the mmpmon ver response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_v_ The version of mmpmon.
_lv_ The level of mmpmon.
_vt_ The fix level variant of mmpmon.

Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of the execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

Example of mmpmon ver request


This topic is an example of the ver request to display the mmpmon version and the output that displays.
Assume that commandFile contains this line:

ver

and this command is issued:

mmpmon -p -i commandFile

The output is similar to this:

_ver_ _n_ 199.18.1.8 _nn_ node1 _v_ 3 _lv_ 3 _vt_ 0

If the -p flag is not specified, the output is similar to:

mmpmon node 199.18.1.8 name node1 version 3.3.0

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
The fs_io_s and io_s requests are used to determine a number of GPFS I/O parameters and their
implication for overall performance. The rhist requests are used to produce histogram data about I/O
sizes and latency times for I/O requests. The request source and prefix directive once allow the user of
mmpmon to more finely tune its operation.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of the execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Other information about mmpmon output
When interpreting the results from the mmpmon command output there are several points to consider.

fs_io_s and io_s output - how to aggregate and analyze the results
The fs_io_s and io_s requests can be used to determine a number of GPFS I/O parameters and their
implication for overall performance.
The output from the fs_io_s and io_s requests can be used to determine:
1. The I/O service rate of a node, from the application point of view. The io_s request presents this as a
sum for the entire node, while fs_io_s presents the data per file system. A rate can be approximated
by taking the difference of the _br_ (bytes read) or _bw_ (bytes written) values from two successive
invocations of fs_io_s (or io_s) and dividing it by the elapsed time between the samples, which is
derived from the _t_ and _tu_ values (seconds and microseconds).
This must be done for a number of samples, with a reasonably small time between samples, to get a rate
that is reasonably accurate. Because the information is sampled at a given interval, the result can be
inaccurate if the I/O load is not smooth over the sampling time.
For example, here is a set of samples taken approximately one second apart, when it was known that
continuous I/O activity was occurring:

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862476 _tu_ 634939 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 3737124864 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3570 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862477 _tu_ 645988 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 3869245440 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3696 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862478 _tu_ 647477 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4120903680 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3936 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862479 _tu_ 649363 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4309647360 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4116 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862480 _tu_ 650795 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4542431232 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4338 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862481 _tu_ 652515 _cl_ cluster1.ibm.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4743757824 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4530 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862482 _tu_ 654025 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4963958784 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4740 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862483 _tu_ 655782 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5177868288 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4944 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862484 _tu_ 657523 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5391777792 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 5148 _dir_ 0 _iu_ 5

_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862485 _tu_ 665909 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5599395840 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 5346 _dir_ 0 _iu_ 5

This simple awk script performs a basic rate calculation:

BEGIN {
    count = 0;
    prior_t = 0;
    prior_tu = 0;
    prior_br = 0;
    prior_bw = 0;
}

{
    count++;

    # Field positions in the _fs_io_s_ line: $9 is _t_ (seconds),
    # $11 is _tu_ (microseconds), $19 is _br_ (bytes read),
    # and $21 is _bw_ (bytes written).
    t = $9;
    tu = $11;
    br = $19;
    bw = $21;

    if (count > 1)
    {
        delta_t = t - prior_t;
        delta_tu = tu - prior_tu;
        delta_br = br - prior_br;
        delta_bw = bw - prior_bw;
        dt = delta_t + (delta_tu / 1000000.0);
        if (dt > 0) {
            rrate = (delta_br / dt) / 1000000.0;
            wrate = (delta_bw / dt) / 1000000.0;

            printf("%5.1f MB/sec read %5.1f MB/sec write\n",rrate,wrate);
        }
    }

    prior_t = t;
    prior_tu = tu;
    prior_br = br;
    prior_bw = bw;
}

The calculated service rates for each adjacent pair of samples is:

0.0 MB/sec read 130.7 MB/sec write
0.0 MB/sec read 251.3 MB/sec write
0.0 MB/sec read 188.4 MB/sec write
0.0 MB/sec read 232.5 MB/sec write
0.0 MB/sec read 201.0 MB/sec write
0.0 MB/sec read 219.9 MB/sec write
0.0 MB/sec read 213.5 MB/sec write
0.0 MB/sec read 213.5 MB/sec write
0.0 MB/sec read 205.9 MB/sec write

Since these are discrete samples, there can be variations in the individual results. For example, there
may be other activity on the node or interconnection fabric. I/O size, file system block size, and
buffering also affect results. There can be many reasons why adjacent values differ. This must be taken
into account when building analysis tools that read mmpmon output and interpreting results.
For example, suppose a file is read for the first time and gives results like this.

0.0 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write
92.1 MB/sec read 0.0 MB/sec write
89.0 MB/sec read 0.0 MB/sec write
92.1 MB/sec read 0.0 MB/sec write
90.0 MB/sec read 0.0 MB/sec write
96.3 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write

If most or all of the file remains in the GPFS cache, the second read may give quite different rates:

0.0 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write
235.5 MB/sec read 0.0 MB/sec write
287.8 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write

Considerations such as these need to be taken into account when looking at application I/O service
rates calculated from sampling mmpmon data.
2. Usage patterns, by sampling at set times of the day (perhaps every half hour) and noticing when the
largest changes in I/O volume occur. This does not necessarily give a rate (since there are too few
samples) but it can be used to detect peak usage periods.
3. Whether some nodes service significantly more I/O volume than other nodes over a given time span.
4. When a parallel application is split across several nodes, and is the only significant activity in the
nodes, how well the I/O activity of the application is distributed.
5. The total I/O demand that applications are placing on the cluster. This is done by obtaining results
from fs_io_s and io_s in aggregate for all nodes in a cluster, as shown in the sketch after this list.
6. The rate data may appear to be erratic. Consider this example:

0.0 MB/sec read 0.0 MB/sec write
6.1 MB/sec read 0.0 MB/sec write
92.1 MB/sec read 0.0 MB/sec write
89.0 MB/sec read 0.0 MB/sec write
12.6 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write
8.9 MB/sec read 0.0 MB/sec write
92.1 MB/sec read 0.0 MB/sec write
90.0 MB/sec read 0.0 MB/sec write
96.3 MB/sec read 0.0 MB/sec write
4.8 MB/sec read 0.0 MB/sec write
0.0 MB/sec read 0.0 MB/sec write

The low rates that appear before and after each group of higher rates can be due to the I/O requests
occurring late (in the leading sampling period) and ending early (in the trailing sampling period). This
gives an apparently low rate for those sampling periods.
The zero rates in the middle of the example could be caused by no I/O requests reaching GPFS during
that time period (the application issued none, or requests were satisfied by buffered data at a layer
on top of GPFS), by the node becoming busy with other work (causing the application to be
undispatched), or by other reasons.
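The aggregation mentioned in item 5 can be sketched as follows. This awk example is illustrative only and is
not part of mmpmon; it assumes that fs_io_s output in the -p format has been collected from all nodes (for
example, with the node list facility), keeps the most recent _br_ and _bw_ value for each node and file
system pair, and prints cluster-wide totals (the file name clustersum.awk is arbitrary, and the field layout
assumes the keyword order of the fs_io_s response):

# Illustrative sketch only: sum the most recent _br_ (bytes read) and _bw_
# (bytes written) counters per node and file system from -p fs_io_s output
# and print cluster-wide totals. The counters are cumulative, so the
# difference between two such totals gives the bytes moved in between.
# Usage (assumed): awk -f clustersum.awk output.file
$1 == "_fs_io_s_" {
    for (i = 2; i < NF; i += 2) kv[$i] = $(i + 1)
    key = kv["_nn_"] ":" kv["_fs_"]
    br[key] = kv["_br_"] + 0
    bw[key] = kv["_bw_"] + 0
}
END {
    for (k in br) { tbr += br[k]; tbw += bw[k]; pairs++ }
    printf("bytes read %d, bytes written %d across %d node:file system pairs\n",
           tbr, tbw, pairs)
}

Two such totals taken some seconds apart can then be differenced to estimate the cluster-wide data rate over
that interval.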
For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.

Request histogram (rhist) output - how to aggregate and analyze the results
The rhist requests are used to produce histogram data about I/O sizes and latency times for I/O
requests.
The output from the rhist requests can be used to determine:
1. The number of I/O requests in a given size range. The sizes may vary based on operating system,
explicit application buffering, and other considerations. This information can be used to help
determine how well an application or set of applications is buffering its I/O, for example, whether
there are many very small or many very large I/O transactions. A large number of overly small or overly
large I/O requests may not perform as well as an equivalent number of requests whose size is tuned to the
file system or operating system parameters.
2. The number of I/O requests in a size range that have a given latency time. Many factors can affect the
latency time, including but not limited to: system load, interconnection fabric load, file system block
size, disk block size, disk hardware characteristics, and the operating system on which the I/O request
is issued. A sketch that summarizes these latency counts follows this list.
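The following awk sketch is illustrative only and is not part of mmpmon; it reads the -p form of an rhist s
response and reports, for reads and for writes, what fraction of all requests fell into latency ranges that
begin at or above a threshold (30.1 milliseconds here, an arbitrary choice for this sketch; the file name
slowfrac.awk is arbitrary):

# Illustrative sketch only: from the -p form of an rhist s response, report
# the fraction of requests whose latency range starts at or above a
# threshold, summed over all size ranges, separately for reads and writes.
# Usage (assumed): awk -f slowfrac.awk output.file
BEGIN { thresh = 30.1 }
$1 == "_rhist_" && $7 == "s" { kind = ($NF == "r") ? "read" : "write" }
$1 == "_L_" {
    total[kind] += $5
    if ($2 + 0 >= thresh) slow[kind] += $5
}
END {
    for (k in total)
        printf("%-5s: %d of %d requests (%.1f%%) in ranges of %.1f ms or more\n",
               k, slow[k], total[k], 100 * slow[k] / total[k], thresh)
}

Because only ranges with nonzero counts are reported, the fractions are computed only from the ranges that
are present in the response.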
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Using request source and prefix directive once


The request source and prefix directive once allow mmpmon users to more finely tune their operations.
The source request causes mmpmon to read requests from a file and, when finished, to return to reading
requests from the input stream.
The prefix directive once can be placed in front of any mmpmon request. The once prefix indicates that the
request be run only once, irrespective of the setting of the -r flag on the mmpmon command. It is useful
for requests that do not need to be issued more than once, such as to set up the node list or turn on the
request histogram facility.
These rules apply when using the once prefix directive and source request:
1. once with nothing after it is an error that terminates mmpmon processing.
2. A file invoked with the source request may contain source requests, causing file nesting of arbitrary
depth. No check is done for loops in this situation.
3. The request once source filename causes the once prefix to be applied to all the mmpmon requests in
filename, including any source requests in the file.
4. If a filename specified with the source request cannot be opened for read, an error is returned and
mmpmon terminates.
5. If the -r flag on the mmpmon command has any value other than one, and all requests are prefixed with
once, mmpmon runs all the requests once, issues a message, and then terminates.

An example of once and source usage


This topic provides an example of the once and source requests and the output that displays.
This command is issued:

mmpmon -p -i command.file -r 0 -d 5000 | tee output.file

File command.file consists of this:

once source mmpmon.header
once rhist nr 512;1024;2048;4096 =
once rhist on
source mmpmon.commands

File mmpmon.header consists of this:

ver
reset

File mmpmon.commands consists of this:

fs_io_s
rhist s

The output.file is similar to this:

_ver_ _n_ 199.18.1.8 _nn_ node1 _v_ 2 _lv_ 4 _vt_ 0
_reset_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770129 _tu_ 511981
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ nr 512;1024;2048;4096 = _rc_ 0 _t_ 1129770131 _tu_
524674
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ on _rc_ 0 _t_ 1129770131 _tu_ 524921
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770131 _tu_ 525062 _cl_ node1.localdomain
_fs_ gpfs1 _d_ 1 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770131 _tu_ 525062 _cl_ node1.localdomain
_fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0 _iu_ 0
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770131 _tu_ 525220 _k_ r
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770131 _tu_ 525228 _k_ w
_end_
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770136 _tu_ 526685 _cl_ node1.localdomain
_fs_ gpfs1 _d_ 1 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770136 _tu_ 526685 _cl_ node1.localdomain
_fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 395018 _oc_ 504 _cc_ 252 _rdc_ 0 _wc_ 251 _dir_ 0 _iu_ 147
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770136 _tu_ 526888 _k_ r
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770136 _tu_ 526896 _k_ w
_R_ 0 512 _NR_ 169
_L_ 0.0 1.0 _NL_ 155
_L_ 1.1 10.0 _NL_ 7
_L_ 10.1 30.0 _NL_ 1
_L_ 30.1 100.0 _NL_ 4
_L_ 100.1 200.0 _NL_ 2
_R_ 513 1024 _NR_ 16
_L_ 0.0 1.0 _NL_ 15
_L_ 1.1 10.0 _NL_ 1
_R_ 1025 2048 _NR_ 3
_L_ 0.0 1.0 _NL_ 32
_R_ 2049 4096 _NR_ 18
_L_ 0.0 1.0 _NL_ 18
_R_ 4097 0 _NR_ 16
_L_ 0.0 1.0 _NL_ 16
_end_
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770141 _tu_ 528613 _cl_ node1.localdomain
_fs_ gpfs1 _d_ 1 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770141 _tu_ 528613 _cl_ node1.localdomain
_fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 823282 _oc_ 952 _cc_ 476 _rdc_ 0 _wc_ 474 _dir_ 0 _iu_ 459
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770141 _tu_ 528812 _k_ r
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770141 _tu_ 528820 _k_ w
_R_ 0 512 _NR_ 255
_L_ 0.0 1.0 _NL_ 241
_L_ 1.1 10.0 _NL_ 7
_L_ 10.1 30.0 _NL_ 1
_L_ 30.1 100.0 _NL_ 4
_L_ 100.1 200.0 _NL_ 2
_R_ 513 1024 _NR_ 36
_L_ 0.0 1.0 _NL_ 35
_L_ 1.1 10.0 _NL_ 1
_R_ 1025 2048 _NR_ 90
_L_ 0.0 1.0 _NL_ 90
_R_ 2049 4096 _NR_ 55
_L_ 0.0 1.0 _NL_ 55
_R_ 4097 0 _NR_ 38
_L_ 0.0 1.0 _NL_ 37
_L_ 1.1 10.0 _NL_ 1
_end_
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770146 _tu_ 530570 _cl_ node1.localdomain
_fs_ gpfs1 _d_ 1 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0 _iu_ 1
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1129770146 _tu_ 530570 _cl_ node1.localdomain
_fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 3069915 _oc_ 1830 _cc_ 914 _rdc_ 0 _wc_ 901 _dir_ 0 _iu_ 1070
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770146 _tu_ 530769 _k_ r
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ s _rc_ 0 _t_ 1129770146 _tu_ 530778 _k_ w
_R_ 0 512 _NR_ 526
_L_ 0.0 1.0 _NL_ 501
_L_ 1.1 10.0 _NL_ 14
_L_ 10.1 30.0 _NL_ 2
_L_ 30.1 100.0 _NL_ 6
_L_ 100.1 200.0 _NL_ 3
_R_ 513 1024 _NR_ 74
_L_ 0.0 1.0 _NL_ 70
_L_ 1.1 10.0 _NL_ 4
_R_ 1025 2048 _NR_ 123
_L_ 0.0 1.0 _NL_ 117
_L_ 1.1 10.0 _NL_ 6
_R_ 2049 4096 _NR_ 91
_L_ 0.0 1.0 _NL_ 84
_L_ 1.1 10.0 _NL_ 7
_R_ 4097 0 _NR_ 87
_L_ 0.0 1.0 _NL_ 81
_L_ 1.1 10.0 _NL_ 6
_end_
.............. and so forth ......................

If this command is issued with the same file contents:

mmpmon -i command.file -r 0 -d 5000 | tee output.file.english

The file output.file.english is similar to this:

mmpmon node 199.18.1.8 name node1 version 3.1.0
mmpmon node 199.18.1.8 name node1 reset OK
mmpmon node 199.18.1.8 name node1 rhist nr 512;1024;2048;4096 = OK
mmpmon node 199.18.1.8 name node1 rhist on OK
mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1129770175/950895
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1129770175/950895
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0
mmpmon node 199.18.1.8 name node1 rhist s OK read timestamp 1129770175/951117
mmpmon node 199.18.1.8 name node1 rhist s OK write timestamp 1129770175/951125
mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1129770180/952462
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1129770180/952462
bytes read: 0
bytes written: 491310
opens: 659
closes: 329
reads: 0
writes: 327
readdir: 0
inode updates: 74
mmpmon node 199.18.1.8 name node1 rhist s OK read timestamp 1129770180/952711
mmpmon node 199.18.1.8 name node1 rhist s OK write timestamp 1129770180/952720
size range 0 to 512 count 214
latency range 0.0 to 1.0 count 187
latency range 1.1 to 10.0 count 15
latency range 10.1 to 30.0 count 6
latency range 30.1 to 100.0 count 5
latency range 100.1 to 200.0 count 1
size range 513 to 1024 count 27
latency range 0.0 to 1.0 count 26
latency range 100.1 to 200.0 count 1
size range 1025 to 2048 count 32
latency range 0.0 to 1.0 count 29
latency range 1.1 to 10.0 count 1
latency range 30.1 to 100.0 count 2
size range 2049 to 4096 count 31
latency range 0.0 to 1.0 count 30
latency range 30.1 to 100.0 count 1
size range 4097 to 0 count 23
latency range 0.0 to 1.0 count 23
mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1129770185/954401
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1129770185/954401
bytes read: 0
bytes written: 1641935
opens: 1062
closes: 531
reads: 0
writes: 529
readdir: 0
inode updates: 523
mmpmon node 199.18.1.8 name node1 rhist s OK read timestamp 1129770185/954658
mmpmon node 199.18.1.8 name node1 rhist s OK write timestamp 1129770185/954667
size range 0 to 512 count 305
latency range 0.0 to 1.0 count 270
latency range 1.1 to 10.0 count 21
latency range 10.1 to 30.0 count 6
latency range 30.1 to 100.0 count 6
latency range 100.1 to 200.0 count 2
size range 513 to 1024 count 39
latency range 0.0 to 1.0 count 36
latency range 1.1 to 10.0 count 1
latency range 30.1 to 100.0 count 1
latency range 100.1 to 200.0 count 1
size range 1025 to 2048 count 89
latency range 0.0 to 1.0 count 84
latency range 1.1 to 10.0 count 2
latency range 30.1 to 100.0 count 3
size range 2049 to 4096 count 56
latency range 0.0 to 1.0 count 54
latency range 1.1 to 10.0 count 1
latency range 30.1 to 100.0 count 1
size range 4097 to 0 count 40
latency range 0.0 to 1.0 count 39
latency range 1.1 to 10.0 count 1
mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1129770190/956480
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.8 name node1 fs_io_s OK


cluster: node1.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1129770190/956480

bytes read: 0
bytes written: 3357414
opens: 1940
closes: 969
reads: 0
writes: 952
readdir: 0
inode updates: 1101
mmpmon node 199.18.1.8 name node1 rhist s OK read timestamp 1129770190/956723
mmpmon node 199.18.1.8 name node1 rhist s OK write timestamp 1129770190/956732
size range 0 to 512 count 539
latency range 0.0 to 1.0 count 494
latency range 1.1 to 10.0 count 29
latency range 10.1 to 30.0 count 6
latency range 30.1 to 100.0 count 8
latency range 100.1 to 200.0 count 2
size range 513 to 1024 count 85
latency range 0.0 to 1.0 count 81
latency range 1.1 to 10.0 count 2
latency range 30.1 to 100.0 count 1
latency range 100.1 to 200.0 count 1
size range 1025 to 2048 count 133
latency range 0.0 to 1.0 count 124
latency range 1.1 to 10.0 count 5
latency range 10.1 to 30.0 count 1
latency range 30.1 to 100.0 count 3
size range 2049 to 4096 count 99
latency range 0.0 to 1.0 count 91
latency range 1.1 to 10.0 count 6
latency range 10.1 to 30.0 count 1
latency range 30.1 to 100.0 count 1
size range 4097 to 0 count 95
latency range 0.0 to 1.0 count 90
latency range 1.1 to 10.0 count 4
latency range 10.1 to 30.0 count 1
mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1129770195/958310
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0

mmpmon node 199.18.1.8 name node1 fs_io_s OK


cluster: node1.localdomain
filesystem: gpfs2
disks: 2
timestamp: 1129770195/958310
bytes read: 0
bytes written: 3428107
opens: 2046
closes: 1023
reads: 0
writes: 997
readdir: 0
inode updates: 1321
mmpmon node 199.18.1.8 name node1 rhist s OK read timestamp 1129770195/958568
mmpmon node 199.18.1.8 name node1 rhist s OK write timestamp 1129770195/958577
size range 0 to 512 count 555
latency range 0.0 to 1.0 count 509
latency range 1.1 to 10.0 count 30
latency range 10.1 to 30.0 count 6
latency range 30.1 to 100.0 count 8
latency range 100.1 to 200.0 count 2
size range 513 to 1024 count 96
latency range 0.0 to 1.0 count 92
latency range 1.1 to 10.0 count 2
latency range 30.1 to 100.0 count 1
latency range 100.1 to 200.0 count 1
size range 1025 to 2048 count 143
latency range 0.0 to 1.0 count 134
latency range 1.1 to 10.0 count 5
latency range 10.1 to 30.0 count 1
latency range 30.1 to 100.0 count 3
size range 2049 to 4096 count 103
latency range 0.0 to 1.0 count 95

latency range 1.1 to 10.0 count 6
latency range 10.1 to 30.0 count 1
latency range 30.1 to 100.0 count 1
size range 4097 to 0 count 100
latency range 0.0 to 1.0 count 95
latency range 1.1 to 10.0 count 4
latency range 10.1 to 30.0 count 1
.............. and so forth ......................

For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.

Other information about mmpmon output


When interpreting the results from the mmpmon command output there are several points to consider.
Consider these important points:
• On a node acting as a server of a GPFS file system to NFS clients, NFS I/O is accounted for in the
statistics. However, the I/O is that which goes between GPFS and NFS. If NFS caches data, in order to
achieve better performance, this activity is not recorded.
• I/O requests made at the application level may not be exactly what is reflected to GPFS. This is
dependent on the operating system and other factors. For example, an application read of 100 bytes
may result in obtaining, and caching, a 1 MB block of data at a code level on top of GPFS (such as the
libc I/O layer). Subsequent reads within this block result in no additional requests to GPFS.
• The counters kept by mmpmon are not atomic and may not be exact in cases of high parallelism or heavy
system load. This design minimizes the performance impact associated with gathering statistical data.
• Reads from data cached by GPFS are reflected in statistics and histogram data. Reads and writes to data
cached in software layers on top of GPFS are reflected in statistics and histogram data when those
layers actually call GPFS for I/O.
• Activity from snapshots affects statistics. I/O activity necessary to maintain a snapshot is counted in the
file system statistics.
• Some (generally minor) amount of activity in the root directory of a file system is reflected in the
statistics of the file system manager node, and not the node which is running the activity.
• The open count also includes creat() call counts.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s displays an aggregation of execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node

The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.

Counter sizes and counter wrapping


The mmpmon command may be run continuously for extended periods of time. The user must be aware
that counters may wrap.
This information applies to the counters involved:
• The statistical counters used for the io_s and fs_io_s requests are maintained by GPFS at all times,
even when mmpmon has not been invoked. It is suggested that you use the reset request prior to
starting a sequence of io_s or fs_io_s requests.
• The bytes read and bytes written counters are unsigned 64-bit integers. They are used in the fs_io_s
and io_s requests, as the _br_ and _bw_ fields.
• The counters associated with the rhist requests are updated only when the request histogram facility
has been enabled.
• The counters used in the rhist requests are unsigned 64-bit integers.
• All other counters are unsigned 32-bit integers.
For more information, see “fs_io_s and io_s output - how to aggregate and analyze the results” on page
95 and “Request histogram (rhist) output - how to aggregate and analyze the results” on page 97.
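As an illustration of the reset suggestion above, a minimal mmpmon input file might reset the counters
before sampling the per-file-system and node-wide statistics. The requests below are the ones described
earlier in this chapter; pass the file to mmpmon with the -i option:

reset
fs_io_s
io_s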

Return codes from mmpmon


This topic provides the mmpmon return codes and explanations for the codes.
These are the return codes that can appear in the _rc_ field:
0
Successful completion.
1
One of these has occurred:
1. For the fs_io_s request, no file systems are mounted.
2. For an rhist request, a request was issued that requires the request histogram facility to be
enabled, but it is not. The facility is not enabled if:
• Since the last mmstartup was issued, rhist on was never issued.
• rhist nr was issued and rhist on was not issued afterwards.
2
For one of the nlist requests, the node name is not recognized.
13
For one of the nlist requests, the node name is a remote node, which is not allowed.
16
For one of the rhist requests, the histogram operations lock is busy. Retry the request.
17
For one of the nlist requests, the node name is already in the node list.

22
For one of the rhist requests, the size or latency range parameters were not in ascending order or
were otherwise incorrect.
233
For one of the nlist requests, the specified node is not joined to the cluster.
668
For one of the nlist requests, quorum has been lost in the cluster.
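The following sketch is one way to watch the _rc_ values in machine-readable output. It assumes that
mmpmon is invoked with the -p (machine-readable) option and an input file such as the command.file
used earlier in this chapter; the -r and -d values are illustrative only:

mmpmon -p -i command.file -r 1 -d 1000 | awk '{ for (i = 1; i < NF; i++) if ($i == "_rc_") print $1, $(i+1) }'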

Using the performance monitoring tool


The performance monitoring tool collects metrics from GPFS and protocols and provides system
performance information. By default, the performance monitoring tool is enabled, and it consists of
Collectors, Sensors, and Proxies.

Collector
In the older versions of IBM Storage Scale, the performance monitoring tool was configured only with
a single collector, which supported up to 150 sensor nodes. The performance monitoring tool can be
configured with multiple collectors to increase scalability and fault-tolerance, and this configuration is
referred to as multi-collector federation.
In a multi-collector federated configuration, the collectors need to be aware of each other. Otherwise,
a collector returns only the data that is stored in its own measurement database. When the collectors
are aware of their peer collectors, they can collaborate with each other to collate measurement data for
a specific measurement query. All collectors that are part of the federation are specified in the peers
configuration option in the collector’s configuration file as shown in the following example:

peers = { host = "collector1.mydomain.com" port = "9085" },


{ host = "collector2.mydomain.com" port = "9085" }

The port number is the one specified by the federationport configuration option, typically set to 9085.
You can also list the current host so that the same configuration file can be used for all the collectors.
Note: A Linux operating system user is added to the host. This user ID, scalepm, is used by the
pmcollector to run the process in the context of the new user. However, the scalepm ID does not
have privilege to log in to the system.
When the peers are specified, any query for measurement data is directed to any of the collectors that
are listed in the peers section. The collector collects and assembles a response based on all relevant data
from all collectors. Hence, clients need to contact only a single collector instead of all of them to get all
the measurements available in the system.
To distribute the measurement data reported by sensors over multiple collectors, multiple collectors
might be specified when the sensors are configured.
If multiple collectors are specified, the sensors pick one to report their measurement data to. The sensors
use stable hashes to pick the collector such that the sensor-collector relationship does not change too
much if new collectors are added or if a collector is removed.
Additionally, sensors and collectors can be configured for high availability. In this setting, sensors report
their measurement data to more than one collector such that the failure of a single collector would
not lead to any data loss. For instance, if the collector redundancy is increased to two, every sensor
reports to two collectors. As a side-effect of increasing the redundancy to two, the bandwidth that
is used for reporting measurement data is duplicated. The collector redundancy must be configured
before the sensor configuration is stored in IBM Storage Scale by changing the colRedundancy option
in /opt/IBM/zimon/ZIMonSensors.cfg.
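For illustration, a minimal sketch of how this option might appear in the sensor configuration (the
remaining entries of the file are omitted here):

colRedundancy = 2

With this setting, every sensor reports each metric to two collectors.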

Sensor
A sensor is a component that collects performance data from a node. Typically, multiple sensors run on
any node that is needed to collect metrics. By default, the sensors are started on every node.

Sensors identify the collector from the information present in the sensor configuration. The sensor
configuration is managed by IBM Storage Scale and can be retrieved and changed by using the
mmperfmon command. A copy is stored in /opt/IBM/zimon/ZIMonSensors.cfg. However, this copy
must not be edited by users.

Proxy
A proxy is run for each of the protocols to collect the metrics for that protocol.
By default, the NFS and SMB proxies are started automatically with those protocols. They do not need
to be started or stopped. However, to retrieve metrics for SMB, NFS, or Object, these protocols must be
active on the specific node.
For information, see the “Enabling protocol metrics” on page 152 topic.
For information on enabling Transparent cloud tiering metrics, see Integrating Transparent Cloud Tiering
metrics with performance monitoring tool in IBM Storage Scale: Administration Guide.
Important: When the performance monitoring tool is used, ensure that the clocks of all of the nodes in
the cluster are synchronized. The Network Time Protocol (NTP) must be configured on all nodes.
Note: The performance monitoring information that is collected by the IBM Storage Scale internal
monitoring tool and the I/O statistics that users collect by using the mmpmon command might affect each
other.
Related concepts
Network performance monitoring
Network performance can be monitored either by using Remote Procedure Call (RPC) statistics or it can
be monitored by using the IBM Storage Scale graphical user interface (GUI).
Monitoring I/O performance with the mmpmon command
Use the mmpmon command to monitor the I/O performance of IBM Storage Scale on the node on which it
is run and on other specified nodes.
Viewing and analyzing the performance data
The performance monitoring tool displays the performance metrics that are associated with GPFS and
the associated protocols. It helps you get a graphical representation of the status and trends of the key
performance indicators, and analyze IBM Storage Scale performance problems.

Configuring the performance monitoring tool


This topic describes how to configure the performance monitoring tool in IBM Storage Scale.
The performance monitoring tool, collector, sensors, and proxies are a part of the IBM Storage Scale
distribution as detailed here:
• The "mmperfmon" tool and the performance monitoring proxies are a part of the IBM Storage Scale
core package "gpfs.base", which is installed on all the nodes.
• The sensors package "gpfs.gss.pmsensors" must be also installed on all the nodes with IBM
Storage Scale if the performance monitoring is expected to provide information from these nodes.
• The collector package "gpfs.gss.pmcollector" must be installed on all the nodes, which are
expected to host the performance monitoring database. At least one node per 150 cluster nodes is
recommended. Furthermore, it is recommended to not use NSD server nodes or quorum nodes for the
installation. Using nodes with the IBM Storage Scale GUI makes most sense, since the IBM Storage
Scale GUI queries the collector for its charts.
The performance monitoring tool uses the GPFS™ cluster daemon node names and network to
communicate between nodes.
Note: The tool is supported on Linux nodes only.
For information on the usage of ports for the performance monitoring tool, see the Firewall
recommendations for Performance Monitoring tool in IBM Storage Scale: Administration Guide.

The performance monitoring has a security feature to limit the connections of performance monitoring
sensors to a list of known IBM Storage Scale node IPs. However, a user can modify the list of allowed IBM
Storage Scale node IPs based on their requirements.
By default, this security feature is disabled, and can be enabled by modifying the ipfixiplist and
ipfixiplistrefresh attributes in the /opt/IBM/zimon/ZIMonCollector.cfg configuration file.
For more information, see the Configuring the collector in the IBM Storage Scale: Problem Determination
Guide.

Configuring the sensor


Performance monitoring sensors can either be managed manually as individual files on each node or
managed automatically by IBM Storage Scale. Only IBM Storage Scale-managed sensor configuration is
actively supported.
Note: The GPFSFilesetQuota, GPFSFileset, GPFSPool, and GPFSDiskCap sensors are designed to
run on a single node in the cluster. This node requires all the file systems to be mounted. A GUI node
should be the preferred node for these single-node sensors. The system health daemon can be used to
automatically select this single node.

Identifying the type of configuration in use


If the performance monitoring infrastructure was installed previously, you might need to identify the type
of configuration the system is currently using.
If the sensor configuration is managed automatically, the configuration is stored within IBM Storage Scale
and can be viewed with the mmperfmon config show command. The set of nodes where this
configuration is enabled can be identified through the mmlscluster command.
Those nodes where performance monitoring metrics collection is enabled are marked with the perfmon
designation as shown in the following sample:
prompt# mmlscluster

GPFS cluster information


========================
GPFS cluster name: s1.zimon.zc2.ibm.com
GPFS cluster id: 13860500485217864948
GPFS UID domain: s1.zimon.zc2.ibm.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
Node Daemon node name IP address Admin node name Designation
---------------------------------------------------------------------------------
1 s1.zimon.zc2.ibm.com 9.4.134.196 s1.zimon.zc2.ibm.com quorum-perfmon
2 s2.zimon.zc2.ibm.com 9.4.134.197 s2.zimon.zc2.ibm.com quorum-perfmon
3 s3.zimon.zc2.ibm.com 9.4.134.198 s3.zimon.zc2.ibm.com quorum-perfmon
4 s4.zimon.zc2.ibm.com 9.4.134.199 s4.zimon.zc2.ibm.com quorum-perfmon
5 s5.zimon.zc2.ibm.com 9.4.134.2 s5.zimon.zc2.ibm.com quorum-perfmon

If mmperfmon config show does not show any configuration and no nodes are designated perfmon,
the configuration can be managed manually.
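For example, both checks can be run from a node with administrative access; the grep filter is only a
convenience:

mmperfmon config show
mmlscluster | grep perfmon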

Automated configuration
In the performance monitoring tool, sensors can be configured on nodes that are part of an IBM Storage
Scale cluster through an IBM Storage Scale based configuration mechanism. However, this requires the
execution of the mmchconfig release=LATEST command.
The automated configuration method allows the sensor configuration to be stored as part
of the IBM Storage Scale configuration. Automated configuration is available for the sensor
configuration files, /opt/IBM/zimon/ZIMonSensors.cfg, and partly for the collector configuration
files, /opt/IBM/zimon/ZIMonCollector.cfg. Only the peers section for federation is available for
the collector configuration files. In this setup, the /opt/IBM/zimon/ZIMonSensors.cfg configuration
file on each IBM Storage Scale node is maintained by IBM Storage Scale. As a result, the file must not be
edited manually because whenever IBM Storage Scale needs to update a configuration parameter, the file
is regenerated and any manual modifications are overwritten. Before using the automated configuration,

an initial configuration needs to be stored within IBM Storage Scale. You can store this initial configuration
by using the mmperfmon config generate command as shown:
prompt# mmperfmon config generate \
--collectors collector1.domain.com,collector2.domain.com,...
The mmperfmon config generate command uses a template configuration file for generating
the automated configuration. The default location for that template configuration is /opt/IBM/zimon/
defaults/ZIMonSensors.cfg.
The template configuration includes the initial settings for all the sensors and may be modified prior
to invoking the mmperfmon config generate command. This file also includes a parameter called
colCandidates. This parameter specifies the number of collectors that each sensor must report its data
to. This may be of interest for high-availability setups, where each metric must be sent to two collectors
in case one collector becomes unavailable. The colCandidates parameter is used to automatically
configure federation between the specified collectors. For more information, see “Configuring multiple
collectors” on page 116.
Once the configuration file is stored within IBM Storage Scale, it can be activated as follows:
prompt# mmchnode --perfmon -N nodeclass1,nodeclass2,…
Note: Any previously existing configuration file is overwritten. Configuration changes result in a new
version of the configuration file, which is then propagated through the IBM Storage Scale cluster at the file
level.
To deactivate the performance monitoring tool, the same command is used but with the --noperfmon
switch supplied instead. Configuration parameters can be changed with the following command where
parami is of the form sensorname.sensorattribute:
prompt# mmperfmon config update param1=value1 param2=value2 …
Sensors that collect per-cluster metrics, such as GPFSDiskCap, GPFSFilesetQuota, GPFSFileset, and
GPFSPool, must only run on a single node in the cluster for the following reasons:
1. They typically impose some overhead.
2. The data reported is the same, independent of the node the sensor is running on.
Other sensors, such as the cluster export services sensors, must also run only on a specific set of nodes.
The restrict function is intended for exactly these cases.
Some sensors, such as VFS, are not enabled by default even though they have associated predefined
queries with the mmperfmon query command. To enable VFS sensors, use the mmfsadm vfsstats
enable command on the node. To enable a sensor, set the period value to an integer greater than 0 and
restart the sensors on that node by using the systemctl restart pmsensors command.

Removing an automated configuration


When upgrading the performance monitoring tool, it is important to note how the previous version was
configured and if the configuration mechanism is to be changed. If the configuration mechanism is to
be changed, it is important to verify that the installed versions of both IBM Storage Scale and the
performance monitoring tool support the new configuration method. However, if you want to use the
manual configuration method, then take care of the following:
1. None of the nodes in the cluster must be designated as perfmon nodes. If any nodes in the cluster are
designated as perfmon nodes, then run the mmchnode --noperfmon -N all command.
2. Delete the centrally stored configuration information by issuing the mmperfmon config delete --all
command.
3. Manual configuration is no longer actively supported.
The /opt/IBM/zimon/ZIMonSensors.cfg file is then maintained manually. This mode is useful if
sensors are to be installed on non-IBM Storage Scale nodes or if you want to have a cluster with multiple
levels of IBM Storage Scale running.
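The two commands from the steps above, collected into a single sketch for convenience:

mmchnode --noperfmon -N all
mmperfmon config delete --all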

Manual configuration
Performance monitoring tools can also be configured manually by the user.
Important: The performance monitoring tool gets automatically configured. This configuration
automatically overrides any manual changes you try to make to the configuration. If you wish to change
an automated configuration to a manual one, then follow the steps given in Removing an automated
configuration in the Automated configuration section in the IBM Storage Scale: Administration Guide.
When configuring the performance monitoring tool manually, the installation toolkit sets up a default set
of sensors to monitor on each node. You can modify the sensors on each individual node.
The configuration file of the sensors, ZimonSensors.cfg, is located on each node in the /opt/IBM/
zimon folder. The file lists all groups of sensors in it. The configuration file includes the parameter setting
of the sensors, such as the reporting frequency, and controls the sensors that are active within the cluster.
The file also contains the host name of the node where the collector is running that the sensor must be
reporting to.
For example:

sensors =
{
name = "CPU"
period = 1
},
{ name = "Load"
period = 1
},
{
name = "Memory"
period = 1
},
{
name = "Network"
period = 1
filter = "eth*"

},
{
name = "Netstat"
period = 1
},

The period in the example specifies the interval size in number of seconds when a sensor group gathers
data. 0 means that the sensor group is disabled and 1 runs the sensor group every second. You can
specify a higher value to decrease the frequency at which the data is collected.
Whenever the configuration file is changed, you must stop and restart the pmsensor daemon by using the
following commands:
1. Issue the systemctl stop pmsensors command to stop (deactivate) the sensor.
2. Issue the systemctl start pmsensors command to restart (activate) the sensor.
Some sensors, such as the cluster export services sensors run on a specific set of nodes. Other sensors,
such as the GPFSDiskCap sensor must run on a single node in the cluster since the data reported is the
same, independent of the node the sensor is running on. For these types of sensors, the restrict function
is especially intended. For example, to restrict a NFSIO sensor to a node class and change the reporting
period to once every 10 hours, you can specify NFSIO.period=36000 NFSIO.restrict=nodeclass1
as attribute value pairs in the update command.
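For example, assuming an IBM Storage Scale-managed configuration, the complete command for the
attribute-value pairs above might look like the following; the node class name nodeclass1 is illustrative:

mmperfmon config update NFSIO.period=36000 NFSIO.restrict=nodeclass1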
Some sensors, such as VFS, are not enabled by default even though they have associated predefined
queries with the mmperfmon query command. This is because the collector itself might experience
performance issues if it is required to collect more than 1,000,000 metrics per second. To enable
VFS sensors, use the mmfsadm vfsstats enable command on the node. To enable a sensor, set the
period value to an integer greater than 0 and restart the sensors on that node by using the systemctl
restart pmsensors command.

You can set filters for the Network and DiskFree sensors that prevent specific disk or network interface
card names from being recorded. These filters have default values for new installs as shown:

{ filter = "netdev_name=veth.*"
name = "Network"
period = 1},

{filter = "mountPoint=/var/lib/docker.*"
name = "DiskFree"
period = 600}

You can update the filter using the mmperfmon config update command.
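As a sketch, an update of the default DiskFree filter shown above might look like the following; the exact
quoting can vary with the shell that is used:

mmperfmon config update DiskFree.filter="mountPoint=/var/lib/docker.*"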

Adding or removing a sensor from an existing automated configuration


The performance monitoring system can be configured manually or through an automated process. To
add a set of sensors for an automatic configuration, generate a file containing the sensors and the
configuration parameters to be used.
The following example shows the file /tmp/new-pmsensors.conf that is used to add the following
sensors:
• A new sensor NFSIO, which is not activated yet (period=0).
• Another sensor SMBStats, whose metrics are reported every second (period=1).
The restrict field is set to cesNodes so that these sensors only run on nodes from the cesNodes node
class:

sensors = {
name = "NFSIO"
period = 0
restrict = "cesNodes"
type = "Generic"
},
{
name = "SMBStats"
period = 1
restrict = "cesNodes"
type = "Generic"
}

Ensure that the sensors are added and listed as part of the performance monitoring configuration. Run the
following command to add the sensor to the configuration:

mmperfmon config add --sensors /tmp/new-pmsensors.conf

If any of the sensors mentioned in the file already exist, they are listed in the command output and
ignored, and the existing sensor configuration is kept. After a sensor is added to the configuration, its
settings can be updated by using the mmperfmon config update command.
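For example, the NFSIO sensor that was added in a deactivated state (period=0) might be activated later
with a command similar to the following; the period value is illustrative:

mmperfmon config update NFSIO.period=10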
Run the following command to delete a sensor from the configuration:

prompt# mmperfmon config delete --sensors Sensor[,Sensor...]

Note: There are two new sensors, GPFSPool and GPFSFileset for the pmsensor service. If an older
version of the IBM Storage Scale performance monitoring system is upgraded, these sensors are not
automatically enabled. This is because automatically enabling the sensors might cause the collectors to
consume more main memory than what was set aside for monitoring. Changing the memory footprint
of the collector database might cause issues for the users if the collectors are tightly configured.
For information on how to manually configure the performance monitoring system (file-managed
configuration), see the Manual configuration section in the IBM Storage Scale: Administration Guide.
Related reference
“List of performance metrics” on page 117

The performance monitoring tool can report the following metrics:

Automatic assignment of single node sensors


The GPFSFilesetQuota, GPFSFileset, GPFSPool, and GPFSDiskCap sensors must be restricted
to run only on a single node in the cluster. The system automatically selects an adequate node
within the cluster by assigning a restricted value, @CLUSTER_PERF_SENSOR, to the sensors' restrict
field in the cluster. For example, create a configuration by using the mmperfmon config update
GPFSDiskCap.restrict=@CLUSTER_PERF_SENSOR command.
The node for CLUSTER_PERF_SENSOR is selected automatically based on the following criteria:
• The node has the PERFMON designation.
• The PERFMON component of the node is HEALTHY.
• The GPFS component of the node is HEALTHY.
Note: You can use the mmhealth node show command to find out whether the PERFMON and GPFS
components of a node are in a HEALTHY state.
By default, this node is selected from all nodes in the cluster. A GUI node is always preferred. However,
if you want to restrict the pool of nodes from which the sensor node is chosen, then you can create a
node class, CLUSTER_PERF_SENSOR_CANDIDATES, by using the mmcrnodeclass command. After the
CLUSTER_PERF_SENSOR_CANDIDATES node class is created, only the nodes in this class are selected.
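For example, a candidate node class might be created as follows; the node names are placeholders only:

mmcrnodeclass CLUSTER_PERF_SENSOR_CANDIDATES -N gui1.example.com,gui2.example.com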
If the selected node is in the DEGRADED state, then the system automatically reconfigures the
CLUSTER_PERF_SENSOR to another node that is in the HEALTHY state and triggers a restart of
performance monitoring service on the previous and currently selected nodes.
A user can view the currently selected node in the cluster by using the mmccr vget
CLUSTER_PERF_SENSOR command. If the mmhealth node eventlog command is run on the
DEGRADED and HEALTHY nodes, then it lists the singleton_sensor_off and singleton_sensor_on
events respectively.
If the automatic reconfiguration of the CLUSTER_PERF_SENSOR happens frequently, then the restart of
sensors is triggered more often than their configured period value. This can impact the system load and its
overall performance.
Note:
The GPFSDiskCap sensor is I/O intensive and it queries for the available file space on all GPFS file
systems. In case of large clusters with many file systems, if the GPFSDiskCap sensor is frequently
restarted, it can negatively impact the system performance. The GPFSDiskCap sensor can cause a similar
impact on the system performance as the mmdf command.
Therefore, it is recommended to use a dedicated node name, instead of using @CLUSTER_PERF_SENSOR
for any sensor, in the restrict field of a single node sensor until the node stabilizes in the HEALTHY
state.
For example, the GPFSDiskCap sensor is configured using @CLUSTER_PERF_SENSOR variable in the
restrict field as shown in the following configuration:
• name = GPFSDiskCap
• period = 86400
• restrict = @CLUSTER_PERF_SENSOR
If this node is frequently restarted, then it can impact the system performance and cause system load
issues. This issue can be avoided by using a dedicated node name as shown in the following configuration
by using the mmperfmon config update GPFSDiskCap.restrict=abc.cluster.node.com
command.
• name = GPFSDiskCap
• period = 86400
• restrict = abc.cluster.node.com

Note: If you manually configure the restrict field of the capacity sensors, then you must ensure that all
the file systems on the specified node are mounted. This is done so that the file system-related data, like
capacity, can be recorded.
A newly installed cluster has @CLUSTER_PERF_SENSOR as the default value in the restrict fields of the
GPFSFilesetQuota, GPFSFileset, GPFSPool, and GPFSDiskCap sensors.
An updated cluster, which was installed before IBM Storage Scale 5.0.5, might not be configured to use
this feature automatically, and must be reconfigured by the administrator. You can use the mmperfmon
config update SensorName.restrict=@CLUSTER_PERF_SENSOR command, where SensorName
is the GPFSFilesetQuota, GPFSFileset, GPFSPool, and GPFSDiskCap sensors, to update the
configuration.
CAUTION: The sensor update works only when the cluster has a minReleaseLevel of 5.0.1-0 or
higher. If you have 4.2.3-x or 5.0.0-x nodes in your cluster, this feature does not work.

Enabling performance monitoring sensors by using GUI


You need to enable performance monitor sensors to capture the performance data and share it with the
performance monitoring collectors.
Perform the following steps to enable performance monitoring sensor:
1. Go to Services > Performance Monitoring page in the IBM Storage Scale GUI.
2. Click Sensors. The Sensors tab lists all the performance monitoring sensors that are available in the
cluster.
3. Select the sensor that needs to be enabled from the list of sensors and then click Edit. The Edit Sensor
window appears.
4. Click the radio button that is placed next to the Interval field to specify the data collection interval of
the sensor. Specifying the collection interval for a sensor also enables the sensor.
5. Specify the scope of the data collection in the Node scope field. You can select either all nodes or
node classes as the scope. If you select Node Class as the node scope, the GUI node classes appear,
and you need to select the node classes based on your requirement.
6. Click Save to save the changes made to the sensor configuration.

Impact of sensor configuration


Sensor configuration has an impact on pmcollector memory consumption.
When you enable a sensor, there is an impact on the collector memory and disk consumption and the
query processing time. The impact depends on the system configuration and the number of keys and
metrics that are provided by the sensor. For example, sensor NFSIO keys and metrics scales by the
number of protocol nodes (cesNodes) and the number of NFS exports.
A high number of NFS exports or CES nodes has a large impact on the pmcollector memory consumption.
In addition, the sensor scheduling frequency period setting has an impact on how many metric values are
sent to the collector. A lesser period value means that metric values are sent more often to the collector
that increases pmcollector memory consumption.
The following table lists and explains the impact of enabling certain sensors and how the impact scales.

Table 30. Sensors with high impact

Sensor name: NFSIO
Scaling factor: Number of cesNodes * number of NFS exports
Default period setting: 10
Comment: The key consists of node, sensor, export, and nfs_ver (NFSv3, NFSv4), with 12 metric
values per key. For more information, see NFS metrics.

Sensor name: GPFSDisk
Scaling factor: Number of NSD server nodes * number of disks
Default period setting: 0 (disabled)
Comment: The key consists of node, sensor, gpfs_cluster_name, gpfs_fs_name, and gpfs_disk_name,
with 16 metric values per key. For more information, see GPFS metrics.
Warning: Enable the sensor on demand only and restrict it to the nodes of interest.

Sensor name: GPFSAFMFSET
Scaling factor: Number of AFM gateway nodes * number of file sets
Default period setting: 10
Comment: The key consists of node, sensor, gpfs_fs_name, and gpfs_fset_name, with more than 130
metric values per key. For more information, see GPFS metrics.

Recommendations
• System entities that do not exist anymore leave their key signature and their historical data in the
collector database.
For example, if you delete a node, file system, file set, disk, and other resources from the cluster,
the key and data of the removed entity remains within the collector database. The same is valid if
you rename an entity (for example, rename a node). If your system entities are often short-lived
and intended for temporary usage only, such entities might cause a high impact on the pmcollector memory
consumption.
• You must check your system for such expired keys from time to time. Run the following commands to
list the expired keys first and then delete them:

mmperfmon query --list=expiredKeys

mmperfmon delete --expiredKeys

For more information about command usage, see mmperfmon command in the IBM Storage Scale:
Command and Programming Reference Guide.
Note: The historical data of the removed entities is not removed from the database automatically
with immediate effect because it is possible that the customer might need to query the
corresponding historical data later.

Configuring the initial sensor data poll delay for long periods
This section provides information on how to configure the initial sensor data poll delay for long periods.
Some performance monitoring sensors might invoke heavy-load tasks like executing the tsdf command.
The scheduled period of such sensors can be as high as once per hour or once per day, and the heavy-
load tasks can force the Performance Monitoring Tool (PMT) to change its configuration frequently. When
the PMT configuration changes frequently, the system triggers the sensors to restart at every instance.

Once the sensors restart, they might continue to run more often than the defined period value. This
causes system overload issues, which can tamper the overall system performance.
In order to mitigate the impact of system overload, the PMT supports a Boolean flag,
delay_initial_poll. This flag delays the initial invocation of sensors depending on its period value
and is set to true by default. The metrics of those sensors, where the delay_initial_poll is set
to true, experience a delay and are not immediately available after a sensor's restart. However, if the
delayed invocation of sensors is not preferred, the delay_initial_poll flag can be set to false.
If the delay_initial_poll flag is set to true, then it delays the very first poll of sensors as follows:
• If the period is greater than or equal to 24 hours, then there is a delay of three hours until the first poll of
the sensor is done.
• If the period is greater than or equal to 12 hours, then there is a delay of two hours until the first poll of
the sensor is done.
• If the period is greater than or equal to 3 hours, then there is a delay of one hour until the first poll of the
sensor is done.
• If the period is greater than or equal to 1 hour, then there is a delay of one-third of the period until the
first poll of the sensor is done.
Here, period refers to the configured time period for a sensor.

Configuring the collector


The following section describes how to configure the collector in a performance monitoring tool.
The most important configuration options are the domains and the peers configuration options. All other
configuration options are best left at their defaults and are explained within the default configuration file
that is shipped with ZIMON.
The configuration file of the collector, ZIMonCollector.cfg, is located in the /opt/IBM/zimon/
folder. The ZIMonCollector.cfg file is maintained automatically by the mmperfmon config command
and must not be changed manually. Manual changes might be overwritten by the next call of the
mmperfmon config command. The description below is for educational purposes and the values that
are shown might differ from the default values actually used on the system.
CAUTION: If you modify any domain parameters after the data collection has started, that might
cause loss of the historical data that is already collected.

Metric domain configuration


The domains configuration indicates the number of metrics to be collected and how long they must be
retained and in what granularity. Multiple domains might be specified. If data no longer fits into the
current domain, data is spilled over into the next domain and resampled.
A simple configuration is shown.

domains = {
# this is the raw domain, aggregation factor for the raw domain is always 0
aggregation = 0
duration = "12h"
filesize = "1g" # maximum file size
files = 5 # number of files.
},

{
# this is the second domain that aggregates to 60 seconds
aggregation = 60
duration = "4w"
filesize = "500m" # maximum file size
files = 4 # number of files.
},

{
# this is the third domain that aggregates to 30*60 seconds == 30 minutes
aggregation = 30
duration = "1y"
filesize = "500m" # maximum file size
files = 4 # number of files.
}

The configuration file lists several data domains. At least one domain must be present, and the first
domain represents the raw data collection as the data is collected by sensors. The aggregation parameter
for this first domain must be set to 0.
Each domain specifies the following parameters:
• The duration parameter indicates the time period until the collected metrics are pushed into the next
(coarser-grained) domain. If this option is left out, no limit on the duration is imposed. The units are
seconds, hours, days, weeks, months, and years { s, h, d, w, m, y }.
• The filesize and files parameter indicates how much space is allocated on disk for a specific
domain. When metrics are stored in memory, a persistence mechanism is in place that also stores
the metrics on disk in files of size filesize. Once the number of files is reached and a new file is
to be allocated, the oldest file is removed from the disk. The persistent storage must be at least as
large as the amount of main memory to be allocated for a domain. The size of the persistent storage is
significant because when the collector is restarted, the in-memory database is re-created from these
files.
If both the ram and the duration parameters are specified, both constraints are active at the same
time. As soon as one of the constraints is impacted, the collected metrics are pushed into the next
(coarser-grained) domain.
The aggregation value, which is used for the second and following domains, indicates the resampling to
be performed. When data is spilled into this domain, it is resampled to be no better than indicated by the
aggregation factor. The value for the second domain is in seconds, the value for domain n (n >2) is the
value of domain n-1 multiplied by the aggregation value of domain n.
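For example, with the sample configuration shown above, the second domain resamples data to a
60-second granularity, and the third domain resamples to 60 * 30 = 1800 seconds, that is, 30 minutes.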
The collector collects the metrics from the sensors. Depending on the number of nodes and metrics that
are collected, the collector requires a different amount of main memory to store the collected metrics in
the memory. For example, in a five-node cluster that reports only the load values (load1, load5, load15),
the collector maintains 15 metrics (three metrics times five nodes).
The collectors can be stopped (deactivated) by issuing the systemctl stop pmcollector command.
The collectors can be started (activated) by issuing the systemctl start pmcollector command.

IPv6 or dual stack configuration


A dual stack configuration has both IPv4 and IPv6 environments. In an IPv6 or dual stack environment,
the collector service interface needs to be configured so that it also accepts IPv6 request. This requires
you to manually change the /opt/IBM/zimon/ZIMonCollector.cfg file and restart the pmcollector
service on all the collectors in the cluster.
The ZIMON collector is deployed on the management GUI in a federated configuration. That is, the
performance monitor tool installation in a management GUI can involve multiple collectors. However,
in a dual stack configuration, a single name resolution must be defined for the multiple pmcollectors
within the cluster. All names and IP addresses of the peers must be updated in the /opt/IBM/zimon/
ZIMonCollector.cfg file for all the collector nodes. Individual pmcollectors cannot run on different IP
protocol versions. Moreover, to establish a successful connection between the GUI and the pmcollector
the name resolution must also be updated in the gpfsgui.properties. For example, you must assign
zimonAddress == localhost6 or ::1 for IPv6 and zimonAddress == localhost for IPv4 in
gpfsgui.properties. A sudo or root user can also change the order of the hosts in /etc/hosts to
establish the connection between GUI and the ZIMON collector.

Configuring multiple collectors
The performance monitoring tool installation can have a single collector or can consist of multiple
collectors to increase the scalability or the fault-tolerance of the performance monitoring system. This
latter configuration is referred to as federation. A single collector can easily support up to 150 sensor
nodes.
Note: For federation to work, all the collectors need to have the same version number.
In a multi-collector federated configuration, the collectors need to know about each other, else a collector
would return only the data that is stored in its own measurement database. After the collectors know
the peer collectors, they collaborate with each other to collect data for a given measurement query. All
collectors that are part of the federation are specified in the peers configuration option in the collector’s
configuration file as shown as:

peers = {
host = "collector1.mydomain.com"
port = "9085"
}, {
host = "collector2.mydomain.com"
port = "9085"
}

The port number is the one specified by the federationport configuration option, typically set to 9085.
It is acceptable to list the current host so that the same configuration file can be used for all the collector
machines.
After the peers are specified, a query for measurement data can be directed to any of the collectors listed
in the peers section. Also, the collectors collect and assemble a response that is based on all relevant
data from all collectors. Hence, clients only need to contact a single collector to get all the measurements
available in the system.
To distribute the measurement data reported by sensors over multiple collectors, multiple collectors
might be specified when automatically configuring the sensors, as shown in the following sample:
prompt# mmperfmon config generate \
--collectors collector1.domain.com,collector2.domain.com,…
If multiple collectors are specified, then the federation between these collectors is configured
automatically. The peers section in those collectors' configuration files, /opt/IBM/zimon/
ZIMonCollector.cfg, is also updated. The sensors pick one of the many collectors to report their
measurement data to. The sensors use stable hashes to pick the collector such that the sensor-collector
relationship does not change too much when new collectors are added or when a collector is removed.
Additionally, sensors and collectors can be configured for high availability. To maintain high availability,
each metric is sent to two collectors in case one collector becomes unavailable. In this setting, sensors
report their measurement data to more than one collector so that the failure of a single collector would
not lead to any data loss. For instance, if the collector redundancy is increased to two, then every sensor
reports to two collectors. As a side-effect of increasing the redundancy to two, the bandwidth that is used
for reporting measurement data is duplicated. The collector redundancy must be configured before the
sensor configuration is stored in GPFS by changing the colRedundancy option in /opt/IBM/zimon/
defaults/ZIMonSensors.cfg as explained in the “Configuring the sensor” on page 107 section.
Note: The federation interconnections require the IP address of peer collectors to be reverse-resolvable
to the long daemon name, which is used for the mmperfmon config --collectors option.

Migrating the pmcollector


Follow these steps to migrate an old collector node A to a new collector node B. Also, ensure that the new
collector node has sufficient memory and file system space. For more information, see Planning for
performance monitoring tool in the IBM Storage Scale: Concepts, Planning, and Installation Guide for the
recommendations.
1. Install the pmcollector packages on the new node B. Do not start the pmcollector.service on the
new node.

Note: If a federation with other collectors is used, the version of the pmcollector package needs to be
the same on all the collector nodes that are being used in the federation.
2. Stop the pmcollector.service on all the collector nodes in the cluster by using the following
command:

mmdsh -N all systemctl stop pmcollector

3. Disable the service on the old collector node A by using the following command:

systemctl disable pmcollector --now

4. Change the peers section in the /opt/IBM/zimon/ZIMonCollector.cfg file on all the collector
nodes so that the new collector node B is included, and the old collector node A is removed.
Note: You must not edit the other collectors in the peers section.
5. Change the collector setting for the sensors by using the following command:

mmperfmon config update --collectors=nodeB[,list_of_other_collector_nodes]

Note: All used collectors need to be mentioned in the command.


6. Stop pmcollector.service, and clean the data folder on the new collector node B by using the
following commands:

systemctl disable pmcollector


systemctl stop pmcollector
rm -r /opt/IBM/zimon/data/*

7. Move the complete data folder with its sub-folders from the old collector node A to the new collector
node B by using the following command:

scp -r nodeA:/opt/IBM/zimon/data nodeB:/opt/IBM/zimon/

8. Start the pmcollector.service on the new collector node B by using the following command:

systemctl enable pmcollector


systemctl start pmcollector

9. Remove the pmcollector packages on the old collector node A.


Note:
• The GUI must have a pmcollector on the same node for performance graphs. You can either move the
GUI node along with the pmcollector, or not move the pmcollector.
• It is not possible to merge data if data is not collected by the new collector node.

List of performance metrics


The performance monitoring tool can report the following metrics:

Linux metrics
All network and general metrics are native without any computed metrics. The following section lists all
the Linux metrics:

CPU
The following section lists information about CPU in the system. For example, myMachine|CPU|
cpu_user.
• cpu_contexts: Number of context switches across all CPU cores.
• cpu_guest: Percentage of total CPU spent running a guest OS. Included in cpu_user.
• cpu_guest_nice: Percentage of total CPU spent running as nice guest OS. Included in cpu_nice.

• cpu_hiq: Percentage of total CPU spent serving hardware interrupts.
• cpu_idle: Percentage of total CPU spent idling.
• cpu_interrupts: Number of interrupts serviced.
• cpu_iowait: Percentage of total CPU spent waiting for I/O to complete.
• cpu_nice: Percentage of total CPU time spent in lowest-priority user processes.
• cpu_siq: Percentage of total CPU spent serving software interrupts.
• cpu_steal: Percentage of total CPU spent waiting for other OS in a virtualized environment.
• cpu_system: Percentage of total CPU time spent in kernel mode.
• cpu_user: Percentage of total CPU time spent in normal priority user processes.
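To view these metrics from the command line, a query similar to the following sketch can be used. The
metric selection is illustrative; for the full query syntax and options, see the mmperfmon command in the
IBM Storage Scale: Command and Programming Reference Guide:

mmperfmon query cpu_user,cpu_system,cpu_idle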

DiskFree
The following section lists information about the free disk. Each mounted directory has a separate
section. For example, myMachine|DiskFree|myMount|df_free.
• df_free: Amount of free disk space on the file system.
• df_total: Amount of total disk space on the file system.
• df_used: Amount of used disk space on the file system.

Diskstat
The following section lists information about the disk status for each of the disks. For example,
myMachine|Diskstat|myDisk|disk_active_ios.
• disk_active_ios: Number of I/O operations currently in progress.
• disk_aveq: Weighted number of milliseconds spent doing I/Os.
• disk_io_time: Number of milliseconds the system spent doing I/O operation.
• disk_read_ios: Total number of read operations completed successfully.
• disk_read_merged: Number of (small) read operations that are merged into a larger read.
• disk_read_sect: Number of sectors read.
• disk_read_time: Amount of time in milliseconds spent reading.
• disk_write_ios: Number of write operations completed successfully.
• disk_write_merged: Number of (small) write operations that are merged into a larger write.
• disk_write_sect: Number of sectors written.
• disk_write_time: Amount of time in milliseconds spent writing.

Load
The following section lists information about the load statistics for a particular node. For example,
myMachine|Load|jobs.
• jobs: The total number of jobs that currently exist in the system.
• load1: The average load (number of jobs in the run queue) over the last minute.
• load15: The average load (number of jobs in the run queue) over the last 15 minutes.
• load5: The average load (number of jobs in the run queue) over the last 5 minutes.

Memory
The following section lists information about the memory statistics for a particular node. For example,
myMachine|Memory|mem_active.
• mem_active: Active memory that was recently accessed.

• mem_active_anon: Active memory with no file association, that is, heap and stack memory.
• mem_active_file: Active memory that is associated with a file, for example, page cache memory.
• mem_available: Estimated value of memory that is available for starting new applications without
swapping.
• mem_buffers: Temporary storage used for raw disk blocks.
• mem_cached: In-memory cache for files read from disk (the page cache). Does not include
mem_swapcached.
• mem_dirty: Memory, which is waiting to get written back to the disk.
• mem_inactive: Inactive memory that is not accessed recently.
• mem_inactive_anon: Inactive memory with no file association, that is, inactive heap and stack
memory.
• mem_inactive_file: Inactive memory that is associated with a file, for example, page cache memory.
• mem_memfree: Total free RAM.
• mem_memtotal: Total usable RAM.
• mem_mlocked: Memory that is locked.
• mem_swapcached: In-memory cache for pages that are swapped back in.
• mem_swapfree: Amount of swap space that remains unused.
• mem_swaptotal: Total amount of swap space available.
• mem_unevictable: Memory that cannot be paged out.

Netstat
The following section lists information about the network status for a particular node. For example,
myMachine|Netstat|ns_remote_bytes_r.
• ns_closewait: Number of connections in the TCP_CLOSE_WAIT state.
• ns_established: Number of connections in the TCP_ESTABLISHED state.
• ns_listen: Number of connections in the TCP_LISTEN state.
• ns_local_bytes_r: Number of bytes received from a local node to a local node.
• ns_local_bytes_s: Number of bytes sent from a local node to a local node.
• ns_localconn: Number of local connections from a local node to a local node.
• ns_remote_bytes_r: Number of bytes received by a local node from a remote node.
• ns_remote_bytes_s: Number of bytes sent from a local node to a remote node.
• ns_remoteconn: Number of remote connections from a local node to a remote node.
• ns_timewait: Number of connections in the TCP_TIME_WAIT state.

Network
The following section lists information about the network statistics per interface for a particular node. For
example, myMachine|Network|myInterface|netdev_bytes_r.
• netdev_bytes_r: Number of bytes received.
• netdev_bytes_s: Number of bytes sent.
• netdev_carrier: Number of carrier loss events.
• netdev_collisions: Number of collisions.
• netdev_compressed_r: Number of compressed frames received.
• netdev_compressed_s: Number of compressed packets sent.
• netdev_drops_r: Number of packets dropped while receiving.

• netdev_drops_s: Number of packets dropped while sending.
• netdev_errors_r: Number of read errors.
• netdev_errors_s: Number of write errors.
• netdev_fifo_r: Number of FIFO buffer errors while receiving.
• netdev_fifo_s: Number of FIFO buffer errors while sending.
• netdev_frames_r: Number of frame errors while receiving.
• netdev_multicast_r: Number of multicast packets received.
• netdev_packets_r: Number of packets received.
• netdev_packets_s: Number of packets sent.

TopProc sensor
The TopProc sensor collects aggregated resource consumption for the 10 most CPU-consuming processes. Unlike other performance monitoring sensors, its statistics are checked by using the mmperfmon report top command rather than the mmperfmon query command. For more information, see the mmperfmon command in the IBM Storage Scale: Command and Programming Reference Guide.
For example,

# mmperfmon report top -b 60 -n 1


Top values in format:
{process name | PID} {cpu per mil} {memory per mil}

Time node-21
2023-10-30-18:24:00 mmsysmon.py 28 13
pmcollector 13 8
tuned 12 3
mmfsd 12 107
java 9 49
pmsensors 7 1
vmtoolsd 5 1
kworker/1:1-ev 3 0
systemd 1 1
ksoftirqd/0 1 0

• {process name | PID}: The process name or the PID.
• {cpu per mil}: The CPU usage of the process in per mil (one-tenth of a percent) of the total system CPU capacity.
• {memory per mil}: The memory usage of the process in per mil (one-tenth of a percent) of the total system RAM (sysInfo.totalram).

GPFS metrics
The following section lists all the GPFS metrics:
• “GPFSDisk” on page 121
• “GPFSDiskCap” on page 121
• “GPFSFileset” on page 122
• “GPFSFileSystem” on page 122
• “GPFSFileSystemAPI” on page 123
• “GPFSLROC” on page 124
• “GPFSNode” on page 125
• “GPFSNodeAPI” on page 126
• “GPFSNSDDisk” on page 126
• “GPFSNSDFS” on page 127
• “GPFSNSDPool” on page 127

• “GPFSPDDisk” on page 128
• “GPFSPool” on page 128
• “GPFSPoolIO” on page 128
• “GPFSQoS” on page 128
• “GPFSVFSX” on page 129
• “GPFSVIO64” on page 134
• “GPFSWaiters” on page 135
• GPFSFilesetQuota
• “GPFSBufMgr” on page 135
• GPFSRPCS
• “Computed Metrics” on page 136

GPFSDisk
For each NSD in the system, for example myMachine|GPFSDisk|myCluster|myFilesystem|myNSD|
gpfs_ds_bytes_read
• gpfs_ds_bytes_read: Number of bytes read.
• gpfs_ds_bytes_written: Number of bytes written.
• gpfs_ds_max_disk_wait_rd: The longest time spent waiting for a disk read operation.
• gpfs_ds_max_disk_wait_wr: The longest time spent waiting for a disk write operation.
• gpfs_ds_max_queue_wait_rd: The longest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_ds_max_queue_wait_wr: The longest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_ds_min_disk_wait_rd: The shortest time spent waiting for a disk read operation.
• gpfs_ds_min_disk_wait_wr: The shortest time spent waiting for a disk write operation.
• gpfs_ds_min_queue_wait_rd: The shortest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_ds_min_queue_wait_wr: The shortest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_ds_read_ops: Number of read operations.
• gpfs_ds_tot_disk_wait_rd: The total time in seconds spent waiting for disk read operations.
• gpfs_ds_tot_disk_wait_wr: The total time in seconds spent waiting for disk write operations.
• gpfs_ds_tot_queue_wait_rd: The total time that is spent between being enqueued for a read
operation and the completion of that operation.
• gpfs_ds_tot_queue_wait_wr: The total time that is spent between being enqueued for a write
operation and the completion of that operation.
• gpfs_ds_write_ops: Number of write operations.

GPFSDiskCap
Specifies the available disk space capacity on GPFS file systems per pool and per disk.
The key structure is:
<gpfs_cluster_name>|GPFSDiskCap|<gpfs_fs_name>|<gpfs_diskpool_name>|
<gpfs_disk_name>|<metric_name>
Following are the metrics (metric_name):
• gpfs_disk_disksize: Total size of disk.

• gpfs_disk_free_fullkb: Total available space on disk in full blocks.
• gpfs_disk_free_fragkb: Total available space on disk in fragments.
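The key structure shown above maps directly onto an mmperfmon query. The following sketch is illustrative only: the metric names come from this list, while the one-day bucket size (-b 86400 -n 1) is an assumption for a capacity sensor that samples once per day.

# Illustrative query of the per-disk capacity metrics over the last day
mmperfmon query gpfs_disk_disksize,gpfs_disk_free_fullkb,gpfs_disk_free_fragkb -b 86400 -n 1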

GPFSInodeCap
Specifies the available inode capacity on GPFS file systems.
The key structure is:
<gpfs_cluster_name>|GPFSInodeCap|<gpfs_fs_name>|<metric_name>
Following are the metrics (metric_name):
• gpfs_fs_inode_used: Number of used inodes.
• gpfs_fs_inode_free: Number of free inodes.
• gpfs_fs_inode_alloc: Number of allocated inodes.
• gpfs_fs_inode_max: Maximum number of inodes.
Note: The GPFSInodeCap is not an independent sensor in the perfmon config, but a sub-sensor of the
GPFSDiskCap sensor.

GPFSPoolCap
Specifies the available disk space capacity on GPFS file systems per pool and per disk usage type.
The key structure is:
<gpfs_cluster_name>|GPFSPoolCap|<gpfs_fs_name>|<gpfs_diskpool_name>|
<gpfs_disk_usage_name>|<metric_name>
The gpfs_disk_usage_name can be either of the following values:
• dataAndMetadata
• dataOnly
• descOnly
• metadataOnly
Following are the metrics (metric_name):
• gpfs_pool_disksize: Total size of all disks for this usage type.
• gpfs_pool_free_fullkb: Total available disk space in full blocks for this usage type.
• gpfs_pool_free_fragkb: Total available space in fragments for this usage type.
Note: The GPFSPoolCap is not an independent sensor in the perfmon config, but a sub-sensor of the
GPFSDiskCap sensor.

GPFSFileset
For each independent fileset in the file system: Cluster name - GPFSFileset - filesystem name - fileset
name.
For example, myCluster|GPFSFileset|myFilesystem|myFileset|gpfs_fset_maxInodes.
• gpfs_fset_maxInodes: Maximum number of inodes for this independent fileset.
• gpfs_fset_freeInodes: Number of free inodes available for this independent fileset.
• gpfs_fset_allocInodes: Number of inodes allocated for this independent fileset.

GPFSFileSystem
For each file system, for example myMachine|GPFSFilesystem|myCluster|myFilesystem|
gpfs_fs_bytes_read

• gpfs_fs_bytes_read: Number of bytes read.
• gpfs_fs_bytes_written: Number of bytes written.
• gpfs_fs_disks: Number of disks in the file system.
• gpfs_fs_max_disk_wait_rd: The longest time spent waiting for a disk read operation.
• gpfs_fs_max_disk_wait_wr: The longest time spent waiting for a disk write operation.
• gpfs_fs_max_queue_wait_rd: The longest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_fs_max_queue_wait_wr: The longest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_fs_min_disk_wait_rd: The shortest time spent waiting for a disk read operation.
• gpfs_fs_min_disk_wait_wr: The shortest time spent waiting for a disk write operation.
• gpfs_fs_min_queue_wait_rd: The shortest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_fs_min_queue_wait_wr: The shortest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_fs_read_ops: Number of read operations.
• gpfs_fs_tot_disk_wait_rd: The total time in seconds spent waiting for disk read operations.
• gpfs_fs_tot_disk_wait_wr: The total time in seconds spent waiting for disk write operations.
• gpfs_fs_tot_queue_wait_rd: The total time that is spent between being enqueued for a read
operation and the completion of that operation.
• gpfs_fs_tot_queue_wait_wr: The total time that is spent between being enqueued for a write
operation and the completion of that operation.
• gpfs_fs_write_ops: Number of write operations.

Note:
The behavior of the minimum and maximum wait time for read and write I/O to disk and the queue
wait time, for example metrics such as *max_disk_wait_wr and *max_queue_wait_wr, has changed.
These metrics are now reset each time a sample is taken.
In previous releases, these metrics held the minimum or maximum values observed since the start of the mmfsd daemon and were reset only when the mmfsd daemon was restarted. Because the metrics are now reset each time a sample is taken, the reported maximum and minimum values reflect only the I/O operations observed within a single sensor period.

GPFSFileSystemAPI
These metrics give the following information for each file system (application view). For example,
myMachine|GPFSFilesystemAPI|myCluster|myFilesystem|gpfs_fis_bytes_read.
• gpfs_fis_bytes_read: Number of bytes read.
• gpfs_fis_bytes_written: Number of bytes written.
• gpfs_fis_close_calls: Number of close calls.
• gpfs_fis_disks: Number of disks in the file system.
• gpfs_fis_inodes_written: Number of inode updates to disk.
• gpfs_fis_open_calls: Number of open calls.
• gpfs_fis_read_calls: Number of read calls.
• gpfs_fis_readdir_calls: Number of readdir calls.
• gpfs_fis_write_calls: Number of write calls.

GPFSLROC
These metrics provide the following details about the Local Read Only Cache (LROC) operations:
For example, myNode|GPFSLROC|gpfs_lroc_capmb.
Note: All minimum (*_rmin), maximum (*_rmax), and average (*_ravg) response time metrics are reset only when the mmfsd daemon is restarted, or by using the mmpmon reset or mmfsadm resetstats commands.
• gpfs_lroc_agt_i: Total number of inserts from the FLEA agent.
• gpfs_lroc_agt_i_ravg: Average response time in microseconds taken by the FLEA agent to insert the data.
• gpfs_lroc_agt_i_rmax: Maximum response time in microseconds taken by the FLEA agent to insert the data.
• gpfs_lroc_agt_i_rmin: Minimum response time in microseconds taken by the FLEA agent to insert the data.
• gpfs_lroc_agt_r: Total number of reads from the FLEA agent.
• gpfs_lroc_agt_r_ravg: Average response time in microseconds taken by the FLEA agent to read the data.
• gpfs_lroc_agt_r_rmax: Maximum response time in microseconds taken by the FLEA agent to read the data.
• gpfs_lroc_agt_r_rmin: Minimum response time in microseconds taken by the FLEA agent to read the data.
• gpfs_lroc_capmb: Total capacity of space in MB.
• gpfs_lroc_data_abs: Average amount of data in bytes that is stored in the cache per buffer
descriptor.
• gpfs_lroc_data_i: Total number of data invalidations in the cache.
• gpfs_lroc_data_if: Total number of failures to invalidate data from the cache.
• gpfs_lroc_data_imb: Total amount of invalidated data in MB.
• gpfs_lroc_data_inr: Total number of data invalidations without being recalled from the cache.
• gpfs_lroc_data_q: Total number of data queries in the cache.
• gpfs_lroc_data_qf: Total number of data query failures.
• gpfs_lroc_data_qmb: Total amount of data in MB, which is queried in the cache.
• gpfs_lroc_data_r: Total number of data items that are recalled from the cache.
• gpfs_lroc_data_rf: Total number of failures to recall data from the cache.
• gpfs_lroc_data_rmb: Total amount of data in MB, which is recalled from the cache.
• gpfs_lroc_data_s: Total number of data items that are stored in the cache.
• gpfs_lroc_data_sf: Total number of failures to store data in the cache.
• gpfs_lroc_data_smb: Total amount of space in MB used to cache the data.
• gpfs_lroc_directory_i: Total number of directories that are invalidated from the cache.
• gpfs_lroc_directory_if: Total number of failures to invalidate directory data from the cache.
• gpfs_lroc_directory_imb: Total amount of invalidated directory data in MB.
• gpfs_lroc_directory_inr: Total number of directories that are invalidated without being recalled
from the cache.
• gpfs_lroc_directory_q: Total number of directories that are queried in the cache.
• gpfs_lroc_directory_qf: Total number of directory data query failures.
• gpfs_lroc_directory_qmb: Total amount of directory data in MB, which is queried in the cache.

• gpfs_lroc_directory_r: Total number of directories that are recalled from the cache.
• gpfs_lroc_directory_rf: Total number of failures to recall the directory data.
• gpfs_lroc_directory_rmb: Total amount of directory data in MB, which is recalled from the cache.
• gpfs_lroc_directory_s: Total number of directories that are stored in the cache.
• gpfs_lroc_directory_sf: Total number of failures to store the directory data in the cache.
• gpfs_lroc_directory_smb: Total amount of space in MB, which is used to cache the directory data.
• gpfs_lroc_inode_i: Total number of inodes that are invalidated from the cache.
• gpfs_lroc_inode_if: Total number of failures to invalidate the inode data from the cache.
• gpfs_lroc_inode_imb: Total amount of invalidated inode data in MB.
• gpfs_lroc_inode_inr: Total number of inodes that are invalidated without being recalled from the
cache.
• gpfs_lroc_inode_q: Total number of inodes that are queried in the cache.
• gpfs_lroc_inode_qf: Total number of inode query failures.
• gpfs_lroc_inode_qmb: Total amount of space in MB of inode data that is queried in the cache.
• gpfs_lroc_inode_r: Total number of inodes that are recalled from the cache.
• gpfs_lroc_inode_rf: Total number of failures to recall inode data.
• gpfs_lroc_inode_rmb: Total amount of space in MB of inode data that is recalled from the cache.
• gpfs_lroc_inode_s: Total number of inodes that are stored in the cache.
• gpfs_lroc_inode_sf: Total number of failures to store inode data in the cache.
• gpfs_lroc_inode_smb: Total amount of space in MB that is used to cache the inodes.
• gpfs_lroc_ssd_r: Total number of IOPs that are read from the SSD device.
• gpfs_lroc_ssd_r_ravg: Average response time in microseconds that is taken by the SSD device to
read the data.
• gpfs_lroc_ssd_r_rmax: Maximum response time in microseconds, which is taken by the SSD device
to read the data.
• gpfs_lroc_ssd_r_rmin: Minimum response time in microseconds, which is taken by the SSD device
to read the data.
• gpfs_lroc_ssd_r_p: Total number of pages that are read from the SSD device.
• gpfs_lroc_ssd_w: Total number of IOPs that are written to the SSD device.
• gpfs_lroc_ssd_w_p: Total number of pages that are written to the SSD device.
• gpfs_lroc_ssd_w_ravg: Average response time in microseconds, which is taken by the SSD device
to write the data.
• gpfs_lroc_ssd_w_rmax: Maximum response time in microseconds, which is taken by the SSD device
to write the data.
• gpfs_lroc_ssd_w_rmin: Minimum response time in microseconds, which is taken by the SSD device
to write the data.
• gpfs_lroc_usemb: Total used space in MB.

GPFSNode
These metrics give the following information for a particular node. For example, myNode|GPFSNode|
gpfs_ns_bytes_read.
• gpfs_ns_bytes_read: Number of bytes read.
• gpfs_ns_bytes_written: Number of bytes written.
• gpfs_ns_clusters: Number of clusters that are participating.
• gpfs_ns_disks: Number of disks in all mounted file systems.

• gpfs_ns_filesys: Number of mounted file systems.
• gpfs_ns_max_disk_wait_rd: The longest time spent waiting for a disk read operation.
• gpfs_ns_max_disk_wait_wr: The longest time spent waiting for a disk write operation.
• gpfs_ns_max_queue_wait_rd: The longest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_ns_max_queue_wait_wr: The longest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_ns_min_disk_wait_rd: The shortest time spent waiting for a disk read operation.
• gpfs_ns_min_disk_wait_wr: The shortest time spent waiting for a disk write operation.
• gpfs_ns_min_queue_wait_rd: The shortest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_ns_min_queue_wait_wr: The shortest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_ns_read_ops: Number of read operations.
• gpfs_ns_tot_disk_wait_rd: The total time in seconds spent waiting for disk read operations.
• gpfs_ns_tot_disk_wait_wr: The total time in seconds spent waiting for disk write operations.
• gpfs_ns_tot_queue_wait_rd: The total time that is spent between being enqueued for a read
operation and the completion of that operation.
• gpfs_ns_tot_queue_wait_wr: The total time that is spent between being enqueued for a write
operation and the completion of that operation.
• gpfs_ns_write_ops: Number of write operations.
Note:
The behavior of the minimum and maximum wait time for read and write I/O to disk and the queue
wait time, for example metrics such as *max_disk_wait_wr and *max_queue_wait_wr, has changed.
These metrics are now reset each time a sample is taken.
In previous releases, these metrics held the minimum or maximum values observed since the start of the mmfsd daemon and were reset only when the mmfsd daemon was restarted. Because the metrics are now reset each time a sample is taken, the reported maximum and minimum values reflect only the I/O operations observed within a single sensor period.

GPFSNodeAPI
These metrics give the following information for a particular node from its application point of view. For
example, myMachine|GPFSNodeAPI|gpfs_is_bytes_read.
• gpfs_is_bytes_read: Number of bytes read.
• gpfs_is_bytes_written: Number of bytes written.
• gpfs_is_close_calls: Number of close calls.
• gpfs_is_inodes_written: Number of inode updates to disk.
• gpfs_is_open_calls: Number of open calls.
• gpfs_is_readDir_calls: Number of readdir calls.
• gpfs_is_read_calls: Number of read calls.
• gpfs_is_write_calls: Number of write calls.

GPFSNSDDisk
These metrics give the following information about each NSD disk on the NSD server. For example,
myMachine|GPFSNSDDisk|myNSDDisk|gpfs_nsdds_bytes_read.
• gpfs_nsdds_bytes_read: Number of bytes read.

• gpfs_nsdds_bytes_written: Number of bytes written.
• gpfs_nsdds_max_disk_wait_rd: The longest time spent waiting for a disk read operation.
• gpfs_nsdds_max_disk_wait_wr: The longest time spent waiting for a disk write operation.
• gpfs_nsdds_max_queue_wait_rd: The longest time between being enqueued for a disk read
operation and the completion of that operation.
• gpfs_nsdds_max_queue_wait_wr: The longest time between being enqueued for a disk write
operation and the completion of that operation.
• gpfs_nsdds_min_disk_wait_rd: The shortest time spent waiting for a disk read operation.
• gpfs_nsdds_min_disk_wait_wr: The shortest time spent waiting for a disk write operation.
• gpfs_nsdds_min_queue_wait_rd: The shortest time between being enqueued for a disk read
operation and the completion of that operation.
• gpfs_nsdds_min_queue_wait_wr: The shortest time between being enqueued for a disk write
operation and the completion of that operation.
• gpfs_nsdds_read_ops: Number of read operations.
• gpfs_nsdds_tot_disk_wait_rd: The total time in seconds spent waiting for disk read operations.
• gpfs_nsdds_tot_disk_wait_wr: The total time in seconds spent waiting for disk write operations.
• gpfs_nsdds_tot_queue_wait_rd: The total time that is spent between being enqueued for a read
operation and the completion of that operation.
• gpfs_nsdds_tot_queue_wait_wr: The total time that is spent between being enqueued for a write
operation and the completion of that operation.
• gpfs_nsdds_write_ops: Number of write operations.
Note:
The behavior of the minimum and maximum wait time for read and write I/O to disk and the queue
wait time, for example metrics such as *max_disk_wait_wr and *max_queue_wait_wr, has changed.
These metrics are now reset each time a sample is taken.
In previous releases, these metrics held the minimum or maximum values observed since the start of the mmfsd daemon and were reset only when the mmfsd daemon was restarted. Because the metrics are now reset each time a sample is taken, the reported maximum and minimum values reflect only the I/O operations observed within a single sensor period.

GPFSNSDFS
These metrics give the following information for each file system served by a specific NSD server. For
example, myMachine|GPFSNSDFS|myFilesystem|gpfs_nsdfs_bytes_read.
• gpfs_nsdfs_bytes_read: Number of NSD bytes read, aggregated to the file system.
• gpfs_nsdfs_bytes_written: Number of NSD bytes written, aggregated to the file system.
• gpfs_nsdfs_read_ops: Number of NSD read operations, aggregated to the file system.
• gpfs_nsdfs_write_ops: Number of NSD write operations, aggregated to the file system.

GPFSNSDPool
These metrics give the following information for each file system and pool that is served
by a specific NSD server. For example, myMachine|GPFSNSDPool|myFilesystem|myPool|
gpfs_nsdpool_bytes_read.
• gpfs_nsdpool_bytes_read: Number of NSD bytes read, aggregated to the pool.
• gpfs_nsdpool_bytes_written: Number of NSD bytes written, aggregated to the pool.
• gpfs_nsdpool_read_ops: Number of NSD read operations, aggregated to the pool.
• gpfs_nsdpool_write_ops: Number of NSD write operations, aggregated to the pool.

GPFSPDDisk
The sensor GPFSPDDisk provides Pdisk server statistics.
• gpfs_pdds_read_ops: The number of disk read operations.
• gpfs_pdds_bytes_read: The number of bytes read from disk.
• gpfs_pdds_tot_disk_wait_rd: The total time spent waiting for disk read operations.
• gpfs_pdds_tot_queue_wait_rd: The total time spent between being enqueued for a disk read
operation and the completion of that operation.
• gpfs_pdds_min_disk_wait_rd: The shortest time spent waiting for a disk read operation.
• gpfs_pdds_min_queue_wait_rd: The shortest time between being enqueued for a disk read
operation and the completion of that operation.
• gpfs_pdds_max_disk_wait_rd: The longest time spent waiting for a disk read operation.
• gpfs_pdds_max_queue_wait_rd: The longest time between being enqueued for a disk read
operation and the completion of that operation.
• gpfs_pdds_write_ops: The number of disk write operations.
• gpfs_pdds_bytes_written: The number of bytes written to disk.
• gpfs_pdds_tot_disk_wait_wr: The total time spent waiting for disk write operations.
• gpfs_pdds_tot_queue_wait_wr: The total time spent between being enqueued for a disk write
operation and the completion of that operation.
• gpfs_pdds_min_disk_wait_wr: The shortest time spent waiting for a disk write operation.
• gpfs_pdds_min_queue_wait_wr: The shortest time between being enqueued for a disk write
operation and the completion of that operation.
• gpfs_pdds_max_disk_wait_wr: The longest time spent waiting for a disk write operation.
• gpfs_pdds_max_queue_wait_wr: The longest time between being enqueued for a disk write
operation and the completion of that operation.

GPFSPool
For each pool in each file system: Cluster name - GPFSPool - filesystem name - pool name.
For example, myCluster|GPFSPool|myFilesystem|myPool|gpfs_pool_free_dataKB.
• gpfs_pool_free_dataKB: Free capacity for data (in KB) in the pool.
• gpfs_pool_total_dataKB: Total capacity for data (in KB) in the pool.
• gpfs_pool_free_metaKB: Free capacity for metadata (in KB) in the pool.
• gpfs_pool_total_metaKB: Total capacity for metadata (in KB) in the pool.

GPFSPoolIO
These metrics give the details about each cluster, file system, and pool in the system, from the point of
view of a specific node. For example, myMachine|GPFSPoolIO|myCluster|myFilesystem|myPool|
gpfs_pool_bytes_rd
• gpfs_pool_bytes_rd: Bytes read from the pool.
• gpfs_pool_bytes_wr: Bytes written to the pool.
• gpfs_pool_free_fragkb: Total available space in fragments in the pool.

GPFSQoS
These metrics give the following information for each QoS class in the system: Cluster name - GPFSQoS - filesystem name - storage pool name - QoS class name - fileset name. For example, myCluster|GPFSQoS|myFilesystem|data|misc|myFileset|gpfs_qos_iops.

• gpfs_qos_et: The interval in seconds during which the measurement was made.
• gpfs_qos_iops: The performance of the class in I/O operations per second.
• gpfs_qos_ioql: The average number of I/O requests in the class that are pending for reasons other
than being queued by QoS.
• gpfs_qos_maxiops: The limit of the class in I/O operations per second.
• gpfs_qos_maxmbs: The limit of the class in megabytes per second of I/O throughput.
• gpfs_qos_mbs: I/O in megabytes per second.
• gpfs_qos_nnodes: The number of nodes on which throttling of the class is wide open (not limited).
• gpfs_qos_qsdl: The average number of I/O requests in the class that are queued by QoS.

GPFSVFSX
The GPFSVFSX sensor provides virtual file system operations statistics, including the number of
operations and their average, minimum and maximum latency metrics.
Some performance monitoring sensors, such as VFS, are not enabled by default, even though they have
predefined queries that are associated with the mmperfmon query command. This happens because the
collector might have performance issues when it is required to collect more than a million metrics per
second.
To enable the collection of VFS statistics, run the mmfsadm vfsstats enable command on the node. Then, to enable the GPFSVFSX sensor, set its period value to an integer n greater than zero by using the following command:

mmperfmon config update GPFSVFSX.period=n

Later, the VFS statistics can be disabled by using the mmfsadm vfsstats disable command, and the
GPFSVFSX sensor can be disabled by using the following command:

mmperfmon config update GPFSVFSX.period=0
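
As a minimal sketch of the whole cycle (the 10-second period and the bucket values are illustrative assumptions, not documented defaults), the commands might be combined as follows:

# Enable VFS statistics collection in the GPFS daemon
mmfsadm vfsstats enable
# Activate the GPFSVFSX sensor with a 10-second period
mmperfmon config update GPFSVFSX.period=10
# Inspect the collected read and write counts and times (five 60-second buckets)
mmperfmon query gpfs_vfsx_read,gpfs_vfsx_read_t,gpfs_vfsx_write,gpfs_vfsx_write_t -b 60 -n 5
# Disable collection again when it is no longer needed
mmperfmon config update GPFSVFSX.period=0
mmfsadm vfsstats disable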

The GPFSVFSX sensor provides the following metrics:


• gpfs_vfsx_accesses: Number of accesses operations.
• gpfs_vfsx_accesses_t: Amount of time in seconds spent in accesses operations.
• gpfs_vfsx_accesses_tmax: Maximum amount of time in seconds spent in accesses operations.
• gpfs_vfsx_accesses_tmin: Minimum amount of time in seconds spent in accesses operations.
• gpfs_vfsx_aioread: Number of aioread operations.
• gpfs_vfsx_aioread_t: Amount of time in seconds spent in aioread operations.
• gpfs_vfsx_aioread_tmax: Maximum amount of time in seconds spent in aioread operations.
• gpfs_vfsx_aioread_tmin: Minimum amount of time in seconds spent in aioread operations.
• gpfs_vfsx_aiowrite: Number of aiowrite operations.
• gpfs_vfsx_aiowrite_t: Amount of time in seconds spent in aiowrite operations.
• gpfs_vfsx_aiowrite_tmax: Maximum amount of time in seconds spent in aiowrite operations.
• gpfs_vfsx_aiowrite_tmin: Minimum amount of time in seconds spent in aiowrite operations.
• gpfs_vfsx_clear: Number of clear operations.
• gpfs_vfsx_clear_t: Amount of time in seconds spent in clear operations.
• gpfs_vfsx_clear_tmax: Maximum amount of time in seconds spent in clear operations.
• gpfs_vfsx_clear_tmin: Minimum amount of time in seconds spent in clear operations.
• gpfs_vfsx_close: Number of close operations.
• gpfs_vfsx_close_t: Amount of time in seconds spent in close operations.
• gpfs_vfsx_close_tmax: Maximum amount of time in seconds spent in close operations.

• gpfs_vfsx_close_tmin: Minimum amount of time in seconds spent in close operations.
• gpfs_vfsx_create: Number of create operations.
• gpfs_vfsx_create_t: Amount of time in seconds spent in create operations.
• gpfs_vfsx_create_tmax: Maximum amount of time in seconds spent in create operations.
• gpfs_vfsx_create_tmin: Minimum amount of time in seconds spent in create operations.
• gpfs_vfsx_decodeFh: Number of decodeFh operations.
• gpfs_vfsx_decodeFh_t: Amount of time in seconds spent in decodeFh operations.
• gpfs_vfsx_decodeFh_tmax: Maximum amount of time in seconds spent in decodeFh operations.
• gpfs_vfsx_decodeFh_tmin: Minimum amount of time in seconds spent in decodeFh operations.
• gpfs_vfsx_encodeFh: Number of encodeFh operations.
• gpfs_vfsx_encodeFh_t: Amount of time in seconds spent in encodeFh operations.
• gpfs_vfsx_encodeFh_tmax: Maximum amount of time in seconds spent in encodeFh operations.
• gpfs_vfsx_encodeFh_tmin: Minimum amount of time in seconds spent in encodeFh operations.
• gpfs_vfsx_flock: Number of flock operations.
• gpfs_vfsx_flock_t: Amount of time in seconds spent in flock operations.
• gpfs_vfsx_flock_tmax: Maximum amount of time in seconds spent in flock operations.
• gpfs_vfsx_flock_tmin: Minimum amount of time in seconds spent in flock operations.
• gpfs_vfsx_fsync: Number of fsync operations.
• gpfs_vfsx_fsync_t: Amount of time in seconds spent in fsync operations.
• gpfs_vfsx_fsync_tmax: Maximum amount of time in seconds spent in fsync operations.
• gpfs_vfsx_fsync_tmin: Minimum amount of time in seconds spent in fsync operations.
• gpfs_vfsx_fsyncRange: Number of fsyncRange operations.
• gpfs_vfsx_fsyncRange_t: Amount of time in seconds spent in fsyncRange operations.
• gpfs_vfsx_fsyncRange_tmax: Maximum amount of time in seconds spent in fsyncRange
operations.
• gpfs_vfsx_fsyncRange_tmin: Minimum amount of time in seconds spent in fsyncRange operations.
• gpfs_vfsx_ftrunc: Number of ftrunc operations.
• gpfs_vfsx_ftrunc_t: Amount of time in seconds spent in ftrunc operations.
• gpfs_vfsx_ftrunc_tmax: Maximum amount of time in seconds spent in ftrunc operations.
• gpfs_vfsx_ftrunc_tmin: Minimum amount of time in seconds spent in ftrunc operations.
• gpfs_vfsx_getattr: Number of getattr operations.
• gpfs_vfsx_getattr_t: Amount of time in seconds spent in getattr operations.
• gpfs_vfsx_getattr_tmax: Maximum amount of time in seconds spent in getattr operations.
• gpfs_vfsx_getattr_tmin: Minimum amount of time in seconds spent in getattr operations.
• gpfs_vfsx_getDentry: Number of getDentry operations.
• gpfs_vfsx_getDentry_t: Amount of time in seconds spent in getDentry operations.
• gpfs_vfsx_getDentry_tmax: Maximum amount of time in seconds spent in getDentry operations.
• gpfs_vfsx_getDentry_tmin: Minimum amount of time in seconds spent in getDentry operations.
• gpfs_vfsx_getParent: Number of getParent operations.
• gpfs_vfsx_getParent_t: Amount of time in seconds spent in getParent operations.
• gpfs_vfsx_getParent_tmax: Maximum amount of time in seconds spent in getParent operations.
• gpfs_vfsx_getParent_tmin: Minimum amount of time in seconds spent in getParent operations.
• gpfs_vfsx_getxattr: Number of getxattr operations.
• gpfs_vfsx_getxattr_t: Amount of time in seconds spent in getxattr operations.

• gpfs_vfsx_getxattr_tmax: Maximum amount of time in seconds spent in getxattr operations.
• gpfs_vfsx_getxattr_tmin: Minimum amount of time in seconds spent in getxattr operations.
• gpfs_vfsx_link: Number of link operations.
• gpfs_vfsx_link_t: Amount of time in seconds spent in link operations.
• gpfs_vfsx_link_tmax: Maximum amount of time in seconds spent in link operations.
• gpfs_vfsx_link_tmin: Minimum amount of time in seconds spent in link operations.
• gpfs_vfsx_listxattr: Number of listxattr operations.
• gpfs_vfsx_listxattr_t: Amount of time in seconds spent in listxattr operations.
• gpfs_vfsx_listxattr_tmax: Maximum amount of time in seconds spent in listxattr operations.
• gpfs_vfsx_listxattr_tmin: Minimum amount of time in seconds spent in listxattr operations.
• gpfs_vfsx_lockctl: Number of lockctl operations.
• gpfs_vfsx_lockctl_t: Amount of time in seconds spent in lockctl operations.
• gpfs_vfsx_lockctl_tmax: Maximum amount of time in seconds spent in lockctl operations.
• gpfs_vfsx_lockctl_tmin: Minimum amount of time in seconds spent in lockctl operations.
• gpfs_vfsx_lookup: Number of lookup operations.
• gpfs_vfsx_lookup_t: Amount of time in seconds spent in lookup operations.
• gpfs_vfsx_lookup_tmax: Maximum amount of time in seconds spent in lookup operations.
• gpfs_vfsx_lookup_tmin: Minimum amount of time in seconds spent in lookup operations.
• gpfs_vfsx_mapLloff: Number of mapLloff operations.
• gpfs_vfsx_mapLloff_t: Amount of time in seconds spent in mapLloff operations.
• gpfs_vfsx_mapLloff_tmax: Maximum amount of time in seconds spent in mapLloff operations.
• gpfs_vfsx_mapLloff_tmin: Minimum amount of time in seconds spent in mapLloff operations.
• gpfs_vfsx_mkdir: Number of mkdir operations.
• gpfs_vfsx_mkdir_t: Amount of time in seconds spent in mkdir operations.
• gpfs_vfsx_mkdir_tmax: Maximum amount of time in seconds spent in mkdir operations.
• gpfs_vfsx_mkdir_tmin: Minimum amount of time in seconds spent in mkdir operations.
• gpfs_vfsx_mknod: Number of mknod operations.
• gpfs_vfsx_mknod_t: Amount of time in seconds spent in mknod operations.
• gpfs_vfsx_mknod_tmax: Maximum amount of time in seconds spent in mknod operations.
• gpfs_vfsx_mknod_tmin: Minimum amount of time in seconds spent in mknod operations.
• gpfs_vfsx_mmapread: Number of mmapread operations.
• gpfs_vfsx_mmapread_t: Amount of time in seconds spent in mmapread operations.
• gpfs_vfsx_mmapread_tmax: Maximum amount of time in seconds spent in mmapread operations.
• gpfs_vfsx_mmapread_tmin: Minimum amount of time in seconds spent in mmapread operations.
• gpfs_vfsx_mmapwrite: Number of mmapwrite operations.
• gpfs_vfsx_mmapwrite_t: Amount of time in seconds spent in mmapwrite operations.
• gpfs_vfsx_mmapwrite_tmax: Maximum amount of time in seconds spent in mmapwrite operations.
• gpfs_vfsx_mmapwrite_tmin: Minimum amount of time in seconds spent in mmapwrite operations.
• gpfs_vfsx_mount: Number of mount operations.
• gpfs_vfsx_mount_t: Amount of time in seconds spent in mount operations.
• gpfs_vfsx_mount_tmax: Maximum amount of time in seconds spent in mount operations.
• gpfs_vfsx_mount_tmin: Minimum amount of time in seconds spent in mount operations.
• gpfs_vfsx_open: Number of open operations.

• gpfs_vfsx_open_t: Amount of time in seconds spent in open operations.
• gpfs_vfsx_open_tmax: Maximum amount of time in seconds spent in open operations.
• gpfs_vfsx_open_tmin: Minimum amount of time in seconds spent in open operations.
• gpfs_vfsx_read: Number of read operations.
• gpfs_vfsx_read_t: Amount of time in seconds spent in read operations.
• gpfs_vfsx_read_tmax: Maximum amount of time in seconds spent in read operations.
• gpfs_vfsx_read_tmin: Minimum amount of time in seconds spent in read operations.
• gpfs_vfsx_readdir: Number of readdir operations.
• gpfs_vfsx_readdir_t: Amount of time in seconds spent in readdir operations.
• gpfs_vfsx_readdir_tmax: Maximum amount of time in seconds spent in readdir operations.
• gpfs_vfsx_readdir_tmin: Minimum amount of time in seconds spent in readdir operations.
• gpfs_vfsx_readlink: Number of readlink operations.
• gpfs_vfsx_readlink_t: Amount of time in seconds spent in readlink operations.
• gpfs_vfsx_readlink_tmax: Maximum amount of time in seconds spent in readlink operations.
• gpfs_vfsx_readlink_tmin: Minimum amount of time in seconds spent in readlink operations.
• gpfs_vfsx_readpage: Number of readpage operations.
• gpfs_vfsx_readpage_t: Amount of time in seconds spent in readpage operations.
• gpfs_vfsx_readpage_tmax: Maximum amount of time in seconds spent in readpage operations.
• gpfs_vfsx_readpage_tmin: Minimum amount of time in seconds spent in readpage operations.
• gpfs_vfsx_remove: Number of remove operations.
• gpfs_vfsx_remove_t: Amount of time in seconds spent in remove operations.
• gpfs_vfsx_remove_tmax: Maximum amount of time in seconds spent in remove operations.
• gpfs_vfsx_remove_tmin: Minimum amount of time in seconds spent in remove operations.
• gpfs_vfsx_removexattr: Number of removexattr operations.
• gpfs_vfsx_removexattr_t: Amount of time in seconds spent in removexattr operations.
• gpfs_vfsx_removexattr_tmax: Maximum amount of time in seconds spent in removexattr
operations.
• gpfs_vfsx_removexattr_tmin: Minimum amount of time in seconds spent in removexattr
operations.
• gpfs_vfsx_rename: Number of rename operations.
• gpfs_vfsx_rename_t: Amount of time in seconds spent in rename operations.
• gpfs_vfsx_rename_tmax: Maximum amount of time in seconds spent in rename operations.
• gpfs_vfsx_rename_tmin: Minimum amount of time in seconds spent in rename operations.
• gpfs_vfsx_rmdir: Number of rmdir operations.
• gpfs_vfsx_rmdir_t: Amount of time in seconds spent in rmdir operations.
• gpfs_vfsx_rmdir_tmax: Maximum amount of time in seconds spent in rmdir operations.
• gpfs_vfsx_rmdir_tmin: Minimum amount of time in seconds spent in rmdir operations.
• gpfs_vfsx_setacl: Number of setacl operations.
• gpfs_vfsx_setacl_t: Amount of time in seconds spent in setacl operations.
• gpfs_vfsx_setacl_tmax: Maximum amount of time in seconds spent in setacl operations.
• gpfs_vfsx_setacl_tmin: Minimum amount of time in seconds spent in setacl operations.
• gpfs_vfsx_setattr: Number of setattr operations.
• gpfs_vfsx_setattr_t: Amount of time in seconds spent in setattr operations.
• gpfs_vfsx_setattr_tmax: Maximum amount of time in seconds spent in setattr operations.

• gpfs_vfsx_setattr_tmin: Minimum amount of time in seconds spent in setattr operations.
• gpfs_vfsx_setxattr: Number of setxattr operations.
• gpfs_vfsx_setxattr_t: Amount of time in seconds spent in setxattr operations.
• gpfs_vfsx_setxattr_tmax: Maximum amount of time in seconds spent in setxattr operations.
• gpfs_vfsx_setxattr_tmin: Minimum amount of time in seconds spent in setxattr operations.
• gpfs_vfsx_statfs: Number of statfs operations.
• gpfs_vfsx_statfs_t: Amount of time in seconds spent in statfs operations.
• gpfs_vfsx_statfs_tmax: Maximum amount of time in seconds spent in statfs operations.
• gpfs_vfsx_statfs_tmin: Minimum amount of time in seconds spent in statfs operations.
• gpfs_vfsx_symlink: Number of symlink operations.
• gpfs_vfsx_symlink_t: Amount of time in seconds spent in symlink operations.
• gpfs_vfsx_symlink_tmax: Maximum amount of time in seconds spent in symlink operations.
• gpfs_vfsx_symlink_tmin: Minimum amount of time in seconds spent in symlink operations.
• gpfs_vfsx_sync: Number of sync operations.
• gpfs_vfsx_sync_t: Amount of time in seconds spent in sync operations.
• gpfs_vfsx_sync_tmax: Maximum amount of time in seconds spent in sync operations.
• gpfs_vfsx_sync_tmin: Minimum amount of time in seconds spent in sync operations.
• gpfs_vfsx_tsfattr: Number of tsfattr operations.
• gpfs_vfsx_tsfattr_t: Amount of time in seconds spent in tsfattr operations.
• gpfs_vfsx_tsfattr_tmax: Maximum amount of time in seconds spent in tsfattr operations.
• gpfs_vfsx_tsfattr_tmin: Minimum amount of time in seconds spent in tsfattr operations.
• gpfs_vfsx_tsfsattr: Number of tsfsattr operations.
• gpfs_vfsx_tsfsattr_t: Amount of time in seconds spent in tsfsattr operations.
• gpfs_vfsx_tsfsattr_tmax: Maximum amount of time in seconds spent in tsfsattr operations.
• gpfs_vfsx_tsfsattr_tmin: Minimum amount of time in seconds spent in tsfsattr operations.
• gpfs_vfsx_unmap: Number of unmap operations.
• gpfs_vfsx_unmap_t: Amount of time in seconds spent in unmap operations.
• gpfs_vfsx_unmap_tmax: Maximum amount of time in seconds spent in unmap operations.
• gpfs_vfsx_unmap_tmin: Minimum amount of time in seconds spent in unmap operations.
• gpfs_vfsx_vget: Number of vget operations.
• gpfs_vfsx_vget_t: Amount of time in seconds spent in vget operations.
• gpfs_vfsx_vget_tmax: Maximum amount of time in seconds spent in vget operations.
• gpfs_vfsx_vget_tmin: Minimum amount of time in seconds spent in vget operations.
• gpfs_vfsx_write: Number of write operations.
• gpfs_vfsx_write_t: Amount of time in seconds spent in write operations.
• gpfs_vfsx_write_tmax: Maximum amount of time in seconds spent in write operations.
• gpfs_vfsx_write_tmin: Minimum amount of time in seconds spent in write operations.
• gpfs_vfsx_writepage: Number of writepage operations.
• gpfs_vfsx_writepage_t: Amount of time in seconds spent in writepage operations.
• gpfs_vfsx_writepage_tmax: Maximum amount of time in seconds spent in writepage operations.
• gpfs_vfsx_writepage_tmin: Minimum amount of time in seconds spent in writepage operations.
Note: The GPFSVFS sensor cannot be used with IBM Storage Scale 5.1.0 and later versions. When IBM Storage Scale is upgraded to 5.1.0 or later, perform the following manual steps to migrate from the GPFSVFS sensor to the GPFSVFSX sensor:

• The GPFSVFSX sensor must be added to the performance monitoring configuration by using the
following command:

mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSVFSX.cfg

• The GPFSVFS sensor must be disabled by using the following command:

mmperfmon config update GPFSVFS.period=0

External calls to mmpmon vfssx interfere with the GPFSVFSX sensor. Because the mmpmon command resets the vfssx metrics after every call, both the GPFSVFSX sensor and the external caller might retrieve inaccurate data.
Note:
The GPFSVFSX read and write operation time metrics do not provide meaningful minimum and
maximum values. The gpfs_vfsx_read_tmin and gpfs_vfsx_read_tmax have the same values
as the summarized time, which is gpfs_vfsx_read_t. Similarly, the gpfs_vfsx_write_tmin,
gpfs_vfsx_write_tmax, and gpfs_vfsx_write_t have the same values.
This is due to the aggregation of various read sub-operations, such as gpfs_f_read, gpfs_f_readv, and gpfs_f_aio_read (when completed synchronously), into gpfs_vfsx_read, and the aggregation of various write sub-operations, such as gpfs_f_write, gpfs_f_writev, and gpfs_f_aio_write (when completed synchronously), into gpfs_vfsx_write.

GPFSVIO64
These metrics provide details of the virtual I/O server (VIOS) operations, where VIOS is supported.
Note: GPFSVIO64 is a replacement for GPFSVIO sensor and uses 64-bit counters.
• gpfs_vio64_fixitOps: Number of VIO fix-strips operations with a read medium error.
• gpfs_vio64_flushUpWrOps: Number of VIO flush update operations.
• gpfs_vio64_flushPFTWOps: Number of VIO flush promoted full-track write operations.
• gpfs_vio64_forceConsOps: Number of VIO force consistency operations.
• gpfs_vio64_FTWOps: Number of VIO full-track write operations.
• gpfs_vio64_logTipReadOps: Number of VIO log-tip read operations.
• gpfs_vio64_logHomeReadOps: Number of VIO log-home read operations.
• gpfs_vio64_logWriteOps: Number of VIO log write operations.
• gpfs_vio64_medWriteOps: Number of VIO medium write operations.
• gpfs_vio64_metaWriteOps: Number of recovery group metadata write operations.
• gpfs_vio64_migrateTrimOps: Number of migrate trim operations.
• gpfs_vio64_migratedOps: Number of VIO strip migration operations.
• gpfs_vio64_promFTWOps: Number of VIO promoted full-track write operations.
• gpfs_vio64_ptrackTrimOps: Number of ptrack trim operations.
• gpfs_vio64_readCacheHit: Number of read operations with a cache hit.
• gpfs_vio64_readCacheMiss: Number of read operations with a cache miss.
• gpfs_vio64_readOps: Number of VIO read operations.
• gpfs_vio64_RGDWriteOps: Number of recovery group descriptor write operations.
• gpfs_vio64_scrubOps: Number of VIO scrub operations.
• gpfs_vio64_shortWriteOps: Number of VIO short write operations.
• gpfs_vio64_vtrackTrimOps: Number of vtrack trim operations.
Note: To report the new sensor data, the pmcollector must have the same or a higher code version than
the pmsensors module. Otherwise, it ignores the data of the new sensors.

GPFSWaiters
For each waiters time threshold on a node: Node - GPFSWaiters - waiters_time_threshold (all, 0.1s, 0.2s, 0.5s, 1.0s, 30.0s, 60.0s).
Note: Here 'all' implies a waiting time greater than or equal to 0 seconds.
For example, myNode|GPFSWaiters|all|gpfs_wt_count_all.
• gpfs_wt_count_all: Count of all threads with waiting time greater than or equal to
waiters_time_threshold seconds.
• gpfs_wt_count_delay: Count of threads that are waiting for delay interval expiration with waiting
time greater than or equal to waiters_time_threshold seconds.
• gpfs_wt_count_local_io: Count of threads that are waiting for local I/O with waiting time greater
than or equal to waiters_time_threshold seconds.
• gpfs_wt_count_network_io: Count of threads that are waiting for network I/O with waiting time
greater than or equal to waiters_time_threshold seconds.
• gpfs_wt_count_syscall: Count of threads that are waiting for system call completion with waiting
time greater than or equal to waiters_time_threshold seconds.
• gpfs_wt_count_thcond: Count of threads that are waiting for a GPFS condition variable to be
signaled with waiting time greater than or equal to waiters_time_threshold seconds.
• gpfs_wt_count_thmutex: Count of threads that are waiting to lock a GPFS mutex with waiting time
greater than or equal to waiters_time_threshold seconds.
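
A hedged usage sketch follows: the metric names are taken from the list above, while the 60-second bucket size and the five-bucket count are illustrative values.

# Illustrative query of waiter counts across all configured thresholds
mmperfmon query gpfs_wt_count_all,gpfs_wt_count_local_io,gpfs_wt_count_network_io -b 60 -n 5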

GPFSFilesetQuota
The following metrics provide details of a fileset quota:
• gpfs_rq_blk_current: Number of kilobytes currently in use.
• gpfs_rq_blk_soft_limit: Assigned soft quota limit.
• gpfs_rq_blk_hard_limit: Assigned hard quota limit.
• gpfs_rq_blk_in_doubt: Number of kilobytes in-doubt, availability not yet resolved.
• gpfs_rq_file_current: Number of files (inodes) currently in use.
• gpfs_rq_file_soft_limit: Assigned soft quota limit.
• gpfs_rq_file_hard_limit: Assigned hard quota limit.
• gpfs_rq_file_in_doubt: Number of files (inodes) in-doubt, availability not yet resolved.

GPFSBufMgr
The following metric provides the current size of a page pool:
• gpfs_bufm_tot_poolSizeK: Total size of the page pool.
Note: To activate the sensor on upgraded systems, run the mmperfmon config add --
sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSBufMgr.cfg command.

GPFSRPCS
Each metric that the GPFSRPCS sensor provides for a node is combined over all the peers to which the node is connected. An average is a weighted average over all the peer connections. A minimum is the minimum of the per-peer minimums, and a maximum is the maximum of the per-peer maximums.
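As an illustration with invented numbers: if a node has two peers, one contributing 100 samples with an average channel wait of 2 milliseconds and the other 300 samples with an average of 4 milliseconds, the node-level average is the weighted value (100 x 2 + 300 x 4) / 400 = 3.5 milliseconds, while the node-level minimum and maximum are simply the smallest per-peer minimum and the largest per-peer maximum.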
The GPFSRPCS metrics are:
• gpfs_rpcs_chn_av: The average amount of time the RPC must wait for access to a communication
channel to the target node.

• gpfs_rpcs_chn_cnt: The number of nodes local to this cluster (peers) from which this statistic is collected.
• gpfs_rpcs_chn_max: The maximum amount of time the RPC must wait for access to a communication
channel to the target node.
• gpfs_rpcs_chn_min: The minimum amount of time the RPC must wait for access to a communication
channel to the target node.
• gpfs_rpcs_lat_tcp_av: The average latency of the RPC when sent and received over an Ethernet
interface.
• gpfs_rpcs_lat_tcp_cnt: The number of nodes local to this cluster (peers) from which this statistic is collected.
• gpfs_rpcs_lat_tcp_max: The maximum latency of the RPC when sent and received over an Ethernet
interface.
• gpfs_rpcs_lat_tcp_min: The minimum latency of the RPC when sent and received over an Ethernet
interface.
• gpfs_rpcs_rcv_tcp_av: The average amount of time to transfer an RPC message from an Ethernet
interface into the daemon.
• gpfs_rpcs_rcv_tcp_cnt: The number of nodes local to this cluster (peers) from which this statistic is collected.
• gpfs_rpcs_rcv_tcp_max: The maximum amount of time to transfer an RPC message from an
Ethernet interface into the daemon.
• gpfs_rpcs_rcv_tcp_min: The minimum amount of time to transfer an RPC message from an
Ethernet interface into the daemon.
• gpfs_rpcs_snd_tcp_av: The average amount of time to transfer an RPC message to an Ethernet
interface.
• gpfs_rpcs_snd_tcp_cnt: The number of nodes local to this cluster (peers) from which this statistic is collected.
• gpfs_rpcs_snd_tcp_max: The maximum amount of time to transfer an RPC message to an Ethernet
interface.
• gpfs_rpcs_snd_tcp_min: The minimum amount of time to transfer an RPC message to an Ethernet
interface.

Computed Metrics
These metrics can be used only through the mmperfmon query command. The following metrics are
computed for GPFS:
• gpfs_create_avg_lat (latency): gpfs_vfs_create_t / gpfs_vfs_create
• gpfs_read_avg_lat (latency): gpfs_vfs_read_t / gpfs_vfs_read
• gpfs_remove_avg_lat (latency): gpfs_vfs_remove_t / gpfs_vfs_remove
• gpfs_write_avg_lat (latency): gpfs_vfs_write_t / gpfs_vfs_write
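A hedged usage sketch: the computed names are requested through mmperfmon query in the same way as raw metrics; the 60-second bucket size and the ten-bucket count are illustrative values.

# Illustrative query of the computed average read and write latencies
mmperfmon query gpfs_write_avg_lat,gpfs_read_avg_lat -b 60 -n 10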
Important: The performance monitoring information that is collected by the IBM Storage Scale internal monitoring tool and the information that users gather with the mmpmon command might affect each other.

AFM metrics
You can use AFM metrics only when GPFS is configured on your system. The following section lists all the
AFM metrics.
GPFSAFM
• gpfs_afm_avg_time: Average time in seconds that a pending operation waited in the gateway queue
before it is sent to remote system.
• gpfs_afm_bytes_pending: Total number of bytes pending, which are not yet written to the remote
system.

• gpfs_afm_bytes_read: Total number of bytes read from remote system as a result of cache miss.
• gpfs_afm_bytes_written: Total number of bytes written to the remote system as a result of cache
updates.
• gpfs_afm_conn_broken: Total number of times the connection to the remote system was broken.
• gpfs_afm_conn_esta: Total number of times a connection was established with the remote system.
• gpfs_afm_fset_expired: Total number of times the fileset was marked expired due to a
disconnection with remote system and expiry of the configured timeout.
• gpfs_afm_longest_time: Longest time in seconds that a pending operation waited in the gateway
queue before it is sent to remote system.
• gpfs_afm_ops_expired: Number of operations that were sent to remote system because they were
expired after waiting until the configured asynchronous timeout in the gateway queue.
• gpfs_afm_ops_forced: Number of operations that were sent to remote system because they were
forced out of the gateway queue before the configured asynchronous timeout, which might be due to a
dependent operation.
• gpfs_afm_ops_revoked: Number of operations that were sent to the remote system because a conflicting token that was acquired from another GPFS node resulted in a revoke.
• gpfs_afm_ops_sent: Total number of operations that are sent over the communication protocol to
the remote system.
• gpfs_afm_ops_sync: Number of synchronous operations that were sent to remote system.
• gpfs_afm_num_queued_msgs: Number of messages that are currently enqueued.
• gpfs_afm_shortest_time: Shortest time in seconds that a pending operation waited in the gateway
queue before it is sent to remote system.
• gpfs_afm_tot_read_time: Total time in seconds to run read operations from the remote system.
• gpfs_afm_tot_write_time: Total time in seconds to run write operations to the remote system.
• gpfs_afm_used_q_memory: Used memory in bytes by the messages enqueued.
GPFSAFMFS
• gpfs_afm_fs_avg_time: Average time in seconds that a pending operation waited in the gateway
queue before it is sent to remote system for this file system.
• gpfs_afm_fs_bytes_pending: Total number of bytes pending, which are not yet written to the
remote system for this file system.
• gpfs_afm_fs_bytes_read: Total number of bytes read from remote system as a result of cache miss
for this file system.
• gpfs_afm_fs_bytes_written: Total number of bytes written to the remote system as a result of
cache updates for this file system.
• gpfs_afm_fs_conn_broken: Total number of times the connection to the remote system was broken
for this file system.
• gpfs_afm_fs_conn_esta: Total number of times a connection was established with the remote
system for this file system.
• gpfs_afm_fs_fset_expired: Total number of times the fileset was marked expired due to a
disconnection with remote system and expiry of the configured timeout for this file system.
• gpfs_afm_fs_longest_time: Longest time in seconds that a pending operation waited in the
gateway queue before it is sent to remote system for this file system.
• gpfs_afm_fs_num_queued_msgs: Number of messages that are currently queued for this file system.
• gpfs_afm_fs_ops_expired: Number of operations that were sent to remote system because they
were expired after waiting until the configured asynchronous timeout in the gateway queue for this file
system.

• gpfs_afm_fs_ops_forced: Number of operations that were sent to remote system because they
were forced out of the gateway queue before the configured asynchronous timeout. The timeout might
be due to a dependent operation for this file system.
• gpfs_afm_fs_ops_revoked: Number of operations that were sent to the remote system because a conflicting token that was acquired from another GPFS node resulted in a revoke for this file system.
• gpfs_afm_fs_ops_sent: Total number of operations that are sent over the communication protocol
to the remote system for this file system.
• gpfs_afm_fs_ops_sync: Number of synchronous operations that were sent to remote system for this
file system.
• gpfs_afm_fs_shortest_time: Shortest time in seconds that a pending operation waited in the
gateway queue before it is sent to remote system for this file system.
• gpfs_afm_fs_tot_read_time: Total time in seconds to run read operations from the remote system
for this file system.
• gpfs_afm_fs_tot_write_time: Total time in seconds to run write operations to the remote system
for this file system.
• gpfs_afm_fs_used_q_memory: Used memory in bytes by the messages queued for this file system.
GPFSAFMFSET
• gpfs_afm_fset_avg_time: Average time in seconds that a pending operation waited in the gateway
queue before it is sent to remote system for this fileset.
• gpfs_afm_fset_bytes_pending: Total number of bytes pending, which are not yet written to the
remote system for this fileset.
• gpfs_afm_fset_bytes_read: Total number of bytes read from remote system as a result of cache
miss for this fileset.
• gpfs_afm_fset_bytes_written: Total number of bytes written to the remote system as a result of
cache updates for this fileset.
• gpfs_afm_fset_conn_broken: Total number of times the connection to the remote system was
broken for this fileset.
• gpfs_afm_fset_conn_esta: Total number of times a connection was established with the remote
system for this fileset.
• gpfs_afm_fset_fset_expired: Total number of times the fileset was marked expired due to a
disconnection with remote system and expiry of the configured timeout for this fileset.
• gpfs_afm_fset_longest_time: Longest time in seconds that a pending operation waited in the
gateway queue before it is sent to remote system for this fileset.
• gpfs_afm_fset_num_queued_msgs: Number of messages that are currently queued for this fileset.
• gpfs_afm_fset_ops_expired: Number of operations that were sent to remote system because they
were expired after waiting until the configured asynchronous timeout in the gateway queue for this
fileset.
• gpfs_afm_fset_ops_forced: Number of operations that were sent to remote system because they
were forced out of the gateway queue before the configured asynchronous timeout. The timeout might
be due to a dependent operation for this fileset.
• gpfs_afm_fset_ops_revoked: Number of operations that were sent to the remote system because a conflicting token that was acquired from another GPFS node resulted in a revoke for this fileset.
• gpfs_afm_fset_ops_sent: Total number of operations that are sent over the communication
protocol to the remote system for this fileset.
• gpfs_afm_fset_ops_sync: Number of synchronous operations that were sent to remote system for
this fileset.
• gpfs_afm_fset_shortest_time: Shortest time in seconds that a pending operation waited in the
gateway queue before it is sent to remote system for this fileset.

• gpfs_afm_fset_tot_read_time: Total time in seconds to run read operations from the remote
system for this fileset.
• gpfs_afm_fset_tot_write_time: Total time in seconds to run write operations to the remote
system for this fileset.
• gpfs_afm_fset_used_q_memory: Used memory in bytes by the messages queued for this fileset.
Note: GPFSAFM, GPFSAFMFS, and GPFSAFMFSET have other metrics, which indicate the statistics on
the state of remote file system operations and appear in the following format:
• For GPFSAFM: gpfs_afm_operation_state
• For GPFSAFMFS: gpfs_afm_fs_operation_state
• For GPFSAFMFSET: gpfs_afm_fset_operation_state
The operation can be one of the following values:
• lookup
• getattr
• readdir
• readlink
• create
• mkdir
• mknod
• remove
• rmdir
• rename
• chmod
• trunc
• stime
• link
• symlink
• setattr
• setxattr
• open
• close
• read
• readsplit
• writesplit
• write
Each of these options can in turn have one of the following five states:
• queued
• inflight
• complete
• errors
• filter
For example, metrics like gpfs_afm_write_filter, gpfs_afm_fs_create_queued,
gpfs_afm_fset_rmdir_inflight are also available.
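For example, you can retrieve any of these per-operation state metrics with the mmperfmon query
command. The following invocation is only an illustration; the metric names follow the format that is
described above, and the time range is a placeholder that you replace with an interval that is relevant to
your environment:

mmperfmon query gpfs_afm_fs_write_queued,gpfs_afm_fs_write_inflight 2024-01-15-10:00:00 2024-01-15-10:05:00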

Protocol metrics
The following section lists all the protocol metrics for IBM Storage Scale.

NFS metrics
The following section lists all the NFS metrics.

NFSIO
• nfs_read_req: Number of bytes that are requested for reading.
• nfs_write_req: Number of bytes that are requested for writing.
• nfs_read: Number of bytes that are transferred for reading.
• nfs_write: Number of bytes that are transferred for writing.
• nfs_read_ops: Number of total read operations.
• nfs_write_ops: Number of total write operations.
• nfs_read_err: Number of erroneous read operations.
• nfs_write_err: Number of erroneous write operations.
• nfs_read_lat: Time that is used by read operations (in nanoseconds).
• nfs_write_lat: Time that is used by write operations (in nanoseconds).
• nfs_read_queue: Time that is spent in the RPC waiting queue.
• nfs_write_queue: Time that is spent in the RPC waiting queue.

Computed metrics
The following metrics are computed for NFS and can be used only with the mmperfmon query
command.
• nfs_total_ops: nfs_read_ops + nfs_write_ops
• nfsIOlatencyRead: (nfs_read_lat + nfs_read_queue) / nfs_read_ops
• nfsIOlatencyWrite: (nfs_write_lat + nfs_write_queue) / nfs_write_ops
• nfsReadOpThroughput: nfs_read/nfs_read_ops
• nfsWriteOpThroughput: nfs_write/nfs_write_ops
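For example, a computed metric can be queried over a time range in the same way as a named query.
The following invocation is only an illustration with placeholder timestamps:

mmperfmon query nfs_total_ops 2024-01-15-10:00:00 2024-01-15-10:05:00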

Object metrics
The following section lists all the object metrics:
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.

SwiftAccount
• account_auditor_time: Timing the data for individual account database audits.
• account_DEL_err_time: Timing the data for each DELETE request, which results in an error like bad
request, not mounted, or missing timestamp.
• account_DEL_time: Timing the data for each DELETE request, which does not result in an error.
• account_GET_err_time: Timing the data for each GET request, which results in an error like bad
request, not mounted, bad delimiter, account listing limit too high, or bad accept header.
• account_GET_time: Timing the data for each GET request, which does not result in an error.
• account_HEAD_err_time: Timing the data for each HEAD request, which results in an error like bad
request or not mounted.
• account_HEAD_time: Timing the data for each HEAD request, which does not result in an error.
• account_POST_err_time: Timing the data for each POST request, which results in an error like bad
request, bad or missing timestamp, or not mounted.
• account_POST_time: Timing the data for each POST request, which does not result in an error.
• account_PUT_err_time: Timing the data for each PUT request, which results in an error like bad
request, not mounted, conflict, or recently deleted.
• account_PUT_time: Timing the data for each PUT request, which does not result in an error.
• account_reaper_time: Timing the data for each reap_account() call.
• account_REPLICATE_err_time: Timing the data for each REPLICATE request, which results in an
error like bad request or not mounted.
• account_REPLICATE_time: Timing the data for each REPLICATE request, which does not result in an
error.
• account_replicator_time: Timing the data for each database replication attempt, which does not
result in a failure.

SwiftContainer
• container_auditor_time: Timing the data for each container audit.
• container_DEL_err_time: Timing the data for DELETE request errors like bad request, not
mounted, missing timestamp, or conflict.
• container_DEL_time: Timing the data for each DELETE request, which does not result in an error.
• container_GET_err_time: Timing the data for GET request errors like bad request, not mounted,
parameters not utf8, or bad accept header.
• container_GET_time: Timing data for each GET request, which does not result in an error.
• container_HEAD_err_time: Timing the data for HEAD request errors like bad request or not
mounted.
• container_HEAD_time: Timing the data for each HEAD request, which does not result in an error.
• container_POST_err_time: Timing the data for POST request errors like bad request, bad x-
container-sync-to, or not mounted.
• container_POST_time: Timing the data for each POST request, which does not result in an error.
• container_PUT_err_time: Timing the data for PUT request errors like bad request, missing
timestamp, not mounted, or conflict.
• container_PUT_time: Timing the data for each PUT request, which does not result in an error.
• container_REPLICATE_err_time: Timing the data for REPLICATE request errors like bad request
or not mounted.
• container_REPLICATE_time: Timing the data for each REPLICATE request, which does not result in
an error.

• container_replicator_time: Timing the data for each database replication attempt, which does
not result in a failure.
• container_sync_deletes_time: Timing the data for each container database row synchronization
through deletion.
• container_sync_puts_time: Timing the data for each container database row synchronization by
using the PUT process.
• container_updater_time: Timing the data for processing a container that includes only timing for
containers, which is needed to update their accounts.

SwiftObject
• object_auditor_time: Timing the data for each object audit (does not include any rate-
limiting sleep time for max_files_per_second, but does include rate-limiting sleep time for
max_bytes_per_second).
• object_DEL_err_time: Timing the data for DELETE request errors like bad request, missing
timestamp, not mounted, or precondition that failed. Includes requests, which did not find or match
the object.
• object_DEL_time: Timing the data for each DELETE request, which does not result in an error.
• object_expirer_time: Timing the data for each object expiration attempt that includes ones, which
result in an error.
• object_GET_err_time: Timing the data for GET request errors like bad request, not mounted, header
timestamps before the epoch, or precondition failed. File errors, which result in a quarantine, are not
counted here.
• object_GET_time: Timing the data for each GET request, which did not result in an error. Includes
requests, which did not find the object, such as disk errors that result in file quarantine.
• object_HEAD_err_time: Timing the data for HEAD request errors like bad request or not mounted.
• object_HEAD_time: Timing the data for each HEAD request, which did not result in an error. Includes
requests, which did not find the object, such as disk errors that result in file quarantine.
• object_POST_err_time: Timing the data for POST request errors like bad request, missing
timestamp, delete-at in past, or not mounted.
• object_POST_time: Timing the data for each POST request, which did not result in an error.
• object_PUT_err_time: Timing the data for PUT request errors like bad request, not mounted,
missing timestamp, object creation constraint violation, or delete-at in past.
• object_PUT_time: Timing the data for each PUT request, which did not result in an error.
• object_REPLICATE_err_time: Timing the data for REPLICATE request errors like bad request or not
mounted.
• object_REPLICATE_time: Timing the data for each REPLICATE request, which did not result in an
error.
• object_replicator_partition_delete_time: Timing the data for partitions that are replicated
to another node because they do not belong to this node. This metric is not tracked per device.
• object_replicator_partition_update_time: Timing the data for partitions replicated that also
belong on this node. This metric is not tracked per device.
• object_updater_time: Timing the data for object sweeps to flush async_pending container updates.
It does not include object sweeps that did not find an existing async_pending storage directory.

SwiftProxy
• proxy_account_GET_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests 200, which is a standard response for successful HTTP requests.
• proxy_account_GET_time: Timing the data for GET request, start to finish, 200, which is a standard
response for successful HTTP requests.

• proxy_account_HEAD_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests, 204, wherein request is processed but no content is returned.
• proxy_account_HEAD_time: Timing the data for HEAD request, start to finish, 204, wherein request
is processed but no content is returned.
• proxy_account_latency: Timing the data up to completion of sending the response headers, 200,
which is a standard response for successful HTTP requests.
• proxy_container_DEL_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests, 204, wherein request is processed but no content is returned.
• proxy_container_DEL_time: Timing the data for DELETE request, start to finish, 204, wherein
request is processed but no content is returned.
• proxy_container_GET_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests, 200, which is a standard response for successful HTTP requests.
• proxy_container_GET_time: Timing the data for GET request, start to finish, 200, which is a
standard response for successful HTTP requests.
• proxy_container_HEAD_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests, 204, wherein request is processed but no content is returned.
• proxy_container_HEAD_time: Timing the data for HEAD request, start to finish, 204, wherein
request is processed but no content is returned.
• proxy_container_latency: Timing the data up to completion of sending the response headers, 200,
which is a standard response for successful HTTP requests.
• proxy_container_PUT_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests, 201, wherein request is processed and a new resource is created.
• proxy_container_PUT_time: Timing the data for each PUT request not resulting in an error, 201,
wherein request is processed and a new resource is created.
• proxy_object_DEL_bytes: The sum of bytes that are transferred in (from clients) and out (to clients)
for requests, 204, wherein request is processed but no content is returned.
• proxy_object_DEL_time: Timing the data for DELETE request, start to finish, 204, wherein request
is processed but no content is returned.
• proxy_object_GET_bytes: The sum of bytes that are transferred in (from clients) and out (to clients)
for requests, 200, which is a standard response for successful HTTP requests.
• proxy_object_GET_time: Timing the data for GET request, start to finish, 200, which is a standard
response for successful HTTP requests.
• proxy_object_HEAD_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests, 204, wherein request is processed but no content is returned.
• proxy_object_HEAD_time: Timing the data for HEAD request, start to finish, 204, wherein request is
processed but no content is returned.
• proxy_object_latency: Timing the data up to completion of sending the response headers, 200,
which is a standard response for successful HTTP requests.
• proxy_object_PUT_bytes: The sum of bytes that are transferred in (from clients) and out (to clients)
for requests, 201, wherein request is processed and a new resource is created.
• proxy_object_PUT_time: Timing the data for each PUT request not resulting in an error, 201,
wherein request is processed and a new resource is created.
Note: For information about computed metrics for object, see “Performance monitoring for object
metrics” on page 150.

SMB metrics
The following section lists all the SMB metrics.

SMBGlobalStats
• connect count: Number of connections since start of the parent smbd process.

• disconnect count: Number of connections that are closed since start of the parent smbd process.
• idle: Describes the idling behavior of smbd.
– count: Number of times the smbd processes are waiting for events in epoll.
– time: The time that the smbd process spent waiting for events in epoll.
• cpu_user time: The user time that is determined by the get_rusage system call in seconds.
• cpu_system time: The system time that is determined by the get_rusage system call in seconds.
• request count: Number of SMB requests since start of the parent smbd process.
• push_sec_ctx: SMBDs switch between the user and the root security context. The push action places
the current context onto a stack.
– count: Number of times the current security context is pushed onto the stack.
– time: The time that it takes to push the current security context onto the stack, which includes all
syscalls that are needed to save the current context on the stack.
• pop_sec_ctx: Gets the last security context from the stack and restores it.
– count: Number of times the current security context is restored from the stack.
– time: The time that it takes to restore the security context from the stack, which includes all
syscalls that are needed to restore the security context from the stack.
• set_sec_ctx:
– count: Number of times the security context is set for user.
– time: The time that it takes to set the security context for user.
• set_root_sec_ctx:
– count: Number of times the security context is set for user.
– time: The time that it takes to set the security context for user.

SMB2 metrics
The SMB2 metrics are available for all SMB2 requests, such as create, read, write, and find.
• op_count: Number of times the corresponding SMB request is called.
• op_idle
– for notify: Time that is taken between a notification request and sending a corresponding notification.
– for oplock breaks: Time that is waiting until an oplock is broken.
– For all others, the value is always zero.
• op_inbytes: Number of bytes that are received for the corresponding request that includes protocol
headers.
• op_outbytes: Number of bytes that are sent for the corresponding request that includes protocol
headers.
• op_time: The total amount of time that is spent for all corresponding SMB2 requests.

CTDB metrics
The following section lists all the CTDB metrics:
• CTDB version: Version of the CTDB protocol used by the node.
• Current time of statistics: Time when the statistics are generated. This is useful when collecting
statistics output periodically for post-processing.
• Statistics collected since: Time when CTDB was started or the last time statistics was reset. The
output shows the duration and the timestamp.
• num_clients: Number of processes currently connected to CTDB's UNIX socket. This includes recovery
daemon, CTDB tool and SMB processes (smbd, winbindd).

• frozen: 1 if the databases are currently frozen, 0 if otherwise.
• recovering: 1 if recovery is active, 0 if otherwise.
• num_recoveries: Number of recoveries since the start of CTDB or since the last statistics reset.
• client_packets_sent: Number of packets sent to client processes via UNIX domain socket.
• client_packets_recv: Number of packets received from client processes via UNIX domain socket.
• node_packets_sent: Number of packets sent to the other nodes in the cluster via TCP.
• node_packets_recv: Number of packets received from the other nodes in the cluster via TCP.
• keepalive_packets_sent: Number of keepalive messages sent to other nodes. CTDB periodically sends
keepalive messages to other nodes. For more information, see the KeepAliveInterval tunable in CTDB-
tunables(7) on the CTDB documentation website.
• keepalive_packets_recv: Number of keepalive messages received from other nodes.
• node: This section lists various types of messages processed which originated from other nodes via TCP.
– req_call: Number of REQ_CALL messages from the other nodes.
– reply_call: Number of REPLY_CALL messages from the other nodes.
– req_dmaster: Number of REQ_DMASTER messages from the other nodes.
– reply_dmaster: Number of REPLY_DMASTER messages from the other nodes.
– reply_error: Number of REPLY_ERROR messages from the other nodes.
– req_message: Number of REQ_MESSAGE messages from the other nodes.
– req_control: Number of REQ_CONTROL messages from the other nodes.
– reply_control: Number of REPLY_CONTROL messages from the other nodes.
• client: This section lists various types of messages processed which originated from clients via UNIX
domain socket.
– req_call: Number of REQ_CALL messages from the clients.
– req_message: Number of REQ_MESSAGE messages from the clients.
– req_control: Number of REQ_CONTROL messages from the clients.
• timeouts: This section lists timeouts occurred when sending various messages.
– call: Number of timeouts for REQ_CALL messages.
– control: Number of timeouts for REQ_CONTROL messages.
– traverse: Number of timeouts for database traverse operations.
• locks: This section lists locking statistics.
– num_calls: Number of completed lock calls. This includes database locks and record locks.
– num_current: Number of scheduled lock calls. This includes database locks and record locks.
– num_pending: Number of queued lock calls. This includes database locks and record locks.
– num_failed: Number of failed lock calls. This includes database locks and record locks.
• total_calls: Number of req_call messages processed from clients. This number should be same as
client --> req_call.
• pending_calls: Number of req_call messages which are currently being processed. This number
indicates the number of record migrations in flight.
• childwrite_calls: Number of record update calls. Record update calls are used to update a record under
a transaction.
• pending_childwrite_calls: Number of record update calls currently active.
• memory_used: The amount of memory in bytes currently used by CTDB using talloc. This includes all
the memory used for CTDB's internal data structures. This does not include the memory mapped TDB
databases.

• max_hop_count: The maximum number of hops required for a record migration request to obtain the
record. High numbers indicate record contention.
• total_ro_delegations: Number of read-only delegations created.
• total_ro_revokes: Number of read-only delegations that were revoked. The difference between
total_ro_revokes and total_ro_delegations gives the number of currently active read-only delegations.
• hop_count_buckets: Distribution of migration requests based on hop counts values.
• lock_buckets: Distribution of record lock requests based on time required to obtain locks. Buckets are
< 1ms, < 10ms, < 100ms, < 1s, < 2s, < 4s, < 8s, < 16s, < 32s, < 64s, > 64s.
• locks_latency: The minimum, the average, and the maximum time (in seconds) required to obtain
record locks.
• reclock_ctdbd: The minimum, the average, and the maximum time (in seconds) required to check if
recovery lock is still held by recovery daemon when recovery mode is changed. This check is done in
ctdb daemon.
• reclock_recd: The minimum, the average, and the maximum time (in seconds) required to check if
recovery lock is still held by recovery daemon during recovery. This check is done in recovery daemon.
• call_latency: The minimum, the average, and the maximum time (in seconds) required to process a
REQ_CALL message from client. This includes the time required to migrate a record from remote node, if
the record is not available on the local node.
• childwrite_latency: The minimum, the average, and the maximum time (in seconds) required to update
records under a transaction.

Cross protocol metrics


The following section lists all the cross protocol metrics:
• nfs_iorate_read_perc: nfs_read_ops/(op_count+nfs_read_ops)
• nfs_iorate_read_perc_exports: 1.0*nfs_read_ops/(op_count+nfs_read_ops)
• nfs_iorate_write_perc: nfs_write_ops/(write|op_count+nfs_write_ops)
• nfs_iorate_write_perc_exports: 1.0*nfs_write_ops/(op_count+nfs_write_ops)
• nfs_read_throughput_perc: nfs_read/(read|op_outbytes+nfs_read)
• nfs_write_throughput_perc: nfs_write/(write|op_outbytes+nfs_write)
• smb_iorate_read_perc: op_count/(op_count+nfs_read_ops)
• smb_iorate_write_perc: op_count/(op_count+nfs_write_ops)
• smb_latency_read: read|op_time/read|op_count
• smb_latency_write: write|op_time/write|op_count
• smb_read_throughput_perc: read|op_outbytes/(read|op_outbytes+nfs_read)
• smb_total_cnt: write|op_count+close|op_count
• smb_tp: op_inbytes+op_outbytes
• smb_write_throughput_perc: write|op_outbytes/(write|op_outbytes+nfs_write)
• total_read_throughput: nfs_read+read|op_outbytes
• total_write_throughput: nfs_write+write|op_inbytes
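As a purely hypothetical illustration of how these formulas combine the NFS and SMB counters, assume
that in a given interval nfs_read is 600 MB and the SMB read|op_outbytes value is 400 MB. The sample
values are invented for the example only:

total_read_throughput    = 600 + 400 = 1000 MB
nfs_read_throughput_perc = 600 / (400 + 600) = 0.6
smb_read_throughput_perc = 400 / (400 + 600) = 0.4

That is, NFS contributes 60 percent and SMB contributes 40 percent of the combined read throughput in
this interval.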

Cloud services metrics


The following section lists all the metrics for Cloud services:

Cloud services
• mcs_total_bytes: Total number of bytes that are uploaded to or downloaded from the cloud storage
tier.
• mcs_total_failed_operations: The total number of failed PUT or GET operations.
• mcs_total_failed_requests: Total number of failed migration, recall, or remove requests.

• mcs_total_failed_requests_time: The total time (in milliseconds) that is spent in failed
migration, recall, or remove requests.
• mcs_total_operation_errors: The total number of erroneous PUT or GET operations that are
based on the operation, which is specified in the mcs_operation key.
• mcs_total_operation_errors_time: The total time taken (in milliseconds) for erroneous PUT or
GET operations.
• mcs_total_operation_time: The total time that is taken (in milliseconds) for PUT or GET operations
for both data and metadata.
• mcs_total_parts: The total number of parts that are transferred to the cloud provider in case of
multipart upload.
• mcs_total_persisted_bytes: The total number of bytes that are transferred successfully and
persists on the cloud provider. This information is used for both migrate and recall operations.
• mcs_total_persisted_parts: The total number of transferred parts that persisted successfully on
the cloud provider in case of multipart upload.
• mcs_total_persisted_time: For PUT operation, the total time taken (in milliseconds) for
transferring and persisting the bytes on the cloud provider. For GET operation, the total time taken
(in milliseconds) for downloading and persisting the bytes on the file system.
• mcs_total_request_time: Time (in seconds) that is taken for all migration, recall, or remove
requests.
• mcs_total_requests: Total number of migration, recall, or remove requests.
• mcs_total_retried_operations: The total number of retry PUT operations, which is used for both
migrate and recall operations.
• mcs_total_successful_operations: The total number of successful PUT or GET operations for
both data and metadata.
• tct_fs_csap_used: Total number of bytes that are used by a file system for a specific CSAP.
• tct_fs_total_blob_time: The total blob time on the file system.
• tct_fs_total_failed_operations: The total number of failed PUT or GET operations with respect
to a file system.
• tct_fs_total_failed_requests: Total number of failed migration, recall, or remove requests with
respect to a file system.
• tct_fs_total_failed_requests_time: The total time (in milliseconds) that is spent in failed
migration, recall, or remove requests with respect to a file system.
• tct_fs_total_operation_errors: The total number of erroneous PUT or GET operations with
respect to a file system based on the operation specified in the mcs_operation key.
• tct_fs_total_operation_errors_time: The total time (in milliseconds) that is taken for
erroneous PUT or GET operations with respect to a file system.
• tct_fs_total_operation_time: The total time taken (in milliseconds) for PUT or GET operations
for both data and metadata with respect to a file system.
• tct_fs_total_parts: The total number of parts that are transferred to the cloud provider from a file
system in case of a multipart upload.
• tct_fs_total_persisted_bytes: The total number of transferred bytes from a file system that
successfully persisted on the cloud provider. This information is used for both migrate and recall
operations.
• tct_fs_total_persisted_parts: The total number of transferred parts (from a file system)
persisted successfully on the cloud provider in case of multipart upload.
• tct_fs_total_persisted_time: For PUT operation, the total time taken (in milliseconds) for
transferring and persisting the bytes on the cloud provider. For GET operation, the total time taken
(in milliseconds) for downloading and persisting the bytes on the file system.
• tct_fs_total_request_time: Time (in seconds) that is taken for all migration, recall, or remove
requests with respect to a file system.

• tct_fs_total_requests: Total number of migration, recall, or remove requests with respect to a file
system.
• tct_fs_total_retried_operations: The total number of retry PUT operations with respect to a
file system. This information is used for both migrate and recall operations.
• tct_fs_total_successful_operations: The total number of successful PUT or GET operations for
both data and metadata with respect to a file system.
• tct_fset_csap_used: Total number of bytes that are used by a fileset for a specific CSAP.
• tct_fset_total_blob_time: The total blob time on the fileset.
• tct_fset_total_bytes: Total number of bytes uploaded to or downloaded from the cloud storage
tier with respect to a fileset.
• tct_fset_total_failed_operations: The total number of failed PUT or GET operations with
respect to a fileset.
• tct_fset_total_failed_requests: Total number of failed migration, recall, or remove requests
with respect to a fileset.
• tct_fset_total_failed_requests_time: The total time (in milliseconds) that is spent in failed
migration, recall, or remove requests with respect to a fileset.
• tct_fset_total_operation_errors: The total number of erroneous PUT or GET operations with
respect to a fileset based on the operation specified in the mcs_operation key.
• tct_fset_total_operation_errors_time: The total time taken (in milliseconds) for erroneous
PUT or GET operations with respect to a fileset.
• tct_fset_total_operation_time: The total time taken (in milliseconds) for PUT or GET operations
for both data and metadata with respect to a fileset.
• tct_fset_total_parts: The total number of parts that are transferred to the cloud provider from a
fileset in case of a multipart upload.
• tct_fset_total_persisted_bytes: The total number of transferred bytes from a fileset that are
successfully persisted on the cloud provider. This information is used for both migrate and recall
operations.
• tct_fset_total_persisted_parts: The total number of transferred parts (from a fileset) persisted
successfully on the cloud provider in case of multipart upload.
• tct_fset_total_persisted_time: For PUT operation, the total time taken (in milliseconds) for
transferring and persisting the bytes on the cloud provider. For GET operation, the total time taken (in
milliseconds) for downloading and persisting the bytes on the fileset.
• tct_fset_total_request_time: Time (in seconds) taken for all migration, recall, or remove
requests with respect to a fileset.
• tct_fset_total_requests: Total number of migration, recall, or remove requests with respect to a
fileset.
• tct_fset_total_retried_operations: The total number of retry PUT operations with respect to a
fileset. This information is used for both migrate and recall operations.
• tct_fset_total_successful_operations: The total number of successful PUT or GET operations
for both data and metadata with respect to a fileset.

ESS metrics
The following section lists the GPFSFCM metrics for ESS and IBM Storage Scale:
• “GPFSFCM” on page 148

GPFSFCM
• gpfs_fcm_pdisk_capacity: The capacity that is calculated by GNR but not obtained directly from
the disk. Usually a little less than the size of the disk. In bytes.

• gpfs_fcm_pdisk_used_capacity: The used capacity that is the space that is allocated to vdisks by
GNR. It might not be written. In bytes.
• gpfs_fcm_pdisk_physical_capacity: The physical capacity of the pdisk. Obtained from FCM
drives directly. In bytes.
• gpfs_fcm_pdisk_used_physical_capacity: The used physical capacity of the pdisk. Obtained
from FCM drives directly. In bytes.
• gpfs_fcm_pdisk_logical_capacity: The logical capacity of the pdisk. This should be equal to the
size of the FCM drive's namespace if the whole drive is allocated to the namespace. Obtained from FCM
drives directly. In bytes.
• gpfs_fcm_pdisk_used_logical_capacity: The used logical capacity of the pdisk. Obtained from
FCM drives directly. In bytes.
• gpfs_fcm_pdisk_state: The space information state of the pdisk.
The gpfs_fcm_pdisk_state values are always 0 and these values do not reflect the real FCM pdisk
state. To see the current state, issue the mmpmon pdisk_s command.
• gpfs_fcm_da_capacity: The GNR calculated capacity of the DA. The sum of the capacity of all pdisks
in the DA. In MB.
• gpfs_fcm_da_used_capacity: The capacity that is allocated to vdisks by GNR. The sum of the used
capacity of all pdisks in the DA. In MB.
• gpfs_fcm_da_physical_capacity: The physical capacity of the DA. The sum of the physical
capacity of all pdisks in the DA. In MB.
• gpfs_fcm_da_used_physical_capacity: The used physical capacity of the DA. This is calculated
by using the used physical capacity of the worst used pdisk. In MB.
• gpfs_fcm_da_logical_capacity: The logical capacity of the DA. The sum of the logical capacity of
all pdisks in the DA. In MB.
• gpfs_fcm_da_used_logical_capacity: The used logical capacity of the DA. This is calculated by
using the used logical capacity of the worst used pdisk. In MB.
• gpfs_fcm_da_state: The space information state of the DA.
The gpfs_fcm_da_state values are always 0 and these values do not reflect the real FCM declustered
array state. To see the current state, issue the mmpmon da_s command.

Note:
The GPFSFCM sensor is set to disabled in the default configuration when you install the sensor for the first time.
To enable the sensor on systems with FCM3 devices, issue the following commands:

mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSFCM.cfg

mmperfmon config update GPFSFCM.period=60

To disable the sensor, issue the following command:

mmperfmon config update GPFSFCM.period=0

GPFSFabricHospital
The sensor provides the following three GPFSFabricHospital metrics:
• gpfs_fabhospital_totalIOCount: Number of I/O operations done on this path. This number
includes failing and successful I/O.
• gpfs_fabhospital_errorIOCount: Number of I/O errors encountered by the Linux block layer on
this path.
• gpfs_fabhospital_deviceErrorIOCount: Number of I/O device errors found by the disk hospital
during diagnosing I/O errors.

Note: The sensor is used for internal ESS hospital error reporting and RAS health event generation, and
it is subject to change. It is enabled along with the ESS hospital setup, so there is no need for the user
to enable the sensor manually. If the sensor is expected to be enabled but is missing from the perfmon
configuration, issue the following command:

mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSFabricHospital.cfg

Performance monitoring for object metrics


The mmperfmon command can be used to obtain object metrics information. Make sure that pmswift is
configured and the object sensors are added to measure the object metrics.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
The mmperfmon command is enhanced to calculate and print the sum, average, count, minimum, and
maximum of metric data for object queries. The following command can be used to display metric data for
object queries:

mmperfmon query NamedQuery [StartTime EndTime]

Currently, the calculation of the sum, average, count, minimum, and maximum is only applicable for the
following object metrics:
• account_HEAD_time
• account_GET_time
• account_PUT_time
• account_POST_time
• account_DEL_time
• container_HEAD_time
• container_GET_time
• container_PUT_time
• container_POST_time
• container_DEL_time
• object_HEAD_time
• object_GET_time
• object_PUT_time
• object_POST_time
• object_DEL_time
• proxy_account_latency

• proxy_container_latency
• proxy_object_latency
• proxy_account_GET_time
• proxy_account_GET_bytes
• proxy_account_HEAD_time
• proxy_account_HEAD_bytes
• proxy_account_POST_time
• proxy_account_POST_bytes
• proxy_container_GET_time
• proxy_container_GET_bytes
• proxy_container_HEAD_time
• proxy_container_HEAD_bytes
• proxy_container_POST_time
• proxy_container_POST_bytes
• proxy_container_PUT_time
• proxy_container_PUT_bytes
• proxy_container_DEL_time
• proxy_container_DEL_bytes
• proxy_object_GET_time
• proxy_object_GET_bytes
• proxy_object_HEAD_time
• proxy_object_HEAD_bytes
• proxy_object_POST_time
• proxy_object_POST_bytes
• proxy_object_PUT_time
• proxy_object_PUT_bytes
• proxy_object_DEL_time
• proxy_object_DEL_bytes

Use the following command to run an objObj query for object metrics. The command calculates and
prints the sum, average, count, minimum, and maximum of metric data for the objObj named query for
the metrics that are listed.
mmperfmon query objObj 2016-09-28-09:56:39 2016-09-28-09:56:43

1: cluster1.ibm.com|SwiftObject|object_auditor_time
2: cluster1.ibm.com|SwiftObject|object_expirer_time
3: cluster1.ibm.com|SwiftObject|object_replication_partition_delete_time
4: cluster1.ibm.com|SwiftObject|object_replication_partition_update_time
5: cluster1.ibm.com|SwiftObject|object_DEL_time
6: cluster1.ibm.com|SwiftObject|object_DEL_err_time
7: cluster1.ibm.com|SwiftObject|object_GET_time
8: cluster1.ibm.com|SwiftObject|object_GET_err_time
9: cluster1.ibm.com|SwiftObject|object_HEAD_time
10: cluster1.ibm.com|SwiftObject|object_HEAD_err_time
11: cluster1.ibm.com|SwiftObject|object_POST_time
12: cluster1.ibm.com|SwiftObject|object_POST_err_time
13: cluster1.ibm.com|SwiftObject|object_PUT_time
14: cluster1.ibm.com|SwiftObject|object_PUT_err_time
15: cluster1.ibm.com|SwiftObject|object_REPLICATE_time
16: cluster1.ibm.com|SwiftObject|object_REPLICATE_err_time
17: cluster1.ibm.com|SwiftObject|object_updater_time

Row object_auditor_time object_expirer_time object_replication_partition_delete_time


object_replication_partition_update_time object_DEL_time object_DEL_err_time
object_GET_time object_GET_err_time object_HEAD_time object_HEAD_err_time object_POST_time
object_POST_err_time object_PUT_time object_PUT_err_time object_REPLICATE_time
object_REPLICATE_err_time object_updater_time
1 2016-09-28 09:56:39 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.855923 0.000000 0.000000 0.000000 45.337915 0.000000 0.000000 0.000000 0.000000
2 2016-09-28 09:56:40 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 2016-09-28 09:56:41 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.931925 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 2016-09-28 09:56:42 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.855923 0.000000 0.000000 0.000000 516.280890 0.000000 0.000000 0.000000 0.000000

object_DEL_total_time = 0.0 object_PUT_total_time = 561.618805


object_GET_total_time = 0.0 object_POST_total_time = 0.0
object_HEAD_total_time = 1.786948 object_PUT_max_time = 516.28089
object_POST_max_time = 0.0 object_GET_max_time = 0.0
object_HEAD_max_time = 0.931025 object_DEL_max_time = 0.0
object_GET_avg_time = 0.0 object_DEL_avg_time = 0.0
object_PUT_avg_time = 280.809402 object_POST_avg_time = 0.0
object_HEAD_avg_time = 0.893474 object_DEL_time_count = 0.0
object_POST_time_count = 0 object_PUT_time_count = 2
object_HEAD_time_count = 2 object_GET_time_count = 0
object_DEL_min_time = 0.0 object_PUT_min_time = 45.337915
object_GET_min_time = 0.0 object_POST_min_time = 0.0
object_HEAD_min_time = 0.855923

Enabling protocol metrics


The type of information that is collected for the NFS, SMB, and Object protocols is configurable. This
section describes the location of the configuration data for these protocols.
Configuration information for SMB and NFS in the ZIMonSensors.cfg file references the sensor
definition files in the /opt/IBM/zimon folder, as shown in the following examples. A command that you
can use to review the active sensor configuration follows the examples.
• The CTDBDBStats.cfg file is referenced in:

{ name = "CTDBDBStats"
period = 1
type = "Generic"
},

• The CTDBStats.cfg file is referenced in:

{ name = "CTDBStats"
period = 1
type = "Generic"
},

• The NFSIO.cfg file is referenced in:

{
# NFS Ganesha statistics
name = "NFSIO"
period = 1
type = "Generic"
},

• The SMBGlobalStats.cfg file is referenced in:

{ name = "SMBGlobalStats"
period = 1
type = "Generic"
},

• The SMBStats.cfg file is referenced in:

{ name = "SMBStats"
period = 1
type = "Generic"
},
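To review the sensor configuration that is currently in effect, including which sensors are defined and the
period at which they run, you can, for example, display it with the following command:

mmperfmon config show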

At the time of installation, the object metrics proxy is configured to start by default on each Object
protocol node.
The object metrics proxy server, pmswiftd, is controlled by the corresponding service script, also called
pmswiftd, which is located at /etc/rc.d/init.d/pmswiftd.service. You can start and stop the pmswiftd
service by using the systemctl start pmswiftd and systemctl stop pmswiftd commands,
respectively. You can also view the status of the pmswiftd service by using the systemctl status
pmswiftd command.
After a system restart, the object metrics proxy server restarts automatically. In case of a failover, the
server also starts automatically. If for some reason it does not, the server must be started manually by
using the systemctl start pmswiftd command.
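For example, the following sequence checks the status of the object metrics proxy server and then
restarts it by using the commands that are mentioned above (output is omitted here):

systemctl status pmswiftd
systemctl stop pmswiftd
systemctl start pmswiftd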

Configuring performance monitoring API keys


To support enhanced security requirements and IBM Storage Scale in container-based deployment
models, the performance monitoring tool requires the use of API keys to access the performance data.
In such cases, a system-defined API key, scale_default, is used by the following:
• CLI commands like mmperfmon query, mmhealth thresholds, gpfs.snap and mmdumpperfdata
• Backend processes like mmsysmon and call home
• GUI
Note: In addition to scale_default, the GUI also has another system-defined key called scale_gui.
The API keys are stored in a secured file _perfmon.keys in the Clustered Configuration Repository
(CCR). Additional API keys can be generated as required. For example, the IBM Storage Scale
performance monitoring bridge for Grafana uses API keys to access the performance data.
The system-defined API key scale_default is automatically generated by the mmsysmon health
monitoring daemon if it does not exist. The mmperfmon config update --apikey command
can be used to change the key if desired. However, you cannot delete the system-defined API key
scale_default. For more information on maintaining API keys, see the mmperfmon command in the
IBM Storage Scale: Command and Programming Reference Guide.

Starting and stopping the performance monitoring tool


You can start and stop the performance monitoring tool by using the following commands.

Starting the performance monitoring tool


Use the systemctl start pmsensors command to start performance monitoring on a node.
Use the systemctl start pmcollector command on a node that has the collector.

Stopping the performance monitoring tool


Use the systemctl stop pmsensors command to stop the sensor service on all nodes where it is active.
Use the systemctl stop pmcollector command to stop the collector service on the nodes where the
GUI is installed.

Note:
The systemctl commands work only for systems that use systemd scripts. On systems that use sysv
initialization scripts, you must use the service pmsensors and service pmcollector commands
instead of the systemctl commands.
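For example, on a node that uses sysv initialization scripts, the equivalent start and stop operations might
look like the following commands; this is only an illustration based on the service commands that are
mentioned in the note:

service pmsensors start
service pmsensors stop
service pmcollector start
service pmcollector stop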

Restarting the performance monitoring tool


If the pmsensor or pmcollector package is upgraded, the corresponding daemon is stopped and needs
to be started again.
To start the sensor on a particular node, use the systemctl start pmsensors command. To start the
collector, use the systemctl start pmcollector command.
If the ZIMonCollector.cfg file is changed, the pmcollector service on that node needs to be restarted
with the systemctl restart pmcollector command.
With manual configuration, if the ZIMonSensors.cfg file is changed, the pmsensors service on that
node needs to be restarted by using the systemctl restart pmsensors command. No action is
necessary for IBM Storage Scale managed sensor configuration.
To restart the collector, use the systemctl restart pmcollector command.
Note:
This command works only for systems that use systemd scripts. On systems that use sysv initialization
scripts, you must use the service pmsensors and service pmcollector command instead of the
systemctl command.
For information on restarting the sensors and collectors for Transparent cloud tiering, see Integrating
Transparent Cloud Tiering metrics with performance monitoring tool in IBM Storage Scale: Administration
Guide.

Configuring the metrics to collect performance data


For performance reasons, the performance monitoring tool by default does not collect all the available
metrics. You can add other metrics to focus on particular performance problems.
For the available metrics, see “List of performance metrics” on page 117.
For information on sensor configuration, see “Configuring the sensor” on page 107.

Removing non-detectable resource identifiers from the performance


monitoring tool database
Every metric value stored in the database is associated with a metric name and a resource identifier (a
single entity). However, the performance monitoring tool does not perform a detectability check on stored
entities, so the identifiers of deleted and renamed resources remain in the database indefinitely. Missing
identifiers that do not return any value over the retention period of 14 days can be reviewed and deleted
by using the mmperfmon command.
To avoid deleting temporarily missing identifiers, all entities that are not detectable are retained for a
period of 14 days. If the retention period has expired and no values have been returned for the
undetectable entity for over 14 days, it is listed in the expiredKeys list and can be deleted.
Follow the given steps to clean up the performance monitoring tool database:
1. To view expired keys, issue the following command:

mmperfmon query --list=expiredKeys

The system displays output similar to this:

Found expired keys:


test_nodename|GPFSFilesystem|gpfsgui-cluster-2.novalocal|fs1
test_nodename|GPFSFilesystem|gpfsgui-cluster-2.novalocal|fs2
test_nodename|GPFSFilesystemAPI|gpfsgui-cluster-2.novalocal|fs2
test_nodename|GPFSFilesystemAPI|gpfsgui-cluster-2.novalocal|fs1
test_nodename|GPFSFilesystem|gpfsgui-cluster-2.novalocal|gpfs0
test_nodename|DiskFree|/mnt/gpfs0
test_nodename|Netstat
test_nodename|GPFSFilesystem|gpfsgui-cluster-2.novalocal|objfs
test_nodename|GPFSVFS
test_nodename|GPFSNode
test_nodename|GPFSFilesystemAPI|gpfsgui-cluster-2.novalocal|gpfs0
test_nodename|GPFSFilesystemAPI|gpfsgui-cluster-2.novalocal|objfs
test_nodename|DiskFree|/gpfs/fs2
test_nodename|DiskFree|/gpfs/fs1
test_nodename|GPFSRPCS
test_nodename|CPU
test_nodename|GPFSNodeAPI
test_nodename|Load
test_nodename|DiskFree|/mnt/objfs
test_nodename|Memory

2. To delete expired key, issue the following command:

mmperfmon delete --key 'test_nodename|DiskFree|/mnt/gpfs0'

The system displays output similar to this:

Check expired keys completed. Successfully 3 keys deleted.

The following table shows the resource types and the responsible sensors that are included in the
detectability validation procedure.

Table 31. Resource types and the sensors responsible for them

Resource type             Responsible sensors
Filesets data             GPFSFileset, GPFSFilesetQuota
Filesystem inodes data    GPFSInodeCap
Pools data                GPFSPool, GPFSPoolCap
Filesystem mounts data    DiskFree, GPFSFilesystem, GPFSFilesystemAPI
Disks and NSD data        GPFSDiskCap, GPFSNSDDisk
Nodes data                CPU, GPFSNode, GPFSNodeAPI, GPFSRPCS, GPFSVFS, Load, Memory, Netstat,
                          SwiftAccount, SwiftContainer, SwiftObject, SwiftProxy

Note: The identifiers from Network, Protocols, TCT, and CTDB sensor data are not included in the
detectability validation and cleanup procedure.

Measurements
A measurement is a value calculated by using more than one metric in a pre-defined formula.
Table 32. Measurements

For each measurement, the computation, group key, and filter operation are listed where applicable.

Fileset_inode (1)
  Description: Fileset Inode Capacity Utilization
  Computation: gpfs_fset_allocInodes:sum gpfs_fset_freeInodes:sum - gpfs_fset_maxInodes:sum / 100 *
  Group key: gpfs_cluster_name, gpfs_fs_name, gpfs_fset_name

DataPool_capUtil (1)
  Description: Data Pool Capacity Utilization
  Computation: gpfs_pool_total_dataKB:sum gpfs_pool_free_dataKB:sum - gpfs_pool_total_dataKB:sum / 100 *
  Group key: gpfs_cluster_name, gpfs_fs_name, gpfs_diskpool_name

MetaDataPool_capUtil (1)
  Description: MetaData Pool Capacity Utilization
  Computation: gpfs_pool_total_metaKB:sum gpfs_pool_free_metaKB:sum - gpfs_pool_total_metaKB:sum / 100 *
  Group key: gpfs_cluster_name, gpfs_fs_name, gpfs_diskpool_name

FsLatency_diskWaitRd
  Description: Average disk wait time per read operation on the IBM Storage Scale client
  Computation: gpfs_fs_tot_disk_wait_rd:sum gpfs_fs_read_ops:sum /
  Group key: gpfs_fs_name

SMBNodeLatency_read
  Description: Total amount of time spent for all types of SMB read requests
  Computation: op_time:avg op_count:avg /
  Group key: node
  Filter operation: read

SMBNodeLatency_write
  Description: Total amount of time spent for all types of SMB write requests
  Computation: op_time:avg op_count:avg /
  Group key: node
  Filter operation: write

NFSNodeLatency_read
  Description: Time taken for NFS read operations
  Computation: nfs_read_lat:sum nfs_read_ops:sum /
  Group key: node

NFSNodeLatency_write
  Description: Time taken for NFS write operations
  Computation: nfs_write_lat:sum nfs_write_ops:sum /
  Group key: node

MemoryAvailable_percent (2)
  Description: Estimated available memory percentage.
    - For nodes that have less than 40 GB of total memory allocation:
      (mem_memfree + mem_buffers + mem_cached) / mem_memtotal
    - For nodes that have 40 GB or more of total memory allocation:
      (mem_memfree + mem_buffers + mem_cached) / 40000000
  Computation: mem_memtotal: 40000000 < mem_memfree: mem_buffers: + mem_cached: + mem_memtotal: / 100 * * mem_memtotal: 40000000 >= mem_memfree: mem_buffers: + mem_cached: + 40000000 / 100 * * +
  Group key: node

DiskIoLatency_read
  Description: Average time in milliseconds spent for a read operation on the physical disk
  Computation: disk_read_time: disk_read_ios: /
  Group key: node, diskdev_name

DiskIoLatency_write
  Description: Average time in milliseconds spent for a write operation on the physical disk
  Computation: disk_write_time: disk_write_ios: /
  Group key: node, diskdev_name

AFMInQueueMemory_percent (3)
  Description: Estimated available AFM in-queue memory percentage
  Computation: gpfs_afm_used_q_memory: cfg.afmHardMemThreshold / 100 *
  Group key: node

nfs_total_ops (4)
  Computation: nfs_read_ops: nfs_write_ops: +

nfsIOlatencyRead
  Computation: nfs_read_lat: nfs_read_queue: + nfs_read_ops: /

nfsIOlatencyWrite
  Computation: nfs_write_lat: nfs_write_queue: + nfs_write_ops: /

nfsReadOpThroughput
  Computation: nfs_read: nfs_read_ops: /

nfsWriteOpThroughput
  Computation: nfs_write: nfs_write_ops: /

smb_latency_read
  Computation: op_time: op_count: /
  Group key: node
  Filter operation: read

smb_latency_write
  Computation: op_time: op_count: /
  Group key: node
  Filter operation: write

smb_total_cnt (5)
  Computation: op_count:sum
  Group key: node
  Filter operation: write|close

smb_tp
  Computation: op_inbytes: op_outbytes: +

total_read_throughput
  Computation: nfs_read: op_outbytes: +
  Filter operation: read

total_write_throughput
  Computation: nfs_write: op_inbytes: +
  Filter operation: write

nfs_read_throughput_perc
  Computation: nfs_read: op_outbytes: nfs_read: + /
  Filter operation: read

smb_read_throughput_perc
  Computation: op_outbytes: op_outbytes: nfs_read: + /
  Filter operation: read

nfs_write_throughput_perc
  Computation: nfs_write: op_outbytes: nfs_write: + /
  Filter operation: write

smb_write_throughput_perc
  Computation: op_outbytes: op_outbytes: nfs_write: + /
  Filter operation: write

nfs_iorate_read_perc
  Computation: nfs_read_ops: op_count: nfs_read_ops: + /
  Filter operation: read

nfs_iorate_read_perc_exports (6)
  Computation: 1.0 nfs_read_ops: * op_count: nfs_read_ops: + /
  Filter operation: read

nfs_iorate_write_perc (7)
  Computation: nfs_write_ops: op_count: nfs_write_ops: + /
  Filter operation: write

nfs_iorate_write_perc_exports (6)
  Computation: 1.0 nfs_write_ops: * op_count: nfs_write_ops: + /
  Filter operation: write

smb_iorate_read_perc (8)
  Computation: op_count: op_count: nfs_read_ops: + /
  Filter operation: read

smb_iorate_write_perc (9)
  Computation: op_count: op_count: nfs_write_ops: + /
  Filter operation: write

gpfs_write_avg_lat (10)
  Computation: gpfs_vfsx_write_t: gpfs_vfsx_write: /

gpfs_read_avg_lat (11)
  Computation: gpfs_vfsx_read_t: gpfs_vfsx_read: /

gpfs_create_avg_lat (12)
  Computation: gpfs_vfsx_create_t: gpfs_vfsx_create: /

gpfs_remove_avg_lat (13)
  Computation: gpfs_vfsx_remove_t: gpfs_vfsx_remove: /

gpfs_disk_free_percent
  Computation: gpfs_disk_free_fragkb: gpfs_disk_free_fullkb: + gpfs_disk_disksize: / 100 *
  Group key: gpfs_disk_name

quota_blk_percent
  Computation: gpfs_rq_blk_current: gpfs_rq_blk_hard_limit: / 100 *

quota_file_percent
  Computation: gpfs_rq_file_current: gpfs_rq_file_hard_limit: / 100 *

Table notes:
(1) A system measurement used by a default threshold.
(2) If total memory is smaller than 40 GB, the real value is used for the percentage; otherwise, the percentage is calculated relative to 40 GB.
(3) If the afmHardMemThreshold value is not set, then the default value of 8G is used.
(4) nfs_read_ops+nfs_write_ops
(5) write|op_count+close|op_count
(6) 1.0*nfs_write_ops/(op_count+nfs_write_ops)
(7) nfs_write_ops/(write|op_count+nfs_write_ops)
(8) op_count/(op_count+nfs_read_ops)
(9) op_count/(op_count+nfs_write_ops)
(10) gpfs_vfs_write_t/gpfs_vfs_write
(11) gpfs_vfs_read_t/gpfs_vfs_read
(12) gpfs_vfs_create_t/gpfs_vfs_create
(13) gpfs_vfs_remove_t/gpfs_vfs_remove
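The computations in the preceding table appear to be written in postfix (reverse Polish) notation: the
operands come first and the operators follow, so the DataPool_capUtil computation evaluates as
(total - free) / total * 100. As a purely hypothetical illustration with invented sample values:

gpfs_pool_total_dataKB:sum = 1000000
gpfs_pool_free_dataKB:sum  =  250000

(1000000 - 250000) / 1000000 * 100 = 75 (the data pool is 75 percent used)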

Viewing and analyzing the performance data


The performance monitoring tool displays the performance metrics that are associated with GPFS and
the associated protocols. It helps you get a graphical representation of the status and trends of the key
performance indicators, and analyze IBM Storage Scale performance problems.
You can view and analyze the performance monitoring data by using the following methods:
• Using the IBM Storage Scale GUI.
• Using the mmperfmon command.
• Using an open source visualization tool called Grafana.
Note: The performance data that is available with mmperfmon query, GUI, or any other visualization
tool depends on the sensors that are installed and enabled. The sensor configuration details help you
determine the type of sensors that are available. For more information, see “Configuring the sensor” on
page 107.
Related concepts
Network performance monitoring
Network performance can be monitored either by using Remote Procedure Call (RPC) statistics or it can
be monitored by using the IBM Storage Scale graphical user interface (GUI).
Monitoring I/O performance with the mmpmon command
Use the mmpmon command to monitor the I/O performance of IBM Storage Scale on the node on which it
is run and on other specified nodes.
Using the performance monitoring tool
The performance monitoring tool collects metrics from GPFS and protocols and provides system
performance information. By default, the performance monitoring tool is enabled, and it consists of
Collectors, Sensors, and Proxies.

Performance monitoring using IBM Storage Scale GUI


The IBM Storage Scale GUI provides a graphical representation of the status and historical trends of the
key performance indicators. The manner in which information is displayed in the GUI helps users make
quick and effective decisions easily.
The following table lists the performance monitoring options that are available in the IBM Storage Scale
GUI.

Table 33. Performance monitoring options available in IBM Storage Scale GUI
Option Function
Monitoring > Statistics Displays performance of system resources and file
and object storage in various performance charts.
You can select the necessary charts and monitor
the performance based on the filter criteria.
The pre-defined performance widgets and metrics
help in investigating every node or any particular
node that is collecting the metrics.

Monitoring > Dashboards Provides a more readable and real-time user
interface that shows a graphical representation of
the status and historical trends of key performance
indicators. The dashboard view makes decision-
making easier and faster.
Nodes Provides an easy way to monitor the performance,
health status, and configuration aspects of all
available nodes in the IBM Storage Scale cluster.
Cluster > Network Provides an easy way to monitor the performance
and health status of various types of networks and
network adapters.
Monitoring > Thresholds Provides an option to configure and various
thresholds based on the performance monitoring
metrics. You can also monitor the threshold rules
and the events that are associated with each rule.
Files > File Systems Provides a detailed view of the performance,
capacity, and health aspects of file systems.
Files > Filesets Provides a detailed view of the fileset capacity.
Storage > Pools Provides a detailed view of the performance,
capacity, and health aspects of storage pools.
Storage > NSDs Provides a detailed view of the performance,
capacity, and health aspects of individual NSDs.
Protocols > NFS Exports Provides an overview of the performance aspects
of the NFS export.
Protocols > SMB Shares Provides an overview of the performance aspects
of the SMB shares.
Files > Transparent Cloud Tiering Provides insight into health, performance, and
configuration of the transparent cloud tiering
service.
Files > Active File Management Provides a detailed view of the configuration,
performance, and health status of AFM cache
relationship, AFM disaster recovery (AFMDR)
relationship, and gateway nodes.

The Statistics page is used to select the attributes based on which the performance of the system
needs to be monitored and to compare the performance based on the selected metrics. You can also mark
charts as favorite charts and these charts become available for selection when you add widgets in the
dashboard. You can display only two charts at a time in the Statistics page.
Favorite charts that are defined in the Statistics page and the predefined charts are available for selection
in the Dashboard.
You can configure the system to monitor the performance of the following functional areas in the system:
• Network
• System resources
• NSD server
• IBM Storage Scale client

• NFS
• SMB
• Object
• CTDB
• Transparent cloud tiering. This option is available only when the cluster is configured to work with the
transparent cloud tiering service.
• Waiters
• AFM
Note: The functional areas such as NFS, SMB, Object, CTDB, and Transparent cloud tiering are available
only if the feature is enabled in the system.
The performance and capacity data are collected with the help of the following two components:
Sensor
The sensors are placed on all the nodes and they share the data with the collector. The sensors run on
any node that is needed to collect metrics. Sensors are started by default on the protocol nodes.
Collector
Collects data from the sensors. The metric collector runs on a single node and gathers metrics from
all the nodes that are running the associated sensors. The metrics are stored in a database on the
collector node. The collector ensures aggregation of data when the data gets older. The collector can
run on any node in the system. By default, the collector runs on the management node. You can
configure multiple collectors in the system. To configure performance monitoring through GUI, it is
mandatory to configure a collector on each GUI node.
The following picture provides a graphical representation of the performance monitoring configuration for
GUI.

Figure 2. Performance monitoring configuration for GUI

You can use the Services > Performance Monitoring page to configure sensors. You can also use the
mmperfmon command to configure the performance data collection through the CLI. The GUI displays a
subset of the metrics that are available in the performance monitoring tool.

Configuring performance monitoring options in GUI
You need to configure and enable the performance monitoring for GUI to view the performance data in the
GUI.

Enabling performance tools in management GUI


You need to enable performance tools in the management GUI to display performance data in the
management GUI. For more information, see Enabling performance tools in management GUI section in
the IBM Storage Scale: Administration Guide.

Configuring capacity-related sensors to run on a single node


Several capacity-related sensors must run only on a single node as they collect data for a clustered file
system. For example, GPFSDiskCap, GPFSFilesetQuota, GPFSFileset and GPFSPool.
It is possible to automatically restrict these sensors to a single node. For new installations, capacity-
related sensors are automatically configured to a single node where the capacity collection occurs. An
updated cluster, which was installed before ESS 5.3.7 (IBM Storage Scale 5.0.5), might not be configured
to use this feature automatically and must be reconfigured. To update the configuration, you can use the
mmperfmon config update SensorName.restrict=@CLUSTER_PERF_SENSOR command, where
SensorName values include GPFSFilesetQuota, GPFSFileset, GPFSPool, and GPFSDiskCap.
To collect file system and disk level capacity data on a single node that is selected by the system, run the
following command to update the sensor configuration.

mmperfmon config update GPFSDiskCap.restrict=@CLUSTER_PERF_SENSOR

If the selected node is in the DEGRADED state, then the CLUSTER_PERF_SENSOR is automatically
reconfigured to another node that is in the HEALTHY state. The performance monitoring service is
restarted on the previous and currently selected nodes. For more information, see Automatic assignment
of single node sensors in IBM Storage Scale: Problem Determination Guide.
Note: If the GPFSDiskCap sensor is frequently restarted, it can negatively impact the system performance.
The GPFSDiskCap sensor has a similar impact on system performance as the mmdf command. Therefore,
instead of using @CLUSTER_PERF_SENSOR in the restrict field of a single-node sensor, it is advisable to
use a dedicated healthy node, so that the sensor is not moved repeatedly before the selected node
stabilizes in the HEALTHY state. If you manually configure the restrict field of the capacity sensors,
you must ensure that all the file systems on the specified node are mounted to record file system-related
data, such as capacity.
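
A minimal sketch of this approach, assuming that perfnode1.example.com is the name of a dedicated,
healthy node in your cluster (the node name is an assumption for illustration only):

mmperfmon config update GPFSDiskCap.restrict=perfnode1.example.com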
Use the Services > Performance Monitoring page to select the appropriate data collection periods for
these sensors.
For the GPFSDiskCap sensor, the recommended period is 86400, which means once per day. As the
GPFSDiskCap sensor runs the mmdf command to get the capacity data, it is not recommended to use a
period value of less than 10800 (every 3 hours). To show fileset capacity information, it is necessary to enable
quota for all file systems where fileset capacity must be monitored. For more information, see the -q
option in the mmchfs command and mmcheckquota command.
To update the sensor configuration for triggering an hourly collection of capacity-based fileset capacity
information, run the mmperfmon command as shown in the following example,

mmperfmon config update GPFSFilesetQuota.restrict=@CLUSTER_PERF_SENSOR gui_node GPFSFilesetQuota.period=3600

Verifying sensor and collector configurations


Do the following to verify whether collectors are working properly:
1. Issue systemctl status pmcollector on the GUI node to confirm that the collector is running.
Start the collector if it is not already running.

2. If you cannot start the service, verify the log file that is located at the following location to fix the
issue: /var/log/zimon/ZIMonCollector.log.
3. Use a sample CLI query to test if data collection works properly. For example,

mmperfmon query cpu_user

Do the following to verify whether sensors are working properly:


1. Confirm that the sensor is configured correctly by issuing the mmperfmon config show command.
This command lists the content of the sensor configuration that is located at /opt/IBM/zimon/
ZIMonSensors.cfg. The configuration must point to the node where the collector is running,
and all the expected sensors must be enabled. An enabled sensor has a period greater than 0 in the
same configuration file (see the illustrative excerpt after these steps).
2. Issue systemctl status pmsensors to verify the status of the sensors.
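
The following excerpt is only an illustrative sketch of what the relevant parts of the sensor
configuration can look like; the host name and period values are assumptions and differ on a real
cluster:

collectors = {
    host = "guinode1.example.com"
    port = "4739"
}
sensors = {
    name = "CPU"
    period = 1
},
{
    name = "GPFSDiskCap"
    period = 86400
    restrict = "guinode1.example.com"
}

A sensor entry with period = 0 is disabled; in this sketch, both sensors are enabled and report to the
collector on guinode1.example.com.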

Configuring performance metrics and display options in the Statistics page of the GUI

Use the Monitoring > Statistics page to monitor the performance of system resources and file and object
storage. Performance of the system can be monitored by using various pre-defined charts. You can select
the required charts and monitor the performance based on the filter criteria.
The pre-defined performance charts and metrics help in investigating every node or any particular node
that is collecting the metrics. The following figure shows various configuration options that are available in
the Statistics page of the management GUI.

Figure 3. Statistics page in the IBM Storage Scale management GUI

You can select pre-defined charts that are available for selection from pre-defined chart list. You can
display up to two charts at a time.

Display options in performance charts


The charting section displays the performance details based on various aspects. The GUI provides a rich
set of controls to view performance charts. You can use these controls to perform the following actions on
the charts that are displayed on the page:
• Zoom the chart by using the mouse wheel or resizing the timeline control. Y-axis is automatically
adjusted during zooming.
• Click and drag the chart or the timeline control at the bottom. Y-axis is automatically adjusted during
panning.
• Compare charts side by side. You can synchronize the y-axis and bind the x-axis. To modify the X and
Y axes of the chart, click the configuration symbol next to the title Statistics and select the required
options.
• Link the timelines of the two charts together by using the display options that are available.
• The Dashboard helps to access all single graph charts, which are either predefined or custom created
favorites.

Selecting performance and capacity metrics


To monitor the performance of the system, you need to select the appropriate metrics to be displayed in
the performance charts. Metrics are grouped under the combination of resource types and aggregation
levels. The resource types determine the area from which the data is taken to create the performance
analysis and the aggregation level determines the level at which the data is aggregated. The aggregation
levels that are available for selection vary based on the resource type.
Sensors are configured against each resource type. The following table provides a mapping between
resource types and sensors under the Performance category.

Table 34. Sensors available for each resource type

Resource type: sensor names (candidate nodes)
• Network: Network (all nodes)
• System Resources: CPU, Load, Memory (all nodes)
• NSD Server: GPFSNSDDisk (NSD server nodes)
• IBM Storage Scale Client: GPFSFilesystem, GPFSVFS, GPFSFilesystemAPI (IBM Storage Scale client nodes)
• NFS: NFSIO (protocol nodes running the NFS service)
• SMB: SMBStats, SMBGlobalStats (protocol nodes running the SMB service)
• Waiters: GPFSWaiters (all nodes)
• CTDB: CTDBStats (protocol nodes running the SMB service)
• Object: SwiftAccount, SwiftContainer, SwiftObject, SwiftProxy (protocol nodes running the Object service)
• AFM: GPFSAFM, GPFSAFMFS, GPFSAFMFSET (all nodes)
• Transparent Cloud Tiering: MCStoreGPFSStats, MCStoreIcstoreStats, MCStoreLWEStats (cloud gateway nodes)

The resource type Waiters is used to monitor long-running file system threads. Waiters are
characterized by the purpose of the corresponding file system threads. For example, an RPC call waiter
that is waiting for Network I/O threads or a waiter that is waiting for a local disk I/O file system operation.
Each waiter has an associated wait time, which defines how long the waiter has already been waiting. With
some exceptions, long waiters typically indicate that something in the system is not healthy.
The Waiters performance chart shows the aggregation of the total count of waiters of all nodes in the
cluster above a certain threshold. Different thresholds from 100 milliseconds to 60 seconds can be
selected in the list below the aggregation level. By default, the value shown in the graph is the sum
of the number of waiters that exceed the threshold in all nodes of the cluster at that point in time. The
filter functionality can be used to display waiters data only for some selected nodes or file systems.
Furthermore, there are separate metrics for different waiter types such as Local Disk I/O, Network I/O,
ThCond, ThMutex, Delay, and Syscall.
You can also monitor the capacity details that are aggregated at the following levels:
• NSD
• Node
• File system
• Pool
• Fileset
• Cluster
The following table lists the sensors that are used for capturing the capacity details.

Table 35. Sensors available to capture capacity details

Sensor name (candidate nodes):
• DiskFree (all nodes)
• GPFSFilesetQuota (only a single node)
• GPFSDiskCap (only a single node)
• GPFSPool (only a single node where all GPFS file systems are mounted)
• GPFSFileset (only a single node)

You can edit an existing chart by clicking the ellipsis icon on the performance chart header and selecting Edit
to modify the metric selections. Follow these steps to drill down to the relevant metric:
1. Select the cluster to be monitored from the Cluster field. You can either select the local cluster or the
remote cluster.
2. Select Resource type. This is the area from which the data is taken to create the performance analysis.
3. Select Aggregation level. The aggregation level determines the level at which the data is aggregated.
The aggregation levels that are available for selection vary based on the resource type.
4. Select the entities that need to be graphed. The table lists all entities that are available for the chosen
resource type and aggregation level. When a metric is selected, you can also see the selected metrics
in the same grid and use methods like sorting, filtering, or adjusting the time frame to narrow down the
entities that you want to include.
5. Select Metrics. Metrics is the type of data that need to be included in the performance chart. The list of
metrics that is available for selection varies based on the resource type and aggregation type.
6. Use the Filter option to further narrow down the selection. Depending on the selected object category
and aggregation level, the Filter section can be displayed underneath the aggregation level, allowing
one or more filters to be set. Filters are specified as regular expressions as shown in the following
examples:
• As a single entity:
node1
eth0
• Filter metrics applicable to multiple nodes as shown in the following examples:
– To select a range of nodes such as node1, node2 and node3:
node1|node2|node3
node[1-3]
– To filter based on a string of text. For example, all nodes starting with 'nod' or ending with 'int':
nod.+|.+int
– To filter network interfaces eth0 through eth6, bond0 and eno0 through eno6:
eth[0-6]|bond0|eno[0-6]
– To filter nodes starting with 'strg' or 'int' and ending with 'nx':
(strg)|(int).+nx

Creating favorite charts


Favorite charts are customized versions of the predefined charts. Favorite charts, along with the predefined
charts, are available for selection when you add widgets in the Dashboard page.
To create favorite charts, click the ‘star’ symbol that is placed next to the chart title and enter the label.

Configuring the dashboard to view performance charts


The Monitoring > Dashboard page provides an easy to read, single page, and real-time user interface that
provides a quick overview of the system performance.
The dashboard consists of several dashboard widgets and the associated favorite charts that can be
displayed within a chosen layout. Currently, the following important widget types are available in the
dashboard:
• Performance
• File system capacity by fileset
• System health events

• System overview
• Filesets with the largest growth rate in last week
• Timeline
The following picture highlights the configuration options that are available in the edit mode of the
dashboard.

Figure 4. Dashboard page in the edit mode

Layout options
The highly customizable dashboard layout options help you add or remove widgets and change their display
options. Select the Layout Options option from the menu that is available in the upper right corner of the
Dashboard GUI page to change the layout options. While selecting the layout options, you can either
select the basic layouts that are available for selection or create a new layout by selecting an empty
layout as the starting point.
You can also save the dashboard so that it can be used by other users. Select Create Dashboard and
Delete Dashboard options from the menu that is available in the upper right corner of the Dashboard
page to create and delete dashboards respectively. If several GUIs are running by using CCR, saved
dashboards are available on all nodes.
When you open the IBM Storage Scale GUI after the installation or upgrade, you can see the default
dashboards that are shipped with the product. You can further modify or delete the default dashboards to
suit your requirements.

Widget options
Several dashboard widgets can be added in the selected dashboard layout. Select Edit Widgets option
from the menu that is available in the upper right corner of the Dashboard GUI page to edit or remove
widgets in the dashboard. You can also modify the size of the widget in the edit mode. Use the Add
Widget option that is available in the edit mode to add widgets in the dashboard.
The widgets with type Performance lists the charts that are marked as favorite charts in the Statistics
page of the GUI. Favorite charts along with the predefined charts are available for selection when you add
widgets in the dashboard.

To create favorite charts, click the ‘star’ symbol that is placed next to the chart title in the Monitoring >
Statistics page.

Querying performance data shown in the GUI through CLI


You can query the performance data that is displayed in the GUI through the CLI. This is usually used for
external system integration or to troubleshoot any issues with the performance data displayed in the GUI.
The following example shows how to query the performance data through CLI:

# mmperfmon query "sum(netdev_bytes_r)"

This query displays the following output:

Legend:
1: mr-31.localnet.com|Network|eth0|netdev_bytes_r
2: mr-31.localnet.com|Network|eth1|netdev_bytes_r
3: mr-31.localnet.com|Network|lo|netdev_bytes_r

Row Timestamp netdev_bytes_r netdev_bytes_r netdev_bytes_r


1 2016-03-15-14:52:09 10024
2 2016-03-15-14:52:10 9456
3 2016-03-15-14:52:11 9456
4 2016-03-15-14:52:12 9456
5 2016-03-15-14:52:13 9456
6 2016-03-15-14:52:14 9456
7 2016-03-15-14:52:15 27320
8 2016-03-15-14:52:16 9456
9 2016-03-15-14:52:17 9456
10 2016-03-15-14:52:18 11387

The sensor gets the performance data for the collector and the collector passes it to the performance
monitoring tool to display it in the CLI and GUI. If sensors and collectors are not enabled in the system,
the system does not display the performance data and when you try to query data from a system
resource, it returns an error message. For example, if performance monitoring tools are not configured
properly for the resource type Transparent Cloud Tiering, the system displays the following output while
querying the performance data:

mmperfmon query "sum(mcs_total_requests)" number_buckets 1


Error: No data available for query: 3169

mmperfmon: Command failed. Examine previous error messages to determine cause.

For more information on how to troubleshoot the performance data issues, see Chapter 30, “Performance
issues,” on page 479.

Monitoring performance of nodes


The Monitoring > Nodes page provides an easy way to monitor the performance, health status, and
configuration aspects of all available nodes in the IBM Storage Scale cluster.
The Nodes page provides the following options to analyze performance of nodes:
1. A quick view that gives the number of nodes in the system, and the overall performance of nodes
based on CPU and memory usages.
You can access this view by selecting the expand button that is placed next to the title of the page. You
can close this view if not required.
The graphs in the overview show the nodes that have the highest average performance metric over a
past period. These graphs are refreshed regularly. The refresh intervals of the top three entities
depend on the displayed time frame as shown:
• Every minute for the 5 minutes time frame
• Every 15 minutes for the 1 hour time frame
• Every six hours for the 24 hours time frame
• Every two days for the 7 days' time frame

• Every seven days for the 30 days' time frame
• Every four months for the 365 days' time frame
2. A nodes table that displays many different performance metrics.
To find nodes with extreme values, you can sort the values displayed in the nodes table by different
performance metrics. Click the performance metric in the table header to sort the data based on that
metric.
You can select the time range that determines the averaging of the values that are displayed in the
table and the time range of the charts in the overview from the time range selector, which is placed in
the upper right corner. The metrics in the table do not update automatically. The refresh button at the
top of the table allows you to refresh the table content with more recent data.
You can group the nodes to be monitored based on the following criteria:
• All nodes
• NSD server nodes
• Protocol nodes
3. A detailed view of the performance and health aspects of individual nodes that are listed in the Nodes
page.
Select the node for which you need to view the performance details and select View Details. The
system displays various performance charts on the right pane.
The detailed performance view helps to drill-down to various performance aspects. The following list
provides the performance details that can be obtained from each tab of the performance view:
• Overview tab provides performance chart for the following:
– Client IOPS
– Client data rate
– Server data rate
– Server IOPS
– Network
– CPU
– Load
– Memory
• Events tab helps to monitor the events that are reported in the node. Three filter options are
available to filter the events by their status: Current Issues, Unread Messages, and All Events.
The All Events option displays every event, regardless of whether it is fixed or marked as read.
Similar to the Events page, you can also perform operations like marking events as read and running
fix procedures from this events view.
• File Systems tab provides performance details of the file systems mounted on the node. You can
view the file system read or write throughput, average read or write transactions size, and file system
read or write latency.
• NSDs tab gives status of the disks that are attached to the node. The NSD tab appears only if the
node is configured as an NSD server.
• SMB and NFS tabs provide the performance details of the SMB and NFS services hosted on the node.
These tabs appear in the chart only if the node is configured as a protocol node.
• Network tab displays the network performance details.

Monitoring performance of file systems


The File Systems page provides an easy way to monitor the performance, health status, and configuration
aspects of all available file systems in the IBM Storage Scale cluster.
The following options are available to analyze the file system performance:

1. A quick view that gives the number of protocol nodes, NSD servers, and NSDs that are part of the
available file systems that are mounted on the GUI server. It also provides overall capacity and total
throughput details of these file systems. You can access this view by selecting the expand button that
is placed next to the title of the page. You can close this view if not required.
The graphs displayed in the quick view are refreshed regularly. The refresh intervals depend on
the displayed time frame as shown:
• Every minute for the 5 minutes time frame
• Every 15 minutes for the 1 hour time frame
• Every six hours for the 24 hours time frame
• Every two days for the 7 days time frame
• Every seven days for the 30 days time frame
• Every four months for the 365 days time frame
2. A file systems table that displays many different performance metrics. To find file systems with
extreme values, you can sort the values displayed in the file systems table by different performance
metrics. Click the performance metric in the table header to sort the data based on that metric. You
can select the time range that determines the averaging of the values that are displayed in the table
and the time range of the charts in the overview from the time range selector, which is placed in the
upper right corner. The metrics in the table do not update automatically. The refresh button at the top
of the table allows to refresh the table with more recent data.
3. A detailed view of the performance and health aspects of individual file systems. To see the detailed
view, you can either double-click on the file system for which you need to view the details or select the
file system and click View Details.
The detailed performance view helps to drill-down to various performance aspects. The following list
provides the performance details that can be obtained from each tab of the performance view:
• Overview: Provides an overview of the file system, performance, and properties.
• Events: System health events reported for the file system.
• NSDs: Details of the NSDs that are part of the file system.
• Pools: Details of the pools that are part of the file system.
• Nodes: Details of the nodes on which the file system is mounted.
• Filesets: Details of the filesets that are part of the file system.
• NFS: Details of the NFS exports created in the file system.
• SMB: Details of the SMB shares created in the file system.
• Object: Details of the IBM Storage Scale object storage on the file system.

Monitoring performance of NSDs


The NSDs page provides an easy way to monitor the performance, health status, and configuration
aspects of all network shared disks (NSDs) that are available in the IBM Storage Scale cluster.
The following options are available in the NSDs page to analyze the NSD performance:
1. An NSD table that displays the available NSDs and many different performance metrics. To find NSDs
with extreme values, you can sort the values that are displayed in the table by different performance
metrics. Click the performance metric in the table header to sort the data based on that metric. You
can select the time range that determines the averaging of the values that are displayed in the table
from the time range selector, which is placed in the upper right corner. The metrics in the table are
refreshed based on the selected time frame. You can refresh it manually to see the latest data.
2. A detailed view of the performance and health aspects of individual NSDs are also available in the
NSDs page. Select the NSD for which you need to view the performance details and select View
Details. The system displays various performance charts on the right pane.

The detailed performance view helps to drill-down to various performance aspects. The following list
provides the performance details that can be obtained from each tab of the performance view:
• Overview: Provides an overview of the NSD performance details and related attributes.
• Events: System health events reported for the NSD.
• Nodes: Details of the nodes that serve the NSD.

Viewing performance data with mmperfmon


To view the metrics that are associated with GPFS and the associated protocols, run the mmperfmon
command with the query option. You can also use the mmperfmon command with the query option to
detect performance issues and problems. You can collect metrics for all nodes or for a particular node.
1. Problem: System slowing down
Solution: Use the mmperfmon query compareNodes cpu_user command or the mmperfmon
query compareNodes cpu_system command to compare CPU metrics for all the nodes in your
system.
• Check whether there is a node that has a significantly higher CPU utilization for the entire time
period. If so, see whether this trend continues. You might need to investigate further on this node.
• Check whether there is a node that has significantly lower CPU utilization over the entire period. If
so, check whether that node has a health problem.
• Use the mmperfmon query compareNodes protocolThroughput command to look at the
throughput for each of the nodes for the different protocols.
Note: The metrics of each individual protocol cannot always include exact I/O figures.
• Use the mmperfmon query compareNodes protocolIORate command to look at the I/O
performance for each of the nodes in your system.

2. Problem: A particular node is causing problems


Solution: Use the mmperfmon query usage command to show the CPU, memory, storage, and
network usage.

3. Problem: A particular protocol is causing problems


Solution: Use the mmperfmon query command to investigate problems with your specific protocol.
You can compare cross-node metrics by using the mmperfmon query compareNodes command.
For example, the mmperfmon query compareNodes nfs_read_ops command compares the NFS
read operations on all the nodes that are using NFS. By comparing the different NFS metrics, you can
identify which node is causing the problems. The problem might either manifest itself as running with
much higher values than the other nodes, or much lower (depending on the issue) when considered
over several buckets of time.

4. Problem: A particular protocol is causing problems on a particular node


Solution: Use the mmperfmon query command on the particular node to look deeper into the
protocol performance on that node.
For example, if there is a problem with NFS:
• Use the mmperfmon query nfsIOlatency command to get details of the NFS I/O latency.
• Use the mmperfmon query nfsIOrate command to get details of the NFS I/O rate.
• Use the mmperfmon query nfsThroughput command to get details of the NFS throughput.

For more information, see the mmperfmon command in IBM Storage Scale: Command and Programming Reference Guide.

List of queries
You can make the following predefined queries with query option of the mmperfmon command.

General and network


• usage: Retrieves details about the CPU, memory, storage, and network usage.
• cpu: Retrieves details of the CPU usage in system and user space, and context switches.
• netDetails: Retrieves details about the network.
• netErrors: Retrieves details about network problems, such as collisions, drops, and errors, for all
available networks.
• compareNodes: Compares a single metric across all nodes that are running sensors.

GPFS
GPFS metric queries give an overall view of the GPFS without considering the protocols.
• gpfsCRUDopsLatency: Retrieves information about the GPFS create, retrieve, update, and delete
operations latency.
• gpfsFSWaits: Retrieves information on the maximum waits for read and write operations for all file
systems.
• gpfsNSDWaits: Retrieves information on the maximum waits for read and write operations for all
disks.
• gpfsNumberOperations: Retrieves the number of operations to the GPFS file system.
• gpfsVFSOpCounts: Retrieves VFS operation counts.

Cross protocol
These queries retrieve information after metrics are compared between different protocols on a particular
node.
• protocolIOLatency: Compares latency per protocol (SMB, NFS, and Object).
• protocolIORate: Retrieves the percentage of total I/O rate per protocol (SMB, NFS, and Object).
• protocolThroughput: Retrieves the percentage of total throughput per protocol (SMB, NFS, and
Object).

NFS
These queries retrieve metrics associated with the NFS protocol.
• nfsIOLatency: Retrieves the NFS I/O latency in nanoseconds.
• nfsIORate: Retrieves the NFS I/O operations per second (NFS IOPS).
• nfsThroughput: Retrieves the NFS throughput in bytes per second.
• nfsErrors: Retrieves the NFS error count for read and write operations.
• nfsQueue: Retrieves the NFS read and write queue latency in nanoseconds.
• nfsThroughputPerOp: Retrieves the NFS read and write throughput per operation in bytes.

Object
• objAcc: Details of the Object account performance.
Retrieved metrics:
– account_auditor_time
– account_reaper_time

– account_replicator_time
– account_DEL_time
– account_DEL_err_time
– account_GET_time
– account_GET_err_time
– account_HEAD_time
– account_HEAD_err_time
– account_POST_time
– account_POST_err_time
– account_PUT_time
– account_PUT_err_time
– account_REPLICATE_time
– account_REPLICATE_err_time
• objCon: Details of the Object container performance.
Retrieved metrics:
– container_auditor_time
– container_replicator_time
– container_DEL_time
– container_DEL_err_time
– container_GET_time
– container_GET_err_time
– container_HEAD_time
– container_HEAD_err_time
– container_POST_time
– container_POST_err_time
– container_PUT_time
– container_PUT_err_time
– container_REPLICATE_time
– container_REPLICATE_err_time
– container_sync_deletes_time
– container_sync_puts_time
– container_updater_time
• objObj: Details of the Object performance.
Retrieved metrics:
– object_auditor_time
– object_expirer_time
– object_replicator_partition_delete_time
– object_replicator_partition_update_time
– object_DEL_time
– object_DEL_err_time
– object_GET_time
– object_GET_err_time
– object_HEAD_time

– object_HEAD_err_time
– object_POST_time
– object_POST_err_time
– object_PUT_time
– object_PUT_err_time
– object_REPLICATE_err_time
– object_REPLICATE_time
– object_updater_time
• objPro: Details on the Object proxy performance.
Retrieved metrics:
– proxy_account_latency
– proxy_container_latency
– proxy_object_latency
– proxy_account_GET_time
– proxy_account_GET_bytes
– proxy_account_HEAD_time
– proxy_account_HEAD_bytes
– proxy_account_POST_time
– proxy_account_POST_bytes
– proxy_container_DEL_time
– proxy_container_DEL_bytes
– proxy_container_GET_time
– proxy_container_GET_bytes
– proxy_container_HEAD_time
– proxy_container_HEAD_bytes
– proxy_container_POST_time
– proxy_container_POST_bytes
– proxy_container_PUT_time
– proxy_container_PUT_bytes
– proxy_object_DEL_time
– proxy_object_DEL_bytes
– proxy_object_GET_time
– proxy_object_GET_bytes
– proxy_object_HEAD_time
– proxy_object_HEAD_bytes
– proxy_object_POST_time
– proxy_object_POST_bytes
– proxy_object_PUT_time
– proxy_object_PUT_bytes
• objAccIO: Information on the Object Account IO rate
Retrieved metrics:
– account_GET_time
– account_GET_err_time

– account_HEAD_time
– account_HEAD_err_time
– account_POST_time
– account_POST_err_time
– account_PUT_time
– account_PUT_err_time
• objConIO: Information on the Object Container IO rate
Retrieved metrics:
– container_GET_time
– container_GET_err_time
– container_HEAD_time
– container_HEAD_err_time
– container_POST_time
– container_POST_err_time
– container_PUT_time
– container_PUT_err_time
• objObjIO: Information on the Object I/O rate
Retrieved metrics:
– object_GET_time
– object_GET_err_time
– object_HEAD_time
– object_HEAD_err_time
– object_POST_time
– object_POST_err_time
– object_PUT_time
– object_PUT_err_time
• objProIO: Information on the Object Proxy IO rate
Retrieved metrics:
– proxy_account_GET_time
– proxy_account_GET_bytes
– proxy_container_GET_time
– proxy_container_GET_bytes
– proxy_container_PUT_time
– proxy_container_PUT_bytes
– proxy_object_GET_time
– proxy_object_GET_bytes
– proxy_object_PUT_time
– proxy_object_PUT_bytes
• objAccThroughput: Information on the Object Account Throughput
Retrieved metrics:
– account_GET_time
– account_PUT_time

• objConThroughput: Information on the Object Container Throughput
Retrieved metrics:
– container_GET_time
– container_PUT_time
• objObjThroughput: Information on the Object Throughput
Retrieved metrics:
– object_GET_time
– object_PUT_time
• objProThroughput: Information on the Object Proxy Throughput
Retrieved metrics:
– proxy_account_GET_time
– proxy_account_GET_bytes
– proxy_container_GET_time
– proxy_container_GET_bytes
– proxy_container_PUT_time
– proxy_container_PUT_bytes
– proxy_object_GET_time
– proxy_object_GET_bytes
– proxy_object_PUT_time
– proxy_object_PUT_bytes
• objAccLatency: Information on the Object Account Latency
Retrieved metric:
– proxy_account_latency
• objConLatency: Information on the Object Container Latency
Retrieved metric:
– proxy_container_latency
• objObjLatency: Information on the Object Latency
Retrieved metric:
– proxy_object_latency

SMB
These queries retrieve metrics associated with SMB.
• smb2IOLatency: Retrieves the SMB2 I/O latencies per bucket size (default 1 sec).
• smb2IORate: Retrieves the SMB2 I/O rate in number of operations per bucket size (default 1 sec).
• smb2Throughput: Retrieves the SMB2 Throughput in bytes per bucket size (default 1 sec).
• smb2Writes: Retrieves count, # of idle calls, bytes in and out, and operation time for SMB2 writes.
• smbConnections: Retrieves the number of SMB connections.

CTDB
These queries retrieve metrics associated with CTDB.
• ctdbCallLatency: Retrieves information on the CTDB call latency.
• ctdbHopCountDetails: Retrieves information on the CTDB hop count buckets 0 - 5 for one database.

• ctdbHopCounts: Retrieves information on the CTDB hop counts (bucket 00 = 1 - 3 hops) for all
databases.

Using IBM Storage Scale performance monitoring bridge with Grafana


Grafana is an open source tool for visualizing time series and application metrics. It provides a powerful
platform to create, explore, and share dashboards and data.
To use Grafana for monitoring the performance of an IBM Storage Scale device, you need to install and
set up the IBM Storage Scale bridge for Grafana, which is available as an open source tool on GitHub.
For more information on the installation and usage of the IBM Storage Scale bridge for Grafana, see
ibm-spectrum-scale-bridge-for-grafana.
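
For example, assuming that the bridge repository is hosted under the IBM organization on GitHub (the URL
is an assumption based on the repository name referenced above), it can be fetched as follows:

git clone https://github.com/IBM/ibm-spectrum-scale-bridge-for-grafana.git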

Chapter 5. Monitoring GPUDirect storage
GPUDirect Storage (GDS) in IBM Storage Scale is integrated with the system health monitoring.
Issue the following commands to monitor the health status of the GDS component:

mmhealth node show

Or

mmhealth node show GDS

For more information about the various options available with the mmhealth command, see mmhealth
command in IBM Storage Scale: Command and Programming Reference Guide.
For more information, see GPUDirect Storage troubleshooting topic in IBM Storage Scale: Problem
Determination Guide.

Chapter 6. Monitoring events through callbacks
You can configure the callback feature to provide notifications when node and cluster events occur.
Starting complex or long-running commands, or commands that involve GPFS files, from a callback might
cause unexpected and undesired results, including loss of file system availability. Use the
mmaddcallback command to configure the callback feature.
For more information on how to configure and manage callbacks, see the man pages of the following
commands in IBM Storage Scale: Command and Programming Reference Guide (an illustrative example
follows the list):
• mmaddcallback
• mmdelcallback
• mmlscallback
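
The following sketch shows how such a callback might be registered; the callback identifier, the script
path, and the chosen event are assumptions for illustration and are not product defaults:

mmaddcallback nodeLeaveLogger --command /usr/local/bin/log_node_leave.sh \
    --event nodeLeave --parms "%eventNode %clusterName"

mmlscallback
mmdelcallback nodeLeaveLogger

The mmlscallback command then lists the registered callback, and mmdelcallback removes it again.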

Chapter 7. Monitoring capacity through GUI
You can monitor the capacity of the file system, pools, filesets, NSDs, users, and user groups in the IBM
Storage Scale system.
The capacity details displayed in the GUI are obtained from the following sources:
• GPFS quota database. The system collects the quota details and stores it in the PostgreSQL database.
• Performance monitoring tool collects the capacity data. The GUI queries the performance monitoring
tool and displays capacity data in various pages in the GUI.
Based on the source of the capacity information, different procedures need to be performed to enable
capacity and quota data collection.
For both GPFS quota database and performance monitoring tool-based capacity and quota collection, you
need to use the Files > Quotas page to enable quota data collection per file system and enforce quota
limit checking. If quota is not enabled for a file system:
• No capacity and inode data is collected for users, groups, and filesets.
• Quota limits for users, groups, and filesets cannot be defined.
• No alerts are sent and the data writes are not restricted.
To enable capacity data collection from the performance monitoring tool, the GPFSFilesetQuota
sensor must be enabled. For more information on how to enable the performance monitoring sensor for
capacity data collection, see Manual installation of IBM Storage Scale GUI in IBM Storage Scale: Concepts,
Planning, and Installation Guide.
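
As a minimal sketch, the sensor can also be enabled from the CLI by giving it a non-zero period; the
one-hour period used here is an example value, not a mandated default:

mmperfmon config update GPFSFilesetQuota.period=3600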

Capacity data obtained from the GPFS quota database


The capacity and quota information collected from the GPFS quota database is displayed on the Files >
Quotas and Files > User Capacity pages in the management GUI.
1. Files > Quotas page
Use quotas to control the allocation of files and data blocks in a file system. You can create default, user,
group, and fileset quotas through the Quotas page.
A quota is the amount of disk space and the amount of metadata that is assigned as upper limits
for a specified user, group of users, or fileset. Use the Actions menu to create or modify quotas.
The management GUI allows you to only manage capacity-related quota. The inode-related quota
management is only possible in the command-line interface.
You can specify a soft limit, a hard limit, or both. When you set a soft limit quota, a warning is sent to the
administrator when the file system is close to reaching its storage limit. A grace period starts when the
soft quota limit is reached. Data is written until the grace period expires, or until the hard quota limit is
reached. Grace time resets when used capacity goes less than the soft limit. If you set a hard limit quota,
then you cannot save data after the quota is reached. If the quota is exceeded, then you must delete files
or raise the quota limit to store more data.
Note:
• User or user group quotas for filesets are only supported if the Per Fileset option is enabled at the file
system level. Use the command-line interface to set the option. See the man pages of the mmcrfs and
mmchfs commands for more detail (a sketch follows this note).
• You need to unmount a file system to change the quota enablement method from per file system to per
fileset or vice versa.
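
A sketch of setting this option from the CLI, assuming a file system device named fs1 that is currently
unmounted (the device name is an assumption for illustration):

mmchfs fs1 --perfileset-quota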
You can set default user quotas at the file system level rather than defining user quotas explicitly for each
user. Default quota limits can be set for users. You can specify the general quota collection scope such as
per file system or per fileset to define whether the default quota needs to be defined at file system level or

fileset level and set the default user quota. After this value is set, all child objects that are created under
the file system or fileset are configured with the default soft and hard limits. You can assign a custom
quota limit to individual child objects, but the default limits remain the same unless changed at the file
system or fileset level.
After reconfiguring quota settings, it is recommended to run the mmcheckquota command for the
affected file system to verify the changes.
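
For example, assuming a file system device named gpfs0 (the device name is an assumption), the check can
be run as follows; note that it can impact the performance of I/O operations while it runs:

mmcheckquota gpfs0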
For more information on how to manage quotas, see Managing GPFS quotas section in the IBM Storage
Scale: Administration Guide.
Capacity data from users, groups, and filesets with no quota limit set are not listed in the Quotas page.
Use the Files > User Capacity page to see capacity information of such users and groups. Use the Files >
Filesets page to view current and historic capacity information of filesets.
2. Files > User Capacity page
The Files > User Capacity page provides predefined capacity reports for users and groups. While capacity
information of file systems, pools, and filesets is available in the respective areas of the GUI, the Files >
User Capacity page is the only place where information on used capacity per user or group is available.
The User Capacity page depends on the quota accounting method of the file system. You need to enable
quota for a file system to display the user capacity data. If quota is not enabled, you can follow the
fix procedure in the Files > Quotas page or use the mmchfs <Device> -Q yes CLI command to
enable quota. Even if the capacity limits are not set, the User Capacity page shows data as soon as
the quota accounting is enabled and users write data. This is different in the Quotas page, where only
users and groups with quota limits defined are listed. The user and group capacity quota information is
automatically collected once a day by the GUI.
For users and user groups, you can see the total capacity and whether quotas are set for these objects.
You can also see the percentage of soft limit and hard limit usage. When the hard limit is exceeded, no
more files belonging to the respective user, user group, or fileset can be written. However, exceeding the
hard limit allows a certain grace period before disallowing more file writes. Soft and hard limits for disk
capacity are measured in units of kilobytes (KiB), megabytes (MiB), or gigabytes (GiB). Use the Files >
Quotas page to change the quota limits.

Capacity data collected through the performance monitoring tool


The historical capacity data collection for file systems, pools, and filesets depend on the correctly
configured data collection sensors for fileset quota and disk capacity. When the IBM Storage Scale system
is installed through the installation toolkit, the capacity data collection is configured by default. In other
cases, you need to enable capacity sensors manually.
If the capacity data collection is not configured correctly you can use mmperfmon CLI command or the
Services > Performance Monitoring > Sensors page.
The Services > Performance Monitoring > Sensors page allows you to view and edit the sensor settings. By
default, the collection periods of capacity collection sensors are set to collect data with a period of up to
one day. Therefore, it might take a while until the data is refreshed in the GUI.
The following sensors collect capacity-related information and are used by the GUI (a configuration
sketch follows the list).
GPFSDiskCap
NSD, Pool and File system level capacity. Uses the mmdf command in the background and typically
runs once per day as it is resource intensive. Should be restricted to run on a single node only.
GPFSPool
Pool and file system level capacity. Requires a mounted file system and typically runs every 5 minutes.
Should be restricted to run on a single node only.
GPFSFilesetQuota
Fileset capacity based on the quota collection mechanism. Typically, runs every hour. Should be
restricted to run only on a single node.

GPFSFileset
Inode space (independent fileset) capacity and limits. Typically runs every 5 minutes. Should be
restricted to run only on a single node.
DiskFree
Overall capacity and local node capacity. Can run on every node.
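
A minimal configuration sketch for these sensors, using the typical periods described above; the use of
@CLUSTER_PERF_SENSOR follows the convention described earlier in this guide, and the period values are
examples rather than enforced defaults:

mmperfmon config update GPFSDiskCap.restrict=@CLUSTER_PERF_SENSOR GPFSDiskCap.period=86400
mmperfmon config update GPFSPool.restrict=@CLUSTER_PERF_SENSOR GPFSPool.period=300
mmperfmon config update GPFSFileset.restrict=@CLUSTER_PERF_SENSOR GPFSFileset.period=300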
The Monitoring > Statistics page allows you to create customized capacity reports for file systems, pools, and
filesets. You can store these reports as favorites and add them to the dashboard as well.
The dedicated GUI pages combine information about configuration, health, performance, and capacity in
one place. The following GUI pages provide the corresponding capacity views:
• Files > File Systems
• Files > Filesets
• Storage > Pools
• Storage > NSDs
The Filesets grid and details depend on quota that is obtained from the GPFS quota database and the
performance monitoring sensor GPFSFilesetQuota. If quota is disabled, the system displays a warning
dialog in the Filesets page.

Troubleshooting issues with capacity data displayed in the GUI


Due to the impact that capacity data collection can have on the system, different capacity values are
collected on a different schedule and are provided by different system components. The following list
provides insight on the issues that can arise from the multitude of schedules and subsystems that provide
capacity data:
Capacity in the file system view and the total amount of capacity in the pools and NSDs views do
not match.
The capacity data in the file system view is collected every 10 minutes by performance monitoring
collector, but the capacity data for pools and Network Shared Disks (NSD) are not updated. By default,
NSD data is only collected once per day by performance monitoring collector and it is cached. Clicking
the refresh icon gathers the last two records from performance monitoring tool and it displays the last
record values if they are not null. If the last record has null values, the system displays the previous
one. If the values of both records are null, the system displays N/A and the check box for displaying
a time chart is disabled. The last update date is the record date that is fetched from performance
monitoring tool if the values are not null.
Capacity in the file system view and the total amount of used capacity for all filesets in that file
system do not match.
There are differences both in the collection schedule as well as in the collection mechanism that
contributes to the fact that the fileset capacities do not add up to the file system used capacity.
Scheduling differences:
Capacity information that is shown for filesets in the GUI is collected once per hour by performance
monitoring collector and displayed on Filesets page. When you click the refresh icon you get the
information of the last record from performance monitoring. If the last two records have null values,
you get a 'Not collected' warning for used capacity. The file system capacity information on the file
systems view is collected every 10 minutes by performance monitoring collector and when you click
the refresh icon you get the information of the last record from performance monitoring.
Data collection differences:
Quota values show the sum of the size of all files and are reported asynchronously. The quota
reporting does not consider metadata, snapshots, or capacity that cannot be allocated within a
subblock. Therefore, the sum of the fileset quota values can be lower than the data shown in the
file system view. You can use the CLI command mmlsfileset with the -d and -i options to view
capacity information. The GUI does not provide a means to display these values because of the
performance impact due to data collection.

The sum of all fileset inode values on the view quota window does not match the number of inodes
that are displayed on the file system properties window.
The quota value only accounts for user-created inodes while the properties for the file system also
display inodes that are used internally. Refresh the quota data to update these values.
No capacity data shown on a new system or for a newly created file system
Capacity data may show up with a delay of up to 1 day. The capacity data for file systems, NSDs, and
pools is collected once a day as this is a resource intensive operation. Line charts do not show a line if
only a single data point exists. You can use the hover function in order to see the first data point in the
chart.
The management GUI displays negative fileset capacity or an extremely high used capacity like
millions of Petabytes or 4000000000 used inodes.
This problem can be seen in the quota and filesets views. This problem is caused when the quota
accounting is out of sync. To fix this error, issue the mmcheckquota command. This command
recounts inode and capacity usage in a file system by user, user group, and fileset, and writes the
collected data into the database. It also checks quota limits for users, user groups, and filesets in a
file system. Running this command can impact performance of I/O operations.
No capacity data is displayed on the performance monitoring charts
Verify whether the GPFSFilesetQuota sensor is enabled. You can check the sensor status from the
Services > Performance Monitoring page in the GUI, or from the CLI as shown after this list. For more
information on how to enable the performance monitoring sensor for capacity data collection, see
Manual installation of IBM Storage Scale GUI in IBM Storage Scale: Concepts, Planning, and Installation Guide.
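
A quick CLI check, as referenced above, is to confirm that the sensor has a non-zero period in the
performance monitoring configuration (the grep filter here is just an illustration):

mmperfmon config show | grep -A 3 GPFSFilesetQuota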

Chapter 8. Monitoring AFM and AFM DR
The following sections inform you how to monitor and troubleshoot AFM and AFM DR filesets.

Monitoring fileset states for AFM


AFM fileset can have different states depending on the mode and queue states.
To view the current cache state, run the

mmafmctl filesystem getstate

command, or the

mmafmctl filesystem getstate -j cache_fileset

command.
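
The output of this command looks similar to the following sketch; the file system, fileset, target, and
gateway node names shown here are assumptions for illustration:

mmafmctl fs1 getstate
Fileset Name  Fileset Target                Cache State  Gateway Node  Queue Length  Queue numExec
------------  --------------                -----------  ------------  ------------  -------------
fileset_ro    nfs://homeserver/gpfs/home1   Active       node3         0             7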
See the following table for the explanation of the cache state:

Table 36. AFM states and their description

Inactive
   Condition: The AFM cache is created.
   Description: Operations were not initiated on the cache cluster after the last daemon restart.
   Health: Healthy.
   Administrator's action: None.
FlushOnly
   Condition: Operations are queued.
   Description: Operations have not started to flush.
   Health: Healthy.
   Administrator's action: This is a temporary state and should move to Active when a write is initiated.
Active
   Condition: The AFM cache is active.
   Description: The cache cluster is ready for an operation.
   Health: Healthy.
   Administrator's action: None.
Dirty
   Condition: The AFM cache is active.
   Description: The pending changes in the cache cluster are not played at the home cluster. This state
   does not hamper the normal activity.
   Health: Healthy.
   Administrator's action: None.
Recovery
   Condition: The cache is accessed after primary gateway failure.
   Description: A new gateway is taking over a fileset as primary gateway after the old primary gateway
   failed.
   Health: Healthy.
   Administrator's action: None.
QueueOnly
   Condition: The cache is running some operation.
   Description: Operations such as recovery, resync, and failover are being executed, and operations are
   being queued and not flushed.
   Health: Healthy.
   Administrator's action: This is a temporary state.
Disconnected
   Condition: The primary gateway cannot connect to the NFS server at the home cluster.
   Description: Occurs only in a cache cluster that is created over an NFS export. When parallel data
   transfer is configured, this state shows the connectivity between the primary gateway and the mapped
   home server, irrespective of other gateway nodes.
   Health: Unhealthy.
   Administrator's action: Correct the errant NFS servers on the home cluster.
Unmounted
   Condition: The cache that is using NFS has detected a change in the home cluster, sometimes during
   creation or in the middle of an operation if home exports are meddled with.
   Description: The home NFS is not accessible, the home exports are not exported properly, or the home
   export does not exist.
   Health: Unhealthy.
   Administrator's action: 1. Fix the NFS export issue in the Home setup section and retry for access.
   2. Relink the cache cluster if the cache cluster does not recover. After mountRetryInterval of the
   primary gateway, the cache cluster retries connecting with home.
Unmounted
   Condition: The cache that is using the GPFS protocol detects a change in the home cluster, sometimes
   during creation or in the middle of an operation.
   Description: There are problems accessing the local mount of the remote file system.
   Health: Unhealthy.
   Administrator's action: Check the remote file system mount on the cache cluster and remount if
   necessary.
Dropped
   Condition: Recovery failed.
   Description: The local file system is full, space is not available on the cache or the primary cluster,
   or there is a policy failure during recovery.
   Health: Unhealthy.
   Administrator's action: Fix the issue and access the fileset to retry recovery.
Dropped
   Condition: IW failback failed.
   Description: The local file system is full, space is not available on the cache or the primary cluster,
   or there is a policy failure during recovery.
   Health: Unhealthy.
   Administrator's action: Fix the issue and access the fileset to retry failback.
Dropped
   Condition: A cache with active queue operations is forcibly unlinked.
   Description: All queued operations are being de-queued, and the fileset remains in the Dropped state
   and moves to the Inactive state when the unlinking is complete.
   Health: Healthy.
   Administrator's action: This is a temporary state.
Dropped
   Condition: The old gateway node starts functioning properly after a failure.
   Description: AFM internally performs queue transfers from one gateway to another to handle gateway
   node failures.
   Health: Healthy.
   Administrator's action: The system resolves this state on the next access.
Dropped
   Condition: Cache creation, or in the middle of an operation if the home exports changed.
   Description: Export problems at home, such as the following: the home path is not exported on all NFS
   server nodes that are interacting with the cache clusters; the home cluster is exported after the
   operations have started on the fileset, or the fsid is changed on the home cluster after the fileset
   operations have begun; all home NFS servers do not have the same fsid for the same export path.
   Health: Unhealthy.
   Administrator's action: 1. Fix the NFS export issue in the Home setup section and retry for access.
   2. Relink the cache cluster if the cache cluster does not recover. After mountRetryInterval, the
   primary gateway retries connecting with the home cluster.
Dropped
   Condition: During recovery or normal operation.
   Description: If the gateway queue memory is exceeded, the queue can get dropped. The memory has to be
   increased to accommodate all requests and bring the queue back to the Active state.
   Health: Unhealthy.
   Administrator's action: Increase afmHardMemThreshold.
Expired
   Condition: The RO cache that is configured to expire.
   Description: An event that occurs automatically after prolonged disconnection when the cached contents
   are not accessible.
   Health: Unhealthy.
   Administrator's action: Fix the errant NFS servers on the home cluster.
NeedsFailback
   Condition: The IW cache that needs to complete failback.
   Description: A failback initiated on an IW cache cluster is interrupted and is incomplete.
   Health: Unhealthy.
   Administrator's action: Failback is automatically triggered on the fileset, or the administrator can
   run failback again.
FailbackInProgress
   Condition: Failback initiated on the IW cache.
   Description: Failback is in progress and automatically moves to FailbackCompleted.
   Health: Healthy.
   Administrator's action: None.
FailbackCompleted
   Condition: The IW cache after failback.
   Description: Failback successfully completes on the IW cache cluster.
   Health: Healthy.
   Administrator's action: Run mmafmctl failback --stop on the cache cluster.
NeedsResync
   Condition: The SW cache cluster during home corruption.
   Description: Occurs when the home cluster is accidentally corrupted.
   Health: Unhealthy.
   Administrator's action: Run mmafmctl resync on the cache.
NeedsResync
   Condition: Recovery on the SW cache.
   Description: A rare state that is possible only under error conditions during recovery.
   Health: Unhealthy.
   Administrator's action: No administrator action is required. The system fixes this in the subsequent
   recovery.
Stopped
   Condition: Replication stopped on the fileset.
   Description: The fileset stops sending changes to the gateway node. Mainly used during planned
   downtime.
   Health: Unhealthy.
   Administrator's action: After the planned downtime, run mmafmctl <fs> start -j <fileset> to start
   sending changes to the gateway node and continue replication.

Monitoring fileset states for AFM DR


An AFM DR fileset can have different states depending on the mode and queue states.
Run the mmafmctl getstate command to view the current cache state.
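For example, for a hypothetical file system fs1 with an AFM DR fileset named primary1, the command might be run as follows (the names are illustrative):

# mmafmctl fs1 getstate -j primary1

If you omit the -j option, the command shows the state of all AFM and AFM DR filesets in the file system.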
See the following table:



Table 37. AFM DR states and their description

AFM fileset state: Inactive
  Condition: AFM primary is created.
  Description: Operations have not been initiated on the primary after the last daemon restart.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM fileset state: FlushOnly
  Condition: Operations are queued.
  Description: Operations have not started to flush. This is a temporary state and moves to Active when a write is initiated.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM fileset state: Active
  Condition: AFM primary is active.
  Description: The primary is ready for operation.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM fileset state: Dirty
  Condition: AFM primary is active.
  Description: Indicates that there are pending changes in the primary that are not yet played at the secondary. Does not hamper normal activity.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM fileset state: Recovery
  Condition: The primary is accessed after MDS failure.
  Description: Can occur when a new gateway is taking over a fileset as MDS after the old MDS failed.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM fileset state: QueueOnly
  Condition: The primary is running some operation.
  Description: Can occur when operations such as recovery are being executed and operations are being queued and are not yet flushed.
  Healthy or Unhealthy: Healthy
  Administrator's action: This is a temporary state.

AFM fileset state: Disconnected
  Condition: Occurs when the MDS cannot connect to the NFS server at the secondary.
  Description: Occurs only in a cache cluster that is created over NFS export. When parallel I/O is configured, this state shows the connectivity between the MDS and the mapped home server, irrespective of other gateway nodes.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Correct the errant NFS servers on the secondary cluster.

AFM fileset state: Unmounted
  Condition: The primary that is using NFS detects a change in the secondary, sometimes during creation or in the middle of an operation if the secondary exports are interfered with.
  Description: This can occur if the secondary NFS is not accessible, the secondary exports are not exported properly, or the secondary export does not exist.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: 1. Rectify the NFS export issue as in the secondary setup section and retry access. 2. Relink the primary if it does not recover. After mountRetryInterval of the MDS, the primary retries connecting with the secondary.

AFM fileset state: Unmounted
  Condition: The primary that is using the GPFS protocol detects a change in the secondary cluster, sometimes during creation or in the middle of an operation.
  Description: Occurs when there are problems accessing the local mount of the remote file system.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Check the remote file system mount on the primary cluster and remount if necessary.

AFM fileset state: Dropped
  Condition: Recovery failed.
  Description: Occurs when the local file system is full, space is not available on the primary, or there is a policy failure during recovery.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Fix the issue and access the fileset to retry recovery.

AFM fileset state: Dropped
  Condition: A primary with active queue operations is forcibly unlinked.
  Description: All queued operations are being de-queued, and the fileset remains in the Dropped state and moves to the Inactive state when the unlinking is complete.
  Healthy or Unhealthy: Healthy
  Administrator's action: This is a temporary state.

AFM fileset state: Dropped
  Condition: The old GW node starts functioning properly after a failure.
  Description: AFM internally performs queue transfers from one gateway to another to handle gateway node failures.
  Healthy or Unhealthy: Healthy
  Administrator's action: The system resolves this state on the next access.

AFM fileset state: Dropped
  Condition: Primary creation, or in the middle of an operation if the home exports changed.
  Description: Export problems at the secondary, such as: the home path is not exported on all NFS server nodes that are interacting with the cache clusters (even if the home cluster is exported after the operations have started on the fileset, problems might persist); the fsid is changed on the home cluster after the fileset operations have begun; all home NFS servers do not have the same fsid for the same export path.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: 1. Fix the NFS export issue in the secondary setup section and retry for access. 2. Relink the primary if the cache cluster does not recover. After mountRetryInterval, the MDS retries connecting with the secondary.

AFM fileset state: Dropped
  Condition: During recovery or normal operation.
  Description: If gateway queue memory is exceeded, the queue can get dropped. The memory has to be increased to accommodate all requests and bring the queue back to the Active state.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Increase afmHardMemThreshold.

AFM fileset state: NeedsResync
  Condition: Recovery on the primary.
  Description: This is a rare state and is possible only under error conditions during recovery.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: The problem gets fixed automatically in the subsequent recovery.

AFM fileset state: NeedsResync
  Condition: Failback on the primary, or conversion from GPFS/SW to primary.
  Description: This is a rare state and is possible only under error conditions during failback or conversion.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Rerun failback or conversion.

AFM fileset state: PrimInitProg
  Condition: Setting up the primary and secondary relationship during creation of a primary fileset, conversion of a gpfs, sw, or iw fileset to a primary fileset, or change of the secondary of a primary fileset.
  Description: This state is used while the primary and secondary are in the process of establishing a relationship while the psnap0 is in progress. All operations are disallowed till psnap0 is taken locally. This should move to Active when psnap0 is queued and played on the secondary side.
  Healthy or Unhealthy: Healthy
  Administrator's action: Review errors on psnap0 failure if the fileset state is not active.

AFM fileset state: PrimInitFail
  Condition: Failed to set up the primary and secondary relationship during creation of a primary fileset, conversion of a gpfs, sw, or iw fileset to a primary fileset, or change of the secondary of a primary fileset.
  Description: This is a rare failure state when the psnap0 has not been created at the primary. In this state no data is moved from the primary to the secondary. The administrator should check that the gateway nodes are up and the file system is mounted on them on the primary. The secondary fileset should also be set up correctly and available for use.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Review errors after psnap0 failure. Re-running the mmafmctl convertToPrimary command without any parameters ends this state.

AFM fileset state: FailbackInProgress
  Condition: Primary failback started.
  Description: This is the state when failback is initiated on the primary.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM fileset state: Stopped
  Condition: Replication stopped on fileset.
  Description: The fileset stops sending changes to the gateway node. Mainly used during planned downtime.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: After planned downtime, run mmafmctl <fs> start -j <fileset> to start sending changes or modifications to the gateway node and continue replication.

Monitoring health and events


You can use the mmhealth command to monitor the health of AFM and AFM DR.
To monitor callback events, you can use the mmaddcallback and mmdelcallback commands.

Monitoring with mmhealth


You can use mmhealth to monitor AFM and AFM DR.
Use the following mmhealth command to display the health status of the gateway node:
# mmhealth node show AFM
Node name: p7fbn10.gpfs.net

Component Status Status Change Reasons


------------------------------------------------------------
AFM HEALTHY 3 days ago -
fs1/p7fbn10ADR-4 HEALTHY 3 days ago -
fs1/p7fbn10ADR-5 HEALTHY 3 days ago -

There are no active error events for the component AFM on this node (p7fbn10.gpfs.net).
p7fbn10 Wed Mar 15 04:34:41 1]~# mmhealth node show AFM -Y
mmhealth:State:HEADER:version:reserved:reserved:node:component:entityname:entitytype:status:laststatuschange:
mmhealth:Event:HEADER:version:reserved:reserved:node:component:entityname:entitytype:event:arguments:
activesince:identifier:ishidden:
mmhealth:State:0:1:::p7fbn10.gpfs.net:NODE:p7fbn10.gpfs.net:NODE:DEGRADED:2017-03-11 18%3A48%3A20.600167 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:p7fbn10.gpfs.net:NODE:HEALTHY:2017-03-11 19%3A56%3A48.834633 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:fs1/p7fbn10ADR-5:FILESET:HEALTHY:2017-03-11 19%3A56%3A48.834753 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:fs1/p7fbn10ADR-4:FILESET:HEALTHY:2017-03-11 19%3A56%3A19.086918 EDT:

Use the following mmhealth command to display the health status of all the monitored AFM components
in the cluster:
# mmhealth cluster show AFM
Node name: p7fbn10.gpfs.net

Component Status Status Change Reasons


------------------------------------------------------------
AFM HEALTHY 3 days ago -
fs1/p7fbn10ADR-4 HEALTHY 3 days ago -
fs1/p7fbn10ADR-5 HEALTHY 3 days ago -

There are no active error events for the component AFM on this node (p7fbn10.gpfs.net).
p7fbn10 Wed Mar 15 04:34:41 1]~# mmhealth node show AFM -Y
mmhealth:State:HEADER:version:reserved:reserved:node:component:entityname:entitytype:status:laststatuschange:
mmhealth:Event:HEADER:version:reserved:reserved:node:component:entityname:entitytype:event:arguments:
activesince:identifier:ishidden:
mmhealth:State:0:1:::p7fbn10.gpfs.net:NODE:p7fbn10.gpfs.net:NODE:DEGRADED:2017-03-11 18%3A48%3A20.600167 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:p7fbn10.gpfs.net:NODE:HEALTHY:2017-03-11 19%3A56%3A48.834633 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:fs1/p7fbn10ADR-5:FILESET:HEALTHY:2017-03-11 19%3A56%3A48.834753 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:fs1/p7fbn10ADR-4:FILESET:HEALTHY:2017-03-11 19%3A56%3A19.086918 EDT:

Monitoring callback events for AFM and AFM DR


You can use events to monitor AFM and AFM DR fileset.
All events are at the fileset level. To add the events, run the mmaddcallback command.



An example of the commands is as follows:

# mmdelcallback callback3
# mmaddcallback callback3 --command /tmp/recovery_events.sh --event afmRecoveryStart --parms "%eventName %homeServer %fsName %filesetName %reason"

Table 38. List of events that can be added using mmaddcallback

Event (applicable filesets): Description

afmprepopend (all AFM filesets): Completion of the prefetch task.
afmRecoveryStart (SW, IW, DR filesets): Beginning of the recovery process.
afmRecoveryEnd (SW, IW, DR filesets): End of the recovery process.
afmRPOMiss (primary): Indicates that RPO is missed due to a network delay or a failure to create a snapshot on the secondary side. Failed RPOs are queued and tried again on the secondary.
afmHomeDisconnected (all AFM filesets, DR filesets): For NFS target: The AFM home/DR secondary is not reachable.
afmHomeConnected (all AFM filesets, DR filesets): For NFS target: The AFM home/DR secondary is reachable.
afmFilesetExpired (RO filesets): The fileset has expired.
afmFilesetUnexpired (RO filesets): The fileset is back to Active after expiration.
afmManualResyncComplete (SW, IW, DR filesets): The SW resync or failover process is complete after conversion of a gpfs, sw, or iw fileset to a primary fileset, or after a change of the secondary of a primary fileset.
afmQueueDropped (all AFM filesets, DR filesets): The queue is dropped.
afmfilesetunmounted (all AFM filesets, DR filesets): The fileset is in the Unmounted state.
afmFilesetCreate (all AFM filesets): The fileset is created successfully.
afmFilesetLink (all AFM filesets): The fileset is linked successfully.
afmFilesetChange (all AFM filesets): The fileset is changed successfully. If the fileset was renamed, then the new name is mentioned in %reason.
afmFilesetUnlink (all AFM filesets): The fileset is unlinked successfully.
afmFilesetDelete (all AFM filesets): The fileset is deleted successfully.
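The script that is registered with --command receives the values listed in --parms as positional arguments. The following is a minimal logging sketch, assuming the hypothetical script path /tmp/recovery_events.sh that is used in the preceding example:

#!/bin/bash
# /tmp/recovery_events.sh - illustrative sketch: log AFM callback events
# Arguments arrive in the order given by --parms:
#   $1=%eventName  $2=%homeServer  $3=%fsName  $4=%filesetName  $5=%reason
echo "$(date '+%F %T') event=$1 home=$2 fs=$3 fileset=$4 reason=$5" >> /var/log/afm_callback_events.log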

Monitoring performance
You can use mmperfmon and mmpmon commands to monitor AFM and AFM DR.



Monitoring using mmpmon
You can use mmpmon to monitor AFM and AFM DR.
1. To reset some statistics on a gateway node, run the following commands:

echo "afm_s reset" | mmpmon


echo "afm_s fset all" | mmpmon

2. To reset all statistics, run the following command:

mmfsadm afm resetall

3. To view the statistics, run the following command:

echo afm_s | mmpmon -s -r 0 -d 2000

This command shows statistics from the time the Gateway is functioning. Every gateway recycle resets
the statistics.
The following example is from an AFM Gateway node. The example shows how many operations of
each type were executed on the gateway node.

c2m3n10 Tue May 10 09:55:59 0]~# echo afm_s | mmpmon

mmpmon> mmpmon node 192.168.2.20 name c2m3n10 afm_s s OK


Name Queued Inflight Completed Errors Filtered ENOENT
lookup 0 0 1 0 0 0
create 0 0 20 0 10 0
remove 0 0 0 0 10 0
open 0 0 2 0 0 0
read 0 0 0 0 1 0
write 0 0 20 0 650 0
BytesWritten = 53320860 (50.85 MB) (26035.58 KB/s) BytesToWrite = 0 (0.00 KB)
Queue Delay (s) (min:0 max:19 avg:18)
Async Msgs (expire:50 force:0 sync:4 revoke:0)
NumMsgExecuted = 715
NumHomeconn = 292
NumHomedisc = 292
NumRPOMisses = 1

The fields are described in the following table.

Table 39. Field description of the example

BytesWritten: The amount of data synchronized to home.
BytesToWrite: The amount of data in the queue.
QueueDelay: The maximum delay experienced by operations before sync to home.
NumMsgExecuted: The number of operations executed at home.
NumHomeconn: The number of times home reconnected after disconnection.
NumHomedisc: The number of times home disconnected.
NumRPOMisses: Related to RPOs for the AFM primary fileset.

Monitoring using mmperfmon


You can use mmperfmon to monitor AFM and AFM DR.
Complete the following steps to enable Performance Monitoring tool and query data.



Note: Ensure that monitoring is initialized, performance monitoring is enabled, and other sensors are
collecting data.
1. Run the following command to configure the gateway nodes as performance monitoring nodes:

mmcrnodeclass afmGateways -N gw1,gw2

2. Set perfmon designation for the gateway nodes:

mmchnode --perfmon -N afmGateways

3. Enable the monitoring tool on the gateway nodes to set the collection periods to 10 or higher:

mmperfmon config update GPFSAFM.period=10 GPFSAFMFS.period=10 GPFSAFMFSET.period=10

4. Restrict the gateway nodes to collect AFM data:

mmperfmon config update GPFSAFM.restrict=afmGateways GPFSAFMFS.restrict=afmGateways GPFSAFMFSET.restrict=afmGateways

5. Run the query to display time series data:

mmperfmon query gpfs_afm_fset_bytes_written --bucket-size 60 --number-buckets 1 -N gw1

The system displays output similar to:

Legend:
 1: gw1|GPFSAFMFSET|gpfs0|independentwriter|gpfs_afm_fset_bytes_written

Row  Timestamp            gpfs_afm_fset_bytes_written
1    2017-03-10-13:28:00  133546

Note: You can use the GUI or the Grafana bridge to query collected data.

Monitoring prefetch
You can display the status of an AFM prefetch request by running the mmafmctl prefetch command
without the list-file option.
For example, for file system gpfs1 and fileset iw_1, run the following command:
# mmafmctl gpfs1 prefetch -j iw_1

Fileset Name  Async Read (Pending)  Async Read (Failed)  Async Read (Already Cached)  Async Read (Total)  Async Read (Data in Bytes)
------------  --------------------  -------------------  ---------------------------  ------------------  --------------------------
iw_1          11                    0                    0                            11                  0

This output displays that there are 11 inodes that must be prefetched (Async Read (Pending)). When the
job has completed, the status command displays:

# mmafmctl gpfs1 prefetch -j iw_1

Fileset Name  Async Read (Pending)  Async Read (Failed)  Async Read (Already Cached)  Async Read (Total)  Async Read (Data in Bytes)
------------  --------------------  -------------------  ---------------------------  ------------------  --------------------------
iw_1          0                     0                    10                           11
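To queue a prefetch request for a specific set of files in the first place, you can pass the --list-file option to the same command. The list file path and its contents in the following sketch are only an illustration:

# cat /tmp/prefetch.list
/gpfs/gpfs1/iw_1/dir1/file1
/gpfs/gpfs1/iw_1/dir1/file2

# mmafmctl gpfs1 prefetch -j iw_1 --list-file /tmp/prefetch.list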

Monitoring status using mmdiag


You can use the mmdiag command to monitor AFM and AFM DR in the following ways:
• Use the following mmdiag --afm command to display all active AFM-relationships on a gateway node:
# mmdiag --afm
The system displays output similar to -

=== mmdiag: afm ===


AFM Gateway: fin23p Active

AFM-Cache: fileset_2 (/cache_fs0/fs2) in Device: cache_fs0


Mode: independent-writer
Home: fin21p (nfs://fin21p/test_fs0/cache_fs0)



Fileset Status: Linked
Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 603
AFM-Cache: fileset1 (/cache_fs0/fs1) in Device: cache_fs0
Mode: single-writer
Home: fin21p (nfs://fin21p/test_fs0/cache_fs1)
Fileset Status: Linked
Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 2
AFM-Cache: fileset1 (/test_cache/fs1) in Device: test_cache
Mode: read-only
Home: fin21p (nfs://fin21p/test_fs0/cache_fs2)
Fileset Status: Linked
Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 3
[root@fin23p ~]# mmdiag --afm -Y
mmdiag:afm_fset:HEADER:version:reserved:reserved:cacheName:cachePath:deviceName
:cacheMode:HomeNode:HomePath:filesetStatus:handlerState:cacheState:qState:qLen:qNumExec
mmdiag:afm_gw:HEADER:version:reserved:reserved:gwNode:gwActive:gwDisconn
:Recov:Resync:NodeChg:QLen:QMem:softQMem:hardQMem:pingState
mmdiag:afm_gw:0:1:::fin23p:Active::::::::
mmdiag:afm_fset:0:1:::fileset_2:/cache_fs0/fs2:cache_fs0:independent-writer
:fin21p:nfs%3A//fin21p/test_fs0/cache_fs0:Linked:Mounted:Active:Normal:0:603:
mmdiag:afm_fset:0:1:::fileset1:/cache_fs0/fs1:cache_fs0:single-writer
:fin21p:nfs%3A//fin21p/test_fs0/cache_fs1:Linked:Mounted:Active:Normal:0:2:
mmdiag:afm_fset:0:1:::fileset1:/test_cache/fs1:test_cache:read-only
:fin21p:nfs%3A//fin21p/test_fs0/cache_fs2:Linked:Mounted:Active:Normal:0:3:

• Use the following mmdiag --afm command to display only the specified fileset's relationship:
# mmdiag --afm fileset=cache_fs0:fileset_2
The system displays output similar to -

=== mmdiag: afm ===


AFM-Cache: fileset_2 (/cache_fs0/fs2) in Device: cache_fs0
Mode: independent-writer
Home: fin21p (nfs://fin21p/test_fs0/cache_fs0)
Fileset Status: Linked
Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 603
[root@fin23p ~]# mmdiag --afm fset=cache_fs0:fileset_2 -Y
mmdiag:afm_fset:HEADER:version:reserved:reserved:cacheName:cachePath:deviceName
:cacheMode:HomeNode:HomePath:filesetStatus:handlerState:cacheState:qState:qLen:qNumExec
mmdiag:afm_fset:0:1:::fileset_2:/cache_fs0/fs2:cache_fs0:
independent-writer:fin21p:nfs%3A//fin21p/test_fs0/cache_fs0
:Linked:Mounted:Active:Normal:0:603:

• Use the following mmdiag --afm command to display detailed gateway-specific attributes:
# mmdiag --afm gw
The system displays output similar to -

=== mmdiag: afm ===


AFM Gateway: fin23p Active

QLen: 0 QMem: 0 SoftQMem: 2147483648 HardQMem 5368709120


Ping thread: Started
[root@fin23p ~]# mmdiag --afm gw -Y
mmdiag:afm_gw:HEADER:version:reserved:reserved:gwNode:gwActive:gwDisconn
:Recov:Resync:NodeChg:QLen:QMem:softQMem:hardQMem:pingState
mmdiag:afm_gw:0:1:::fin23p:Active:::::0:0:2147483648:5368709120:Started
[root@fin23p ~]#

• Use the mmdiag --afm command to display all active filesets known to the gateway node:
# mmdiag --afm fileset=all
The system displays output similar to -

=== mmdiag: afm ===


AFM-Cache: fileset1 (/test_cache/fs1) in Device: test_cache
Mode: read-only
Home: fin21p (nfs://fin21p/test_fs0/cache_fs2)
Fileset Status: Linked



Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 3
AFM-Cache: fileset1 (/cache_fs0/fs1) in Device: cache_fs0
Mode: single-writer
Home: fin21p (nfs://fin21p/test_fs0/cache_fs1)
Fileset Status: Linked
Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 2
AFM-Cache: fileset_2 (/cache_fs0/fs2) in Device: cache_fs0
Mode: independent-writer
Home: fin21p (nfs://fin21p/test_fs0/cache_fs0)
Fileset Status: Linked
Handler-state: Mounted
Cache-state: Active
Q-state: Normal Q-length: 0 Q-executed: 603
[root@fin23p ~]# mmdiag --afm fileset=all -Y
mmdiag:afm_fset:HEADER:version:reserved:reserved:cacheName:cachePath:deviceName
:cacheMode:HomeNode:HomePath:filesetStatus:handlerState:cacheState:qState:qLen:qNumExec
mmdiag:afm_fset:0:1:::fileset1:/test_cache/fs1:test_cache
:read-only:fin21p:nfs%3A//fin21p/test_fs0/cache_fs2
:Linked:Mounted:Active:Normal:0:3:
mmdiag:afm_fset:0:1:::fileset1:/cache_fs0/fs1:cache_fs0
:single-writer:fin21p:nfs%3A//fin21p/test_fs0/cache_fs1
:Linked:Mounted:Active:Normal:0:2:
mmdiag:afm_fset:0:1:::fileset_2:/cache_fs0/fs2:cache_fs0
:independent-writer:fin21p:nfs%3A//fin21p/test_fs0/cache_fs0
:Linked:Mounted:Active:Normal:0:603:

Policies used for monitoring AFM and AFM DR


You can monitor AFM and AFM DR using some policies and commands.
Following are the policies used for monitoring:
1. The following file attributes are available through the policy engine:

Table 40. Attributes with their description

P: The file is managed by AFM and AFM DR.
u: The file is managed by AFM and AFM DR, and the file is fully cached. When a file originates at the home, it indicates that the entire file is copied from the home cluster.
v: A file or a soft link is newly created, but not copied to the home cluster.
w: The file has outstanding data updates.
x: A hard link is newly created, but not copied to the home cluster.
y: The file metadata was changed and the change is not copied to the home cluster.
z: A file is local to the cache and is not queued at the home cluster.
j: A file is appended, but not copied to the home cluster. This attribute also indicates complete directories.
k: All files and directories that are not orphan and are repaired.
2. A list of dirty files in the cache cluster:



This is an example of a LIST policy that generates a list of files in the cache with pending changes that
have not been copied to the home cluster.

RULE 'listall' list 'all-files'
    SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' ||
          varchar(misc_attributes) || ' ' || fileset_name )
    WHERE REGEX(misc_attributes,'[P]') AND
          REGEX(misc_attributes,'[w|v|x|y|j]')

If there are no outstanding updates, an output file is not created.


3. A list of partially cached files:
The following example is of a LIST policy that generates a list of partially cached files. A file can be
partially cached if a read of the file is in progress, if partial file caching is enabled, or if the home cluster
becomes unavailable before the file is completely copied.

RULE 'listall' list 'all-files'
    SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' ||
          varchar(misc_attributes) || ' ' || fileset_name )
    WHERE REGEX(misc_attributes,'[P]') AND NOT REGEX(misc_attributes,'[u]') AND
          kb_allocated > 0

This list does not include files that are not cached. If partially-cached files do not exist, an output file is
not created.
4. The custom eviction policy:
The steps to use policies for AFM file eviction are to generate a list of files and then to run the eviction.
The following policy lists all the files that are managed by AFM and were not accessed in the last seven days.

RULE 'prefetch-list'
LIST 'toevict'
WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '7' DAYS
AND REGEX(misc_attributes,'[P]') /* only list AFM managed files */

To limit the scope of the policy or to use it on different filesets, run mmapplypolicy by using a
directory path instead of a file system name:

/usr/lpp/mmfs/bin/mmapplypolicy $path -f $localworkdir -s $localworkdir -P $sharedworkdir/${policy} -I defer

Use mmafmctl to evict the files:

mmafmctl datafs evict --list-file $localworkdir/list.evict
5. A policy of uncached files:
a. The following example is of a LIST policy that generates a list of uncached files in the cache
directory:

RULE EXTERNAL LIST 'u_list'
RULE 'u_Rule' LIST 'u_list' DIRECTORIES_PLUS FOR FILESET ('sw1')
    WHERE NOT REGEX(misc_attributes,'[u]')

b. The following example is of a LIST policy that generates a list of files with size and attributes that
belong to the cache fileset (cacheFset1 is the name of the cache fileset in the example):

RULE 'all' LIST 'allfiles' FOR FILESET ('cacheFset1') SHOW( '/' || VARCHAR(kb_allocated)
|| '/' || varchar(file_size) || '/' ||
VARCHAR(BLOCKSIZE) || '/' || VARCHAR(MISC_ATTRIBUTES) )

Monitoring AFM and AFM DR using GUI


The Files > Active File Management page in the IBM Storage Scale GUI provides an easy way to monitor the
performance, health status, and configuration aspects of the AFM and AFM DR relationships in the IBM
Storage Scale cluster. It also provides details of the gateway nodes that are part of the AFM or AFM DR
relationships.
The following options are available to monitor AFM and AFM DR relationships and gateway nodes:



1. A quick view that gives the details of top relationships between cache and home sites in an AFM or
AFM DR relationship. It also provides performance of gateway nodes by used memory and number of
queued messages. The graphs that are displayed in the quick view are refreshed regularly. The refresh
intervals are depended on the selected time frame. The following list shows the refresh intervals
corresponding to each time frame:
• Every minute for the 5 minutes time frame
• Every 15 minutes for the 1 hour time frame
• Every 6 hours for the 24 hours time frame
• Every two days for the 7 days time frame
• Every seven days for the 30 days time frame
• Every four months for the 365 days time frame
2. Different performance metrics and configuration details in the tabular format. The following tables are
available:
Cache
Provides information about configuration, health, and performance of the AFM feature that is
configured for data caching and replication. It also provides information on the estimated time that
is needed to refresh or flush queues during recovery or resync operations.
Disaster Recovery
Provides information about configuration, health, and performance of AFM DR configuration in the
cluster.
Gateway Nodes
Provides details of the nodes that are designated as the gateway node in the AFM or AFM DR
configuration.
To find an AFM or AFM DR relationship or a gateway node with extreme values, you can sort the values
that are displayed on the table by different attributes. Click the performance metric in the table header
to sort the data based on that metric. You can select the time range that determines the averaging of
the values that are displayed in the table and the time range of the charts in the overview from the
time range selector, which is placed in the upper right corner. The metrics in the table do not update
automatically. The refresh button, that is placed at the top of the table, allows to refresh the table with
more recent data.
3. A detailed view of the performance and health aspects of the individual AFM or AFM DR relationship or
gateway node. To see the detailed view, you can either double-click the row that lists the relationship
or gateway node of which you need to view the details or select the item from the table and click View
Details. The following details are available for each item:
Cache
• Overview: Provides number of available cache inodes and displays charts that show the amount
of data that is transferred, data backlog, and memory used for the queue.
• Events: Provides details of the system health events reported for the AFM component.
• Snapshots: Provides details of the snapshots that are available for the AFM fileset.
• Gateway Nodes: Provides details of the nodes that are configured as gateway node in the AFM
configuration.
Disaster Recovery
• Overview: Provides number of available primary inodes and displays charts that show the
amount of data that is transferred, data backlog, and memory used for the queue.
• Events: Provides details of the system health events reported for the AFM component.
• Snapshots: Provides details of the snapshots that are available for the AFM fileset.
• Gateway Nodes: Provides details of the nodes that are configured as gateway node in the AFM
configuration.



Gateway Nodes
The details of gateway nodes are available under the following tabs:
• Overview tab provides performance chart for the following:
– Client IOPS
– Client data rate
– Server data rate
– Server IOPS
– Network
– CPU
– Load
– Memory
• Events tab helps to monitor the events that are reported in the node. Similar to the Events
page, you can also perform the operations like marking events as read and running fix procedure
from this events view. Only current issues are shown in this view. The Monitoring > Events page
displays the entire set of events that are reported in the system.
• File Systems tab provides performance details of the file systems that are mounted on the node.
File system's read or write throughput, average read or write transactions size, and file system
read or write latency are also available.
Use the Mount File System or Unmount File System options to mount or unmount individual
file systems or multiple file systems on the selected node. The nodes on which the file system
need to be mounted or unmounted can be selected individually from the list of nodes or based
on node classes.
• NSDs tab gives status of the disks that are attached to the node. The NSD tab appears only if the
node is configured as an NSD server.
• SMB and NFS tabs provide the performance details of the SMB and NFS services that are hosted
on the node. These tabs appear in the chart only if the node is configured as a protocol node.
• The AFM tab provides details of the configuration and status of the AFM and AFM DR
relationships for which the node is configured as the gateway node.
It also displays the number of AFM filesets and the corresponding export server maps. Each
export map establishes a mapping between the gateway node and the NFS host name to allow
parallel data transfers from cache to home. One gateway node can be mapped only to a single
NFS server and one NFS server can be mapped to multiple gateway nodes.
• Network tab displays the network performance details.
• Properties tab displays the basic attributes of the node and you can use the Prevent file system
mounts option to specify whether you can prevent file systems from mounting on the node.

Monitoring AFM and AFM DR configuration and performance in the remote cluster
The IBM Storage Scale GUI can monitor only a single cluster. If you want to monitor the AFM and AFM DR
configuration, health, and performance across clusters, the GUI node of the local cluster must establish
a connection with the GUI node of the remote cluster. By establishing a connection between GUI nodes,
both the clusters can monitor each other. To enable remote monitoring capability among clusters, the GUI
nodes that are communicating with each other must be at the same software level.
To establish a connection with the remote cluster, perform the following steps:
1. Perform the following steps on the local cluster to raise the access request:
a. Select the Request Access option that is available under the Outgoing Requests tab to raise the
request for access.
b. In the Request Remote Cluster Access dialog, enter an alias for the remote cluster name and
specify the GUI nodes to which the local GUI node must establish the connection.



c. If you know the credentials of the security administrator of the remote cluster, you can also add the
user name and password of the remote cluster administrator and skip step “2” on page 203 .
d. Click Send to submit the request.
2. Perform the following steps on the remote cluster to grant access:
a. When the request for connection is received, the GUI displays the details of the request in the
Access > Remote Connections > Incoming Requests page.
b. Select Grant Access to grant the permission and establish the connection.
Now, the requesting cluster GUI can monitor the remote cluster. To enable both clusters to monitor each
other, repeat the procedure with reversed roles through the respective GUIs.
Note: Only the GUI user with Security Administrator role can grant access to the remote connection
requests.
When the remote cluster monitoring capabilities are enabled, you can view the following remote cluster
details in the local AFM GUI:
• On home and secondary, you can see the AFM relationships configuration, health status, and
performance values of the Cache and Disaster Recovery grids.
• On the Overview tab of the detailed view, the available home and secondary inodes are available.
• On the Overview tab of the detailed view, the details such as NFS throughput, IOPs, and latency details
are available, if the protocol is NFS.
The performance and status information on gateway nodes are not transferred to home.

Creating and deleting peer and RPO snapshots through GUI


When a peer snapshot is taken, it creates a snapshot of the cache fileset and then queues a snapshot
creation at the home site. This ensures application consistency at both cache and home sites. The
recovery point objective (RPO) snapshot is a type of peer snapshot that is used in the AFM DR setup. It is
used to maintain consistency between the primary and secondary sites in an AFM DR configuration.
Use the Create Peer Snapshot option in the Files > Snapshots page to create peer snapshots. You can
view and delete these peer snapshots from the Snapshots page and also from the detailed view of the
Files > Active File Management page.



Chapter 9. Monitoring AFM to cloud object storage
The following sections inform you how to monitor and troubleshoot AFM to cloud object storage filesets.

Monitoring fileset states for AFM to cloud object storage


An AFM to cloud object storage fileset can have different states, depending on its relation with the cloud
object storage endpoints. You can get these states by using the mmafmctl getstate command.
To view the current cache state, run one of the following commands, where filesystem is the file system
name and cache_fileset is the name of the fileset:

mmafmctl filesystem getstate

mmafmctl filesystem getstate -j cache_fileset
See the following table for the explanation of the cache state:

Table 41. AFM to cloud object storage states and their description

AFM to cloud object storage fileset state: Inactive
  Condition: The fileset is created.
  Description: An AFM to cloud object storage fileset is created, or operations were not initiated on the cluster after the last daemon restart.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM to cloud object storage fileset state: FlushOnly
  Condition: Operations are queued.
  Description: Operations have not started to flush.
  Healthy or Unhealthy: Healthy
  Administrator's action: This is a temporary state and should move to Active when a write is initiated.

AFM to cloud object storage fileset state: Active
  Condition: The fileset cache is active.
  Description: The fileset is ready for an operation.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM to cloud object storage fileset state: Dirty
  Condition: The fileset is active.
  Description: The pending changes in the fileset are not yet played on the cloud object storage.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM to cloud object storage fileset state: Recovery
  Condition: The fileset is accessed after a primary gateway failure, or after a shutdown and restart of the gateway node or cluster.
  Description: A new gateway is taking over a fileset as primary gateway after the old primary gateway failed, or there are some operations in the queue and the gateway or cluster is restarted.
  Healthy or Unhealthy: Healthy
  Administrator's action: None.

AFM to cloud object storage fileset state: QueueOnly
  Condition: The fileset is running some operation.
  Description: Operations such as recovery, resync, or failover are being run, and operations are being queued and not flushed.
  Healthy or Unhealthy: Healthy
  Administrator's action: This is a temporary state.

AFM to cloud object storage fileset state: Disconnected
  Condition: The primary gateway cannot connect to a cloud object storage endpoint.
  Description: A gateway node of the fileset cannot connect to a cloud object storage endpoint.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Check the endpoint configuration and the network connection between the cluster and the cloud object storage.

AFM to cloud object storage fileset state: Unmounted
  Condition: The buckets on a cloud object storage are not accessible.
  Description: The gateway can connect to the endpoint but cannot see or access the buckets.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Check that the buckets exist with the right configuration and keys.

AFM to cloud object storage fileset state: Dropped
  Condition: The recovery operation failed.
  Description: The local file system is full, space is not available on the fileset, or there is a policy failure during recovery.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Fix the issue and access the fileset to retry recovery. Provide more space for the fileset.

AFM to cloud object storage fileset state: Dropped
  Condition: A fileset with active queue operations is forcibly unlinked.
  Description: All queued operations are being de-queued, and the fileset remains in the Dropped state and moves to the Inactive state when the unlinking is complete.
  Healthy or Unhealthy: Healthy
  Administrator's action: This is a temporary state.

AFM to cloud object storage fileset state: Dropped
  Condition: In the middle of an operation, if the buckets or their configuration on a cloud object storage are changed.
  Description: When the relation is set up and working, the buckets on the cloud object storage are changed.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Relink the fileset after the buckets and their configuration are changed back to the previous state.

AFM to cloud object storage fileset state: Dropped
  Condition: During recovery or normal operation.
  Description: If gateway queue memory is exceeded, the queue can get dropped. The memory has to be increased to accommodate all requests and bring the queue back to the Active state.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Increase the afmHardMemThreshold value.

AFM to cloud object storage fileset state: Expired
  Condition: The RO mode relation that is configured to expire.
  Description: An event that occurs automatically after prolonged disconnection when the RO fileset contents are not accessible.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: Fix the errant network or bucket changes.

AFM to cloud object storage fileset state: Stopped
  Condition: Replication stopped on the fileset.
  Description: The fileset stops sending changes to the gateway node. This state is used during planned downtime.
  Healthy or Unhealthy: Unhealthy
  Administrator's action: After planned downtime, run mmafmctl <fs> start -j <fileset> to start sending changes or modifications to the gateway node and continue replication.

Monitoring health and events


You can use the mmhealth command to monitor health of AFM to cloud object storage filesets on
gateway nodes.
1. Check the node status.

# mmhealth node show AFM

A sample output is as follows:

Node name: Node5GW


Component Status Status Change Reasons
-------------------------------------------------------------



AFM HEALTHY 6 days ago -
fs1/singlewriter HEALTHY 6 days ago -

No active error events for the component AFM to cloud object storage on the Node5GW node.
2. To display the health status of all the monitored AFM components in the cluster, use the mmhealth
command.

# mmhealth cluster show AFM

A sample output is as follows:

Component Node Status Reasons


------------------------------------------------------------------------------------------
AFM Node5GW HEALTHY -
AFM Node4GW FAILED afm_fileset_unmounted

Monitoring performance
You can use mmperfmon and mmpmon commands to monitor AFM to cloud object storage.

Monitoring using mmpmon


You can use mmpmon to monitor AFM to cloud object storage.
Complete the following steps to reset and view the AFM statistics.
1. To reset some statistics on a gateway node.

# echo "afm_s reset" | mmpmon


# echo "afm_s fset all" | mmpmon

2. Reset all statistics.

# mmfsadm afm resetall

3. Check the statistics.

# echo afm_s | mmpmon -s -r 0 -d 2000

This command shows statistics from the time the gateway is functioning. Every gateway recycle resets
the statistics.
The following example is from a gateway node. The example shows how many operations of each type
were run on the gateway node.

~# echo afm_s | mmpmon

mmpmon> mmpmon node 192.168.2.20 name c2m3n10 afm_s s OK


Name Queued Inflight Completed Errors Filtered ENOENT
lookup 0 0 1 0 0 0
create 0 0 20 0 10 0
remove 0 0 0 0 10 0
open 0 0 2 0 0 0
read 0 0 0 0 1 0
write 0 0 20 0 650 0
BytesWritten = 53320860 (50.85 MB) (26035.58 KB/s) BytesToWrite = 0 (0.00 KB)
Queue Delay (s) (min:0 max:19 avg:18)
Async Msgs (expire:50 force:0 sync:4 revoke:0)
NumMsgExecuted = 715
NumHomeconn = 292
NumHomedisc = 292
NumRPOMisses = 1

Where:
BytesWritten
The amount of data that is synchronized to the home.



BytesToWrite
The amount of data in the queue.
QueueDelay
The maximum delay that is experienced by operations before sync to the home.
NumMsgExecuted
The number of operations that were run on the home.
NumHomeconn
The number of times when the home was reconnected after disconnection.
NumHomedisc
The number of times home disconnected.

Monitoring using mmperfmon


You can use mmperfmon to monitor AFM to cloud object storage.
Complete the following steps to enable Performance Monitoring tool and query data.
Note: Ensure that monitoring is initialized, performance monitoring is enabled, and other sensors are
collecting data.
1. Configure the gateway nodes as performance monitoring nodes.

# mmcrnodeclass afmGateways -N gw1,gw2

2. Set the --perfmon designation for the gateway nodes.

# mmchnode --perfmon -N afmGateways

3. Enable the monitoring tool on the gateway nodes to set the collection periods to 10 or higher.

# mmperfmon config update GPFSAFM.period=10 GPFSAFMFS.period=10 GPFSAFMFSET.period=10

4. Restrict the gateway nodes to collect AFM data.

# mmperfmon config update GPFSAFM.restrict=afmGateways GPFSAFMFS.restrict=afmGateways GPFSAFMFSET.restrict=afmGateways

5. Run the query to display time series data.

# mmperfmon query gpfs_afm_fset_bytes_written --bucket-size 60 --number-buckets 1 -N gw1

A sample output is as follows:

Legend:
 1: gw1|GPFSAFMFSET|gpfs0|independentwriter|gpfs_afm_fset_bytes_written

Row  Timestamp            gpfs_afm_fset_bytes_written
1    2017-03-10-13:28:00  133546

Monitoring AFM to cloud object storage download and upload


You can use the mmafmcosctl command to monitor AFM to cloud object storage download and upload
operations.
You can use the mmafmcosctl command to upload and download the objects. When the command is
run, the upload or download stats are shown immediately and it progresses as the operation runs on the
fileset.
1. Download the objects.

# mmafmcosctl fs1 singlewriter /gpfs/fs1/singlewriter download --all

A sample output is as follows:

Queued  Failed  AlreadyCached  TotalData (approx in Bytes)
8       0       4              1133881
Object Download successfully queued at the gateway.

Here, the queued and failed objects are shown along with the already cached objects, which are not
fetched from the cloud object storage again. The total data is the approximate amount of data to be
downloaded, in bytes.
2. Upload the objects.

# mmafmcosctl fs1 localupdates /gpfs/fs1/localupdates/ upload --all

A sample output is as follows:

Queued  Failed  AlreadyCached  TotalData (approx in Bytes)
3       0       0              6816
Object Upload successfully queued at the gateway.

The numbers of objects, that are uploaded, are queued to the gateway and shown under the Queued
field.



Chapter 10. GPFS SNMP support
GPFS supports the use of the SNMP protocol for monitoring the status and configuration of the GPFS
cluster. Using an SNMP application, the system administrator can get a detailed view of the system and be
instantly notified of important events, such as a node or disk failure.
The Simple Network Management Protocol (SNMP) is an application-layer protocol that facilitates the
exchange of management information between network devices. It is part of the Transmission Control
Protocol/Internet Protocol (TCP/IP) protocol suite. SNMP enables network administrators to manage
network performance, find and solve network problems, and plan for network growth.
SNMP consists of commands to enumerate, read, and write managed variables that are defined for a
particular device. It also has a trap command, for communicating events asynchronously.
The variables are organized as instances of objects, known as management information bases (MIBs).
MIBs are organized in a hierarchical tree by organization (for example, IBM). A GPFS MIB is defined for
monitoring many aspects of GPFS.
An SNMP agent software architecture typically consists of a master agent and a set of subagents, which
communicate with the master agent through a specific agent/subagent protocol (the AgentX protocol
in this case). Each subagent handles a particular system or type of device. A GPFS SNMP subagent is
provided, which maps the SNMP objects and their values.
You can also configure SNMP by using the Settings > Event Notifications > SNMP Manager page of the
IBM Storage Scale management GUI. For more information on the SNMP configuration options that are
available in the GUI, see “Configuring SNMP manager” on page 5.

Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
For more information, see “Collector node administration” on page 215.
The Net-SNMP master agent (also called the SNMP daemon, or snmpd) must be installed on the
collector node to communicate with the GPFS subagent and with your SNMP management application.
Net-SNMP is included in most Linux distributions and should be supported by your Linux vendor. Source
and binaries for several platforms are available from the download section of the Net-SNMP website
(www.net-snmp.org/download.html).
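For example, on RPM-based distributions Net-SNMP can typically be installed with the distribution package manager. Package names can vary by distribution and release, so treat the following only as a sketch:

# dnf install net-snmp net-snmp-utils net-snmp-libs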
Note: Currently, the collector node must run on the Linux operating system. For an up-to-date list of
supported operating systems, specific distributions, and other dependencies, refer to the IBM Storage
Scale FAQ in IBM Documentation.
The GPFS subagent expects to find the following shared object libraries:

libnetsnmpagent.so -- from Net-SNMP


libnetsnmphelpers.so -- from Net-SNMP
libnetsnmpmibs.so -- from Net-SNMP
libnetsnmp.so -- from Net-SNMP
libwrap.so -- from TCP Wrappers
libcrypto.so -- from OpenSSL

Note: TCP Wrappers and OpenSSL are prerequisites and should have been installed when you installed
Net-SNMP.
TCP Wrappers is deprecated from RHEL 7 onwards and is not available from RHEL 8 onwards. You can
use firewalld as a firewall-level replacement for TCP Wrappers.
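If you choose firewalld, you might restrict SNMP access to a specific management host. The source address in the following sketch is only a placeholder:

# firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.0.2.10" service name="snmp" accept'
# firewall-cmd --reload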
For example, RHEL79 system (ESS legacy nodes running ESS 6.1.5.1):

libnetsnmpagent.so.31 -> libnetsnmpagent.so.31.0.0



For example, RHEL86:

libnetsnmp.so.35 -> libnetsnmp.so.35.0.0

The installed libraries are found in /lib64 or /usr/lib64 or /usr/local/lib64. They may be installed under
names like libnetsnmp.so.5.1.2. The GPFS subagent expects to find them without the appended version
information in the name. Library installation should create these symbolic links for you, so you rarely need
to create them yourself. You can ensure that symbolic links exist to the versioned name from the plain
name. For example,

# cd /usr/lib64
# ln -s libnetsnmpmibs.so.5.1.2 libnetsnmpmibs.so

Repeat this process for all the libraries listed in this topic.
Note: For possible Linux platform and Net-SNMP version compatibility restrictions, see the IBM Storage
Scale FAQ in IBM Documentation.
Related concepts
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.
The management and monitoring subagent
The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.

Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
The following entries are required in the snmpd configuration file on the collector node (usually, /etc/
snmp/snmpd.conf):

master agentx
AgentXSocket tcp:localhost:705
trap2sink managementhost

where:
managementhost
Is the host name or IP address of the host to which you want SNMP traps sent.
If your GPFS cluster has a large number of nodes or a large number of file systems for which information
must be collected, you must increase the timeout and retry parameters for communication between the
SNMP master agent and the GPFS subagent to allow time for the volume of information to be transmitted.
The snmpd configuration file entries for this are:

agentXTimeout 60
agentXRetries 10



where:
agentXTimeout
Is set to 60 seconds for subagent to master agent communication.
agentXRetries
Is set to 10 for the number of communication retries.
Note: Other values may be appropriate depending on the number of nodes and file systems in your GPFS
cluster.
After modifying the configuration file, restart the SNMP daemon.
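For example, on systemd-based Linux distributions the restart is typically done as follows:

# systemctl restart snmpd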
Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.
The management and monitoring subagent
The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.

Configuring management applications


To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
You must also supply the management application with the host name or IP address of the collector node
to be able to extract GPFS monitoring information through SNMP. To do this, you must be familiar with
your SNMP-based management applications.
For more information about Tivoli NetView or Tivoli Netcool, see IBM Documentation.
Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
Starting and stopping the SNMP subagent



The SNMP subagent is started and stopped automatically.
The management and monitoring subagent
The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.

Installing MIB files on the collector node and management node


The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
To install this file on the collector node, do the following:
1. Copy or link the /usr/lpp/mmfs/data/GPFS-MIB.txt MIB file into the SNMP MIB directory
(usually, /usr/share/snmp/mibs).
Alternatively, you could add the following line to the snmp.conf file (usually found in the /etc/snmp
directory):

mibdirs +/usr/lpp/mmfs/data

2. Add the following entry to the snmp.conf file (usually found in the /etc/snmp directory):

mibs +GPFS-MIB

3. Restart the SNMP daemon.


Different management applications have different locations and ways for installing and loading a new MIB
file. The following steps for installing the GPFS MIB file apply only to Net-SNMP. If you are using other
management applications, such as NetView and NetCool, refer to corresponding product manuals (listed
in “Configuring management applications” on page 213) for the procedure of MIB file installation and
loading.
1. Remotely copy the /usr/lpp/mmfs/data/GPFS-MIB.txt MIB file from the collector node into the
SNMP MIB directory (usually, /usr/share/snmp/mibs).
2. Add the following entry to the snmp.conf file (usually found in the /etc/snmp directory):

mibs +GPFS-MIB

3. You might need to restart the SNMP management application. Other steps might be necessary to make
the GPFS MIB available to your management application.
Important: If the GPFS MIB is not available to the management application, add the following bold
entries in the snmpd.conf file:

####
# Third, create a view for us to let the group have rights to:
# Make at least snmpwalk -v 1 localhost -c public system fast again.
# name incl/excl subtree mask(optional)
view systemview included .1.3.6.1.2.1.1
view systemview included .1.3.6.1.2.1.25.1.1
view ibm included .1.3.6.1.4.1.2
####
# Finally, grant the group read-only access to the systemview view.
# group context sec.model sec.level prefix read write notif
#access notConfigGroup "" any noauth exact systemview none none
access notConfigGroup "" any noauth exact ibm none none

The .1.3.6.1.4.1.2 system view version can vary.


Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring Net-SNMP



The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.
The management and monitoring subagent
The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.

Collector node administration


Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
To assign a collector node and start the SNMP agent, enter:

mmchnode --snmp-agent -N NodeName

To unassign a collector node and stop the SNMP agent, enter:

mmchnode --nosnmp-agent -N NodeName

To see if there is a GPFS SNMP subagent collector node defined, enter:

mmlscluster | grep snmp

To change the collector node, issue the following two commands:

mmchnode --nosnmp-agent -N OldNodeName

mmchnode --snmp-agent -N NewNodeName

Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.
The management and monitoring subagent

The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.

Starting and stopping the SNMP subagent


The SNMP subagent is started and stopped automatically.
The SNMP subagent is started automatically when GPFS is started on the collector node. If GPFS is
already running when the collector node is assigned, then the mmchnode command automatically starts
the SNMP subagent.
The SNMP subagent is stopped automatically when GPFS is stopped on the node (mmshutdown) or when
the SNMP collector node is unassigned (mmchnode).
Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
The management and monitoring subagent
The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.

The management and monitoring subagent


The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.
The management and monitoring subagent connects to the GPFS daemon on the collector node to
retrieve updated information about the status of the GPFS cluster.
SNMP data can be retrieved using an SNMP application such as Tivoli NetView. NetView provides a MIB
browser for retrieving user-requested data, as well as an event viewer for displaying asynchronous events.
Information that is collected includes status, configuration, and performance data about GPFS clusters,
nodes, disks, file systems, storage pools, and asynchronous events. The following is a sample of the data
that is collected for each of the following categories:
• Cluster status and configuration (see “Cluster status information” on page 219 and “Cluster
configuration information” on page 219)
– Name
– Number of nodes
– Primary and secondary servers
• Node status and configuration (see “Node status information” on page 220 and “Node configuration
information” on page 220)
– Name

– Current status
– Type
– Platform
• File system status and performance (see “File system status information” on page 221 and “File system
performance information” on page 222)
– Name
– Status
– Total space
– Free space
– Accumulated statistics
• Storage pools (see “Storage pool information” on page 223)
– Name
– File system to which the storage pool belongs
– Total storage pool space
– Free storage pool space
– Number of disks in the storage pool
• Disk status, configuration, and performance (see “Disk status information” on page 223, “Disk
configuration information” on page 224, and “Disk performance information” on page 224)
– Name
– Status
– Total space
– Free space
– Usage (metadata/data)
– Availability
– Statistics
• Asynchronous events (traps) (see “Net-SNMP traps” on page 225)
– File system mounted or unmounted
– Disks added, deleted, or changed
– Node failure or recovery
– File system creation, deletion, or state change
– Storage pool is full or nearly full
Note: If file systems are not mounted on the collector node at the time that an SNMP request is received,
the subagent can still obtain a list of file systems, storage pools, and disks, but some information, such as
performance statistics, is missing.
Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node

The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.

SNMP object IDs


This topic defines the SNMP object IDs.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
The management and monitoring SNMP subagent serves the object identifier (OID) space defined as
ibm.ibmProd.ibmGPFS, which is the numerical enterprises.2.6.212 OID space.
Underneath this top-level space are the following items:
• gpfsTraps at ibmGPFS.0
• gpfsMIBObjects at ibmGPFS.1
• ibmSpectrumScaleGUI at ibmGPFS.10
You can also configure SNMP by using the Settings > Event Notifications > SNMP Manager page of the
IBM Storage Scale management GUI. The OID .1.3.6.1.4.1.2.6.212.10.0.1 is sent by the GUI for each
event.
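
To verify that the subagent is serving this OID space, you can walk it numerically from a host that can
reach the collector node. The following invocation is an illustrative sketch; the community string public and
the host name collectorNode are assumptions that must match your SNMP configuration:

snmpwalk -v 2c -c public collectorNode .1.3.6.1.4.1.2.6.212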

MIB objects
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.

- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
gpfsMIBObjects provides a space of objects that can be retrieved by using a MIB browser
application. Net-SNMP provides the following commands for retrieving them:
• snmpget
• snmpgetnext
• snmptable
• snmpwalk
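
For example, with the GPFS MIB loaded, the cluster status table that is described in the next topic can be
retrieved from the collector node with a command similar to the following. This is an illustrative sketch; the
community string public and the host name collectorNode are placeholders:

snmptable -v 2c -c public collectorNode GPFS-MIB::gpfsClusterStatusTable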

Cluster status information


The following table lists the values and descriptions for the GPFS cluster:

Table 42. gpfsClusterStatusTable: Cluster status information


Value Description
gpfsClusterName The cluster name.
gpfsClusterId The cluster ID.
gpfsClusterMinReleaseLevel The currently enabled cluster functionality level.
gpfsClusterNumNodes The number of nodes that belong to the cluster.
gpfsClusterNumFileSystems The number of file systems that belong to the
cluster.

Cluster configuration information


The following table lists the values and descriptions for the GPFS cluster configuration:

Table 43. gpfsClusterConfigTable: Cluster configuration information


Value Description
gpfsClusterConfigName The cluster name.
gpfsClusterUidDomain The UID domain name for the cluster.
gpfsClusterRemoteShellCommand The remote shell command being used.
gpfsClusterRemoteFileCopyCommand The remote file copy command being used.
gpfsClusterPrimaryServer The primary GPFS cluster configuration server.
gpfsClusterSecondaryServer The secondary GPFS cluster configuration server.
gpfsClusterMaxBlockSize The maximum file system block size.
gpfsClusterDistributedTokenServer Indicates whether the distributed token server is
enabled.
gpfsClusterFailureDetectionTime The desired time for GPFS to react to a node
failure.
gpfsClusterTCPPort The TCP port number.

gpfsClusterMinMissedPingTimeout The lower bound on a missed ping timeout
(seconds).
gpfsClusterMaxMissedPingTimeout The upper bound on missed ping timeout
(seconds).

Node status information


The following table provides description for each GPFS node:

Table 44. gpfsNodeStatusTable: Node status information


Node Description
gpfsNodeName Indicates the node name used by the GPFS
daemon.
gpfsNodeIp Indicates the node IP address.
gpfsNodePlatform Indicates the operating system being used.
gpfsNodeStatus Indicates the node status (for example, up or
down).
gpfsNodeFailureCount Indicates the number of node failures.
gpfsNodeThreadWait Indicates the longest hung thread's wait time in
milliseconds.
gpfsNodeHealthy Indicates whether the node is healthy in terms of
hung threads. If there are hung threads, the value
is no.
gpfsNodeDiagnosis Indicates the number of hung threads and detail on
the longest hung thread.
gpfsNodeVersion Indicates the GPFS product version of the currently
running daemon.

Node configuration information


The following table lists the collected configuration data for each GPFS node:

Table 45. gpfsNodeConfigTable: Node configuration information


Node Description
gpfsNodeConfigName The node name used by the GPFS daemon.
gpfsNodeType The node type (for example, manager/client or
quorum/nonquorum).
gpfsNodeAdmin Indicates whether the node is one of the preferred
admin nodes.
gpfsNodePagePoolL The size of the cache (low 32 bits).
gpfsNodePagePoolH The size of the cache (high 32 bits).
gpfsNodePrefetchThreads The number of prefetch threads.

gpfsNodeMaxMbps An estimate of how many megabytes of data can
be transferred per second.
gpfsNodeMaxFilesToCache The number of inodes to cache for recently-used
files that have been closed.
gpfsNodeMaxStatCache The number of inodes to keep in the stat cache.
gpfsNodeWorker1Threads The maximum number of worker threads that can
be started.
gpfsNodeDmapiEventTimeout The maximum time the file operation threads
block while waiting for a DMAPI synchronous event
(milliseconds).
gpfsNodeDmapiMountTimeout The maximum time that the mount operation waits
for a disposition for the mount event to be set
(seconds).
gpfsNodeDmapiSessFailureTimeout The maximum time the file operation threads
wait for the recovery of the failed DMAPI session
(seconds).
gpfsNodeNsdServerWaitTimeWindowOnMount Specifies a window of time during which a mount
can wait for NSD servers to come up (seconds).
gpfsNodeNsdServerWaitTimeForMount The maximum time that the mount operation waits
for NSD servers to come up (seconds).
gpfsNodeUnmountOnDiskFail Indicates how the GPFS daemon responds when
a disk failure is detected. If it is "true", any
disk failure causes only the local node to forcibly
unmount the file system that contains the failed
disk.

File system status information


The following table shows the collected status information for each GPFS file system:

Table 46. gpfsFileSystemStatusTable: File system status information


Value Description
gpfsFileSystemName Indicates the file system name.
gpfsFileSystemStatus Indicates the status of the file system.
gpfsFileSystemXstatus Indicates the executable status of the file system.
gpfsFileSystemTotalSpaceL Indicates the total disk space of the file system in
kilobytes (low 32 bits).
gpfsFileSystemTotalSpaceH Indicates the total disk space of the file system in
kilobytes (high 32 bits).
gpfsFileSystemNumTotalInodesL Indicates the total number of file system inodes
(low 32 bits).
gpfsFileSystemNumTotalInodesH Indicates the total number of file system inodes
(high 32 bits).

gpfsFileSystemFreeSpaceL Indicates the free disk space of the file system in
kilobytes (low 32 bits).
gpfsFileSystemFreeSpaceH Indicates the free disk space of the file system in
kilobytes (high 32 bits).
gpfsFileSystemNumFreeInodesL Indicates the number of free file system inodes
(low 32 bits).
gpfsFileSystemNumFreeInodesH Indicates the number of free file system inodes
(high 32 bits).

File system performance information


The following table shows the GPFS file system performance information:

Table 47. gpfsFileSystemPerfTable: File system performance information


Value Description
gpfsFileSystemPerfName Indicates the file system name.
gpfsFileSystemBytesReadL Indicates the number of bytes read from disk, not
counting those read from cache (low 32 bits).
gpfsFileSystemBytesReadH Indicates the number of bytes read from disk, not
counting those read from cache (high 32 bits).
gpfsFileSystemBytesCacheL Indicates the number of bytes read from the cache
(low 32 bits).
gpfsFileSystemBytesCacheH Indicates the number of bytes read from the cache
(high 32 bits).
gpfsFileSystemBytesWrittenL Indicates the number of bytes written, to both disk
and cache (low 32 bits).
gpfsFileSystemBytesWrittenH Indicates the number of bytes written, to both disk
and cache (high 32 bits).
gpfsFileSystemReads Indicates the number of read operations supplied
from disk.
gpfsFileSystemCaches Indicates the number of read operations supplied
from cache.
gpfsFileSystemWrites Indicates the number of write operations to both
disk and cache.
gpfsFileSystemOpenCalls Indicates the number of file system open calls.
gpfsFileSystemCloseCalls Indicates the number of file system close calls.
gpfsFileSystemReadCalls Indicates the number of file system read calls.
gpfsFileSystemWriteCalls Indicates the number of file system write calls.
gpfsFileSystemReaddirCalls Indicates the number of file system readdir calls.
gpfsFileSystemInodesWritten Indicates the number of inode updates to disk.
gpfsFileSystemInodesRead Indicates the number of inode reads.

gpfsFileSystemInodesDeleted Indicates the number of inode deletions.
gpfsFileSystemInodesCreated Indicates the number of inode creations.
gpfsFileSystemStatCacheHit Indicates the number of stat cache hits.
gpfsFileSystemStatCacheMiss Indicates the number of stat cache misses.

Storage pool information


The following table lists the collected information for each GPFS storage pool:

Table 48. GPFS storage pool information


Value Description
gpfsStgPoolName The name of the storage pool.
gpfsStgPoolFSName The name of the file system to which the storage
pool belongs.
gpfsStgPoolTotalSpaceL The total disk space in the storage pool in kilobytes
(low 32 bits).
gpfsStgPoolTotalSpaceH The total disk space in the storage pool in kilobytes
(high 32 bits).
gpfsStgPoolFreeSpaceL The free disk space in the storage pool in kilobytes
(low 32 bits).
gpfsStgPoolFreeSpaceH The free disk space in the storage pool in kilobytes
(high 32 bits).
gpfsStgPoolNumDisks The number of disks in the storage pool.

Disk status information


The following table lists the status information collected for each GPFS disk:

Table 49. gpfsDiskStatusTable: Disk status information


Value Description
gpfsDiskName The disk name.
gpfsDiskFSName The name of the file system to which the disk
belongs.
gpfsDiskStgPoolName The name of the storage pool to which the disk
belongs.
gpfsDiskStatus The status of a disk (values: NotInUse, InUse,
Suspended, BeingFormatted, BeingAdded, To Be
Emptied, Being Emptied, Emptied, BeingDeleted,
BeingDeleted-p, ReferencesBeingRemoved,
BeingReplaced or Replacement).
gpfsDiskAvailability The availability of the disk (Unchanged, OK,
Unavailable, Recovering).

gpfsDiskTotalSpaceL The total disk space in kilobytes (low 32 bits).
gpfsDiskTotalSpaceH The total disk space in kilobytes (high 32 bits).
gpfsDiskFullBlockFreeSpaceL The full block (unfragmented) free space in
kilobytes (low 32 bits).
gpfsDiskFullBlockFreeSpaceH The full block (unfragmented) free space in
kilobytes (high 32 bits).
gpfsDiskSubBlockFreeSpaceL The sub-block (fragmented) free space in kilobytes
(low 32 bits).
gpfsDiskSubBlockFreeSpaceH The sub-block (fragmented) free space in kilobytes
(high 32 bits).

Disk configuration information


The following table lists the configuration information collected for each GPFS disk:

Table 50. gpfsDiskConfigTable: Disk configuration information


Value Description
gpfsDiskConfigName The disk name.
gpfsDiskConfigFSName The name of the file system to which the disk
belongs.
gpfsDiskConfigStgPoolName The name of the storage pool to which the disk
belongs.
gpfsDiskMetadata Indicates whether the disk holds metadata.
gpfsDiskData Indicates whether the disk holds data.

Disk performance information


The following table lists the performance information collected for each disk:

Table 51. gpfsDiskPerfTable: Disk performance information


Value Description
gpfsDiskPerfName The disk name.
gpfsDiskPerfFSName The name of the file system to which the disk
belongs.
gpfsDiskPerfStgPoolName The name of the storage pool to which the disk
belongs.
gpfsDiskReadTimeL The total time spent waiting for disk read
operations (low 32 bits).
gpfsDiskReadTimeH The total time spent waiting for disk read
operations (high 32 bits).

gpfsDiskWriteTimeL The total time spent waiting for disk write
operations in microseconds (low 32 bits).
gpfsDiskWriteTimeH The total time spent waiting for disk write
operations in microseconds (high 32 bits).
gpfsDiskLongestReadTimeL The longest disk read time in microseconds (low 32
bits).
gpfsDiskLongestReadTimeH The longest disk read time in microseconds (high
32 bits).
gpfsDiskLongestWriteTimeL The longest disk write time in microseconds (low
32 bits).
gpfsDiskLongestWriteTimeH The longest disk write time in microseconds (high
32 bits).
gpfsDiskShortestReadTimeL The shortest disk read time in microseconds (low
32 bits).
gpfsDiskShortestReadTimeH The shortest disk read time in microseconds (high
32 bits).
gpfsDiskShortestWriteTimeL The shortest disk write time in microseconds (low
32 bits).
gpfsDiskShortestWriteTimeH The shortest disk write time in microseconds (high
32 bits).
gpfsDiskReadBytesL The number of bytes read from the disk (low 32
bits).
gpfsDiskReadBytesH The number of bytes read from the disk (high 32
bits).
gpfsDiskWriteBytesL The number of bytes written to the disk (low 32
bits).
gpfsDiskWriteBytesH The number of bytes written to the disk (high 32
bits).
gpfsDiskReadOps The number of disk read operations.
gpfsDiskWriteOps The number of disk write operations.

Net-SNMP traps
Traps provide asynchronous notification to the SNMP application when a particular event has been
triggered in GPFS. The following table lists the defined trap types:

Table 52. Net-SNMP traps


Net-SNMP trap type This event is triggered by:
Mount By the mounting node when the file system is
mounted on a node.
Unmount By the unmounting node when the file system is
unmounted on a node.

Add Disk By the file system manager when a disk is added to
a file system on a node.
Delete Disk By the file system manager when a disk is deleted
from a file system.
Change Disk By the file system manager when the status of a
disk or the availability of a disk is changed within
the file system.
SGMGR Takeover By the cluster manager when a file system
manager takeover is successfully completed for the
file system.
Node Failure By the cluster manager when a node fails.
Node Recovery By the cluster manager when a node recovers
normally.
File System Creation By the file system manager when a file system is
successfully created.
File System Deletion By the file system manager when a file system is
deleted.
File System State Change By the file system manager when the state of a file
system changes.
New Connection When a new connection thread is established
between the events exporter and the management
application.
Event Collection Buffer Overflow By the collector node when the internal event
collection buffer in the GPFS daemon overflows.
Hung Thread By the affected node when a hung thread is
detected. The GPFS Events Exporter Watchdog
thread periodically checks for threads that have
been waiting for longer than a threshold amount of
time.
Storage Pool Utilization By the file system manager when the utilization of
a storage pool becomes full or almost full.
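
To receive and decode these traps on a management host, you can use the Net-SNMP trap daemon. The
following snmptrapd.conf entry and foreground invocation are an illustrative sketch; the community string
public is an assumption, and production setups typically use their own community or SNMPv3 credentials:

# /etc/snmp/snmptrapd.conf: accept and log traps that are sent with this community
authCommunity log,execute,net public

# run snmptrapd in the foreground, log to stdout, and load the GPFS MIB for decoding
snmptrapd -f -Lo -m +GPFS-MIB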

Chapter 11. Monitoring the IBM Storage Scale system
by using call home
The call home feature collects files, logs, traces, and details of certain system health events from different
nodes and services.

Uploading custom files using call home


The uploaded packages that contain the daily or weekly scheduled uploads, as well as files that are sent
without an associated ticket, are saved on the corresponding call home node for inspection by the customer.
You can upload data by using the following command:

mmcallhome run SendFile --file file [--desc DESC | --pmr {xxxxx.yyy.zzz | TSxxxxxxxxx}]

Discuss this procedure with IBM support before you use it.
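
For example, to upload a single diagnostic archive together with a short description (the file name and
description used here are placeholders):

mmcallhome run SendFile --file /tmp/diag_data.tar.gz --desc node1_diag_upload
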
You can also use the following command to find the exact location of the uploaded packages:

mmcallhome status list --verbose

1. Monitor the tasks.


• To view the status of the currently running and the already completed call home tasks, issue the
following command:

mmcallhome status list

This command gives an output similar to the following:

=== Executed call home tasks ===


Group Task Start Time Status
------------------------------------------------------
autoGroup_1 daily 20181203105943.289 success
autoGroup_1 daily 20181204023601.112 success
autoGroup_1 daily 20181205021401.524 success
autoGroup_1 daily 20181206021401.281 success
autoGroup_1 weekly 20181203110048.724 success
autoGroup_1 sendfile 20181203105920.936 success
autoGroup_1 sendfile 20181203110130.732 success

• To view the details of the status of the call home tasks, issue the following command:

mmcallhome status list --verbose

This command gives an output similar to the following:

=== Executed call home tasks ===

Group Task Start Time Updated Time Status RC or Step Package File Name Original Filename
-----------------------------------------------------------------------------------------------------------------------------------------
autoGroup_1 daily 20181203105943.289 20181203110008 success RC=0 /tmp/mmfs/callhome/rsENUploaded/
13445038716695.5_0_3_0.123456...
autoGroup_1.gat_daily.g_daily.
scale.20181203105943289.cl0.DC
autoGroup_1 weekly 20181209031101.186 20181209031122 success RC=0 /tmp/mmfs/callhome/rsENUploaded/
13445038716695.5_0_3_0.123456...
autoGroup_1.gat_weekly.g_weekly.
scale.20181209031101186.cl0.DC
autoGroup_1 sendfile 20181203105920.936 20181203105928 success RC=0 /tmp/mmfs/callhome/rsENUploaded/ /root/stanza.txt
13445038716695.5_0_3_0.123456...
autoGroup_1.NoText.s_file.scale.
20181203105920936.cl0.DC
autoGroup_1 sendfile 20181203110130.732 20181203110138 success RC=0 /tmp/mmfs/callhome/rsENUploaded/ /root/anaconda-ks.cfgg
13445038716695.5_0_3_0.123456...
autoGroup_1.NoText.s_file.scale.
20181203110130732.cl0.DC

• To list the registered tasks for gather-send, issue the following command:

mmcallhome schedule list

This command gives an output similar to the following:

=== List of registered schedule tasks ===

group scheduleType isEnabled confFile


-------- -------------- ----------- -------------
global daily enabled daily.conf
global weekly enabled weekly.conf

2. Upload the collected data.


The call home functionality provides the following data upload methods to collect and upload the data:
a. File upload: Any file can be specified for upload.
b. Package upload: If the call home schedules are enabled, call home automatically collects
predefined data regularly. The definitions for the data to collect can be found in /usr/lpp/
mmfs/lib/mmsysmon/callhome/callhomeSchedules.json. After the upload, the data
packages are stored in the data package directory for backup. You can find the exact path names by
running the mmcallhome status list --verbose command.
If a prespecified RAS event that degrades the current state of an mmhealth component occurs on
one of the nodes of a call home group, the debugging data is automatically collected and uploaded to
ECuRep for analysis. This feature is called FTDC2CallHome. For more information on FTDC2CallHome,
see the Event-based uploads section in the IBM Storage Scale: Concepts, Planning, and Installation
Guide.

Chapter 12. Monitoring remote cluster through GUI
The IBM Storage Scale GUI can monitor and manage a single cluster. There are cluster setups where
multiple clusters exchange data through AFM or cross cluster mounts. To provide consolidated monitoring
of multiple clusters using the IBM Storage Scale GUI, it is possible to exchange monitoring information
between GUI nodes of different clusters.
By establishing a connection between the GUI nodes, both the clusters can monitor the other cluster.
To enable remote monitoring capability among clusters, the release-level of the GUI nodes that are
communicating with each other must be 5.0.0 or later.
To establish a connection with the remote cluster, perform the following steps:
1. Perform the following steps on the local cluster to raise the access request:
a. Select the Request Access option that is available under the Outgoing Requests tab to raise the
request for access.
b. In the Request Remote Cluster Access dialog, enter an alias for the remote cluster name and
specify the GUI nodes to which the local GUI node must establish the connection.
c. If you know the credentials of the security administrator of the remote cluster, you can also add the
user name and password of the remote cluster administrator and skip step “2” on page 229.
d. Click Send to submit the request.
2. Perform the following steps on the remote cluster to grant access:
a. When the request for connection is received, the GUI displays the details of the request in the
Cluster > Remote Connections > Incoming Requests page.
b. Select Grant Access to grant the permission and establish the connection.
Now, the requesting cluster GUI can monitor the remote cluster. To enable both clusters to monitor each
other, repeat the procedure with reversed roles through the respective GUIs.
Note: Only the GUI user with Security Administrator role can grant access to the remote connection
requests.
You can see the details of the connections established with the remote clusters under the Remote
Cluster tab.

Monitoring remote clusters


The following table lists the remote cluster monitoring options that are available in the GUI.

Table 53. Remote cluster monitoring options available in GUI


GUI option Description
Home The Remote clusters grouping shows the following
details:
• Number of remote clusters connected to
resource cluster.
• Number of file systems that are mounted on the
local nodes.
• Number of local nodes on which the remote file
systems are mounted.

Files > File Systems The grid view provides the following remote cluster
monitoring details:
• Whether the file system is mounted on a remote
cluster.
• Capacity information.
• Number of local nodes on which the file system is
mounted.
• Performance details.
• Pools, NSDs, filesets, and snapshots.

Files > File Systems > View Details > Remote Provides the details of the remote cluster nodes
Nodes where the local file system is mounted.
Files > Filesets The Remote Fileset column in the filesets grid
shows whether the fileset belongs to a remote file
system.
The fileset table also displays the same level of
details for both remote and local filesets. For
example, capacity, parent file system, inodes, AFM
role, snapshots, and so on.

Files > Active File Management When remote monitoring is enabled, you can view
the following AFM details:
• On home and secondary, you can see the AFM
relationships configuration, health status, and
performance values of the Cache and Disaster
Recovery grids.
• On the Overview tab of the detailed view,
the available home and secondary inodes are
available.
• On the Overview tab of the detailed view, the
details such as NFS throughput, IOPs, and
latency details are available, if the protocol is
NFS.

Files > Quotas When remote monitoring is enabled, you can view
quota limits, capacity and inode information for
users, groups and filesets of a file system mounted
from a remote cluster. The user and group name
resolution of the remote cluster is used in this
view. It is not possible to change quota limits on a
file system that is mounted from a remote cluster.
Cluster > Remote Connections Provides the following options:
• Send a connection request to a remote cluster.
• Grant or reject the connection requests received
from remote clusters.
• View the details of the remote clusters that are
connected to the local cluster.

Monitoring > Statistics and Monitoring > You can create customized performance charts to
Dashboard monitor the remote cluster performance.

Monitoring performance of a remote cluster


You can monitor the performance of the remote cluster with the help of performance monitoring tools
that are configured in both the remote and local clusters. The performance details that are collected in the
remote cluster are shared with the local cluster by using REST APIs.
After establishing the connection with the remote cluster by using the Cluster > Remote Connections
page, you can access the performance details of the remote cluster from the following GUI pages:
• Monitoring > Statistics
• Monitoring > Dashboard
• Files > File Systems
To monitor performance details of the remote cluster in the Statistics page, you need to create
customized performance charts by performing the following steps:
1. Access the edit mode by clicking the icon that is available on the upper right corner of the performance
chart and selecting Edit.
2. In the edit mode, select the remote cluster to be monitored from the Cluster field. You can either
select the local cluster or remote cluster from this field.
3. Select Resource type. This is the area from which the data is taken to create the performance analysis.
4. Select Aggregation level. The aggregation level determines the level at which the data is aggregated.
The aggregation levels that are available for selection varies based on the resource type.
5. Select the entities that need to be graphed. The table lists all entities that are available for the chosen
resource type and aggregation level. When a metric is selected, the selected metrics are also shown in
the same grid, and you can sort, filter, or adjust the time frame to find the entities that you want to
include.
6. Select Metrics. Metrics are the types of data that need to be included in the performance chart. The list
of metrics that is available for selection varies based on the resource type and aggregation type.
7. Click Apply to create the customized chart.
After you create the customized performance chart, you can mark it as a favorite chart to get it displayed
on the Dashboard page.
If a file system is mounted on the remote cluster nodes, the performance details of such remote cluster
nodes are available in the Remote Nodes tab of the detailed view of file systems in the Files > File
Systems page.

Chapter 13. Monitoring file audit logging
The following topics describe various ways to monitor file audit logging in IBM Storage Scale.

Monitoring file audit logging states


File audit logging is integrated into the system health infrastructure. Alerts are generated for producers
that write audit events and create the audit logs.
Overview of mmhealth states for file audit logging
• UNKNOWN: The state of the node or the service that is hosted on the node is not known.
• HEALTHY: The node or the service that is hosted on the node is working as expected. There are no
active error events.
• CHECKING: The monitoring of a service or a component that is hosted on the node is starting at the
moment. This state is a transient state and is updated when the startup is completed.
• TIPS: There might be an issue with the configuration and tuning of the components. This status is
only assigned to a tip event.
• DEGRADED: The node or the service that is hosted on the node is not working as expected. A
problem occurred with the component, but it did not result in a complete failure.
• FAILED: The node or the service that is hosted on the node failed due to errors or cannot be reached
anymore.
• DEPEND: The node or the service that is hosted on the node failed due to the failure of some
components. For example, an NFS or SMB service shows this status if authentication failed.
Overview of mmhealth events for mmaudit
For more information, see File audit logging events in the IBM Storage Scale: Problem Determination
Guide.

Monitoring the file audit logging fileset for events


To verify that a node is getting events in the file audit logging fileset after file audit logging is enabled, use
the tail command and write IO to the audited device.
If fileset auditing or skip fileset auditing is enabled, ensure that you write IO to the correct directory path.
The audit log files that contain the events are in the same location for all types.

tail -f /<path>/<to>/<audit>/<fileset>/auditLogFile.latest*
for i in {1..10};do touch /<path>/<to>/<audited_device>/file$i;done

The output should look similar to the following example:

==> auditLogFile.latest_node1.ibm.com <==


{"LWE_JSON": "0.0.2", "path": "/gpfs/gpfs5040/file2", "clusterName": "cluster.ibm.com",
"nodeName": "node1", "nfsClientIp": "",
"fsName": "gpfs5040", "event": "CREATE", "inode": "167938", "linkCount": "1", "openFlags": "0",
"poolName": "system", "fileSize": "0",
"ownerUserId": "0", "ownerGroupId": "0", "atime": "2020-04-06_05:23:41-0700", "ctime":
"2020-04-06_05:23:41-0700",
"mtime": "2020-04-06_05:23:41-0700", "eventTime": "2020-04-06_05:23:41-0700", "clientUserId":
"0", "clientGroupId": "0", "processId": "29909",
"permissions": "200100644", "acls": null, "xattrs": null, "subEvent": "NONE"}
{"LWE_JSON": "0.0.2", "path": "/gpfs/gpfs5040/file2", "clusterName": "cluster", "nodeName":
"node1", "nfsClientIp": "",
"fsName": "gpfs5040", "event": "OPEN", "inode": "167938", "linkCount": "1", "openFlags":
"35138", "poolName": "system", "fileSize": "0",
"ownerUserId": "0", "ownerGroupId": "0", "atime": "2020-04-06_05:23:41-0700", "ctime":
"2020-04-06_05:23:41-0700",
"mtime": "2020-04-06_05:23:41-0700", "eventTime": "2020-04-06_05:23:41-0700", "clientUserId":
"0", "clientGroupId": "0", "processId": "29909",
"permissions": "200100644", "acls": null, "xattrs": null, "subEvent": "NONE"}

{"LWE_JSON": "0.0.2", "path": "/gpfs/gpfs5040/file2", "clusterName": "cluster", "nodeName":
"node1", "nfsClientIp": "",
"fsName": "gpfs5040", "event": "CLOSE",
"inode": "167938", "linkCount": "1", "openFlags": "35138", "poolName": "system", "fileSize":
"0",
"ownerUserId": "0", "ownerGroupId": "0", "atime": "2020-04-06_05:23:41-0700", "ctime":
"2020-04-06_05:23:41-0700",
"mtime": "2020-04-06_05:23:41-0700", "eventTime": "2020-04-06_05:23:41-0700", "clientUserId":
"0", "clientGroupId": "0", "processId": "29909",
"permissions": "200100644", "acls": null, "xattrs": null, "subEvent": "NONE"}

Monitoring file audit logging using mmhealth commands


You can use mmhealth commands to monitor the status of file audit logging.
The following mmhealth commands can be used to view the producer status on a per node or per cluster
basis and file audit logging overall.
Use the following command to view the producer status on a per node basis:

# mmhealth node show FILEAUDITLOG

This command generates output similar to the following example:


Node name: ibmnode1.ibm.com

Component Status Status Change Reasons


-----------------------------------------------------------------------------------
FILEAUDITLOG HEALTHY 4 days ago -
device0 HEALTHY 4 days ago -
device1 HEALTHY 4 days ago -

There are no active error events for the component FILEAUDITLOG on this node (ibmnode1.ibm.com).

Use the following command to view more details about the producers on a per node basis:

# mmhealth node show FILEAUDITLOG -v

This command generates output similar to the following example:

Node name: ibmnode1.ibm.com

Component Status Status Change Reasons


------------------------------------------------------------------------------------------
FILEAUDITLOG HEALTHY 2018-04-09 15:28:27 -
device0 HEALTHY 2018-04-09 15:28:56 -
device1 HEALTHY 2018-04-09 15:31:27 -

Event Parameter Severity Active Since Event Message


------------------------------------------------------------------------------------------------------------------------------

auditp_ok device0 INFO 2018-04-09 15:28:27 Event producer for file system device0 is ok.
auditp_ok device1 INFO 2018-04-09 15:31:26 Event producer for file system device1 is ok.

Use the following command to view the status for the entire cluster:

# mmhealth cluster show FILEAUDITLOG

This command generates output similar to the following example:

Component Node Status Reasons


------------------------------------------------------------------------------
FILEAUDITLOG ibmnode1.ibm.com HEALTHY -
FILEAUDITLOG ibmnode2.ibm.com HEALTHY -
FILEAUDITLOG ibmnode3.ibm.com HEALTHY -
FILEAUDITLOG ibmnode4.ibm.com HEALTHY -

Use the following command to view more details about each file system that has file audit logging
enabled:

# mmhealth cluster show FILEAUDITLOG -v

This command generates output similar to the following example:

Component Node Status Reasons
-------------------------------------------------------------------------
FILEAUDITLOG ibmnode1.ibm.com HEALTHY -
device0 HEALTHY -
device1 HEALTHY -
FILEAUDITLOG ibmnode2.ibm.com HEALTHY -
device0 HEALTHY -
device1 HEALTHY -
FILEAUDITLOG ibmnode3.ibm.com HEALTHY -
device0 HEALTHY -
device1 HEALTHY -
FILEAUDITLOG ibmnode4.ibm.com HEALTHY -
device0 HEALTHY -
device1 HEALTHY -

Monitoring file audit logging using the GUI


You can use the GUI to monitor file audit logging.
• To monitor the health of the file audit logging nodes and file systems or see any events that might be
present, use the Services > File Auditing page.

Monitoring file audit logging using audit log parser


The audit_parser script, which is located in /usr/lpp/mmfs/samples/util, can be used to query
the events in the file system's .audit_log directory.
This script might be helpful for troubleshooting the events in a file system by viewing logs with reduced or
specific information.
Instructions on how to use the script are in the README file, which is in the same directory as the script.

Example of parsing file audit logs with python


You can use a program to parse the JSON of file audit logging to get information about specific attributes.
The file audit log files contain individual JSON objects, one per line, rather than a JSON array. The following
example is a Python 3 program that points to the audit logs, reads each object, and prints specific
information. In the example, the code parses just the event type, event time, file owner, and file path:

import json
import sys

# open the audit log file
fName = "/tmp/fileAuditLog"
try:
    fn = open(fName, 'r')
except IOError:
    print('ERROR: Opening file', fName, '\n')
    sys.exit(1)

# read the file line by line and display the relevant events:
# Event-type, event-time, file-owner, file-path
i = 0
for line in fn:
    obj = json.loads(line)
    if i == 0:
        print("\n{:10} {:26} {:6} {}".format("Event", "Event-time", "Owner", "Path"))
        print('---------------------------------------------------------------------------------------------')
    print("{:10} {:26} {:6} {}".format(obj["event"], obj["eventTime"], obj["ownerUserId"], obj["path"]))
    i = i + 1

Note: There are many open source JSON parsers. There are currently no restrictions on using other
parsing programs.
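
For ad hoc queries, a command-line JSON processor can be used in the same way. The following jq
invocation is an illustrative sketch that prints the event time and path of CREATE events from one audit log
file; the file name is taken from the earlier example and must be adjusted to your environment:

jq -r 'select(.event == "CREATE") | [.eventTime, .path] | @tsv' auditLogFile.latest_node1.ibm.com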

Monitoring file audit logging with rsyslog and SELinux
Files and directories within the file audit logging fileset can inherit Security-Enhanced Linux (SELinux)
security contexts from their parent directories.
When the security context is set on the audit log fileset, any newly created subdirectories and audit log
files inherit the security context from the parent directory. This allows rsyslog to read the file audit logs
when SELinux is enabled. This rsyslog mechanism can then be used by IBM Security QRadar® to ingest file
audit logs.
The steps in the procedure assume that your file system is fs0 and your file audit logging fileset is named
and linked as the .audit_log directory within /ibm/fs0.
Note:
• The file system name and audit log fileset can be different from the ones that are mentioned in the
steps. Adjust the commands to match your own settings.
• You can find the name of your audit log fileset by issuing the mmaudit <fsName> list command and
looking at the Audit Fileset Name column.
• SELinux must be enabled. For more information, see Security-Enhanced Linux support in the IBM
Storage Scale: Concepts, Planning, and Installation Guide.
Follow the steps to set the security context on all existing files and folders:
1. Define a rule in /etc/selinux/targeted/contexts/files/file_contexts.local file by
issuing the following command:

semanage fcontext -a -t var_log_t "/ibm/fs0/.audit_log(/.*)?"

Note: The semanage command is available in the policycoreutils-python-utils rpm.


2. Set the security context of the existing files and folders that matches the rule from the above step by
issuing the following command:

restorecon -Rv /ibm/fs0/.audit_log

3. List the security context of the audit log fileset to verify whether it is set correctly by issuing the
following command:

ls -laZ /ibm/fs0

where the output might look like this:

drwx------. root root system_u:object_r:var_log_t:s0 .audit_log
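
With the security context in place, rsyslog can be configured to read the audit log files, for example with its
imfile input module. The following snippet is a minimal sketch under the same fs0/.audit_log assumptions as
the preceding steps; the wildcard pattern, tag, and facility are assumptions that you must adapt to your audit
fileset layout and rsyslog configuration:

module(load="imfile")
input(type="imfile"
      File="/ibm/fs0/.audit_log/auditLogFile.latest*"
      Tag="scale-file-audit:"
      Severity="info"
      Facility="local6")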

Chapter 14. Monitoring clustered watch folder
The following topics describe various ways to monitor clustered watch folder in IBM Storage Scale.

Monitoring clustered watch folder states


The mmhealth command can be used to check the various states of a clustered watch folder.

mmhealth states for a clustered watch folder


• UNKNOWN: The state of the node or the service that is hosted on the node is not known.
• HEALTHY: The node or the service that is hosted on the node is working as expected. There are no
active error events.
• CHECKING: The monitoring of a service or a component that is hosted on the node is starting at the
moment. This state is a transient state and is updated when the startup is completed.
• TIPS: There might be an issue with the configuration and tuning of the components. This status is only
assigned to a tip event.
• DEGRADED: The node or the service that is hosted on the node is not working as expected. A problem
occurred with the component, but it did not result in a complete failure.
• FAILED: The node or the service that is hosted on the node failed due to errors or cannot be reached
anymore.
• DEPEND: The node or the service that is hosted on the node failed due to the failure of some
components. For example, an NFS or SMB service shows this status if authentication failed.

mmhealth events for the mmwatch command


For more information, see Watch folder events in the IBM Storage Scale: Problem Determination Guide.

Monitoring clustered watch folder with the mmhealth command


Use the following information to monitor clustered watch folder with the mmhealth command.
To see the active status on a watch in mmhealth, use the mmhealth node show command:

# mmhealth node show

Node name: node1.ibm.com


Node status: HEALTHY
Status Change: 1 day ago

Component Status Status Change Reasons


-------------------------------------------------------------------------
GPFS HEALTHY 1 day ago -
NETWORK HEALTHY 1 day ago -
FILESYSTEM HEALTHY 5 hours ago -
CES HEALTHY 1 day ago -
GUI HEALTHY 1 day ago -
PERFMON HEALTHY 1 day ago -
THRESHOLD HEALTHY 1 day ago -
WATCHFOLDER HEALTHY 1 day ago -

This command displays the current state of the watch folder components on the defined node.
To see the status across the cluster, use the mmhealth cluster show command:

# mmhealth cluster show

Component Total Failed Degraded Healthy Other
------------------------------------------------------------------------------------------------
---
NODE 4 0 2 0 2
GPFS 4 0 1 0 3
NETWORK 4 0 0 4 0
FILESYSTEM 3 0 0 3 0
DISK 6 0 0 6 0
CES 2 0 1 1 0
CESIP 1 0 0 1 0
GUI 1 0 1 0 0
PERFMON 4 0 0 4 0
THRESHOLD 4 0 0 4 0
WATCHFOLDER 4 0 0 4 0

You can then see a more verbose status of the health with the mmhealth node show watchfolder
-v command:
# mmhealth node show watchfolder -v

Node name: node1.ibm.com

Component Status Status Change Reasons


--------------------------------------------------------------------------------------------------------------
---
WATCHFOLDER HEALTHY 2019-03-19 16:50:57 -
gpfs0/13222185860284578504/CLW1553028625 HEALTHY 2019-03-19 16:50:57 -

Event Parameter Severity Active Since Event Message


--------------------------------------------------------------------------------------------------------------
watchfolder_service_ok gpfs0/13222185860284578504/ INFO 2019-03-19 16:50:57 Watchfolder service is
running.
CLW1553028625

watchfolderp_ok gpfs0/13222185860284578504/ INFO 2019-03-19 16:50:57 Event producer for file


system
CLW1553028625 gpfs0 is ok.

Monitoring clustered watch folder with the mmwatch status command


Use this information to monitor clustered watch folder with the mmwatch status command.
To see the statuses for all of the current watches, use the mmwatch all status command:

# mmwatch all status


Device Watch Path Watch ID Watch State
gpfs0 /ibm/gpfs0 CLW1553608557 Active
Node Name Status
node1.ibm.com HEALTHY
node2.ibm.com HEALTHY
node3.ibm.com HEALTHY
node4.ibm.com HEALTHY

To get a more verbose status for a specific watch ID, use the mmwatch <device> status --watch-
id <watchID> -v command. This command shows you the status of that single watch ID and lists up to
10 of the most recent entries from the local node system health database.

mmwatch gpfs0 status --watch-id CLW1584552058 -v

Device Watch Path Watch ID Watch State


gpfs0 /ibm/gpfs0/ CLW1584552058 Active
Node Name Status
node1.ibm.com HEALTHY
=== Log entries ===
2020-03-18 10:23:00.729191 MST watchfolderp_found INFO New event producer
for gpfs0/12532915696020090994/CLW1584552058
was configured.

Node Name Status


node2.ibm.com HEALTHY
=== Log entries ===
2020-03-18 10:23:29.662479 MST watchfolderp_found INFO New event producer
for gpfs0/12532915696020090994/CLW1584552058
was configured.

Node Name Status
node3.ibm.com HEALTHY
=== Log entries ===
2020-03-18 10:23:34.222176 MST watchfolderp_found INFO New event producer
for gpfs0/12532915696020090994/CLW1584552058
was configured.

Node Name Status


node4.ibm.com HEALTHY
=== Log entries ===
2020-03-18 10:23:41.099290 MST watchfolderp_found INFO New event producer
for gpfs0/12532915696020090994/CLW1584552058
was configured.

Node Name Status


node5.ibm.com HEALTHY
=== Log entries ===
2020-03-18 10:23:20.233489 MST watchfolderp_found INFO New event producer
for gpfs0/12532915696020090994/CLW1584552058
was configured.

Node Name Status


node6.ibm.com HEALTHY
=== Log entries ===
2020-03-18 10:23:09.611998 MST watchfolderp_found INFO New event producer
for gpfs0/12532915696020090994/CLW1584552058
was configured.

Monitoring clustered watch folder using GUI


You can use the Files > Clustered Watch Folder page in the IBM Storage Scale GUI to enable and disable
clustered watch for a file system. This page also provides the details of the clustered watch folders that
are configured in the system.
Enabling or disabling clustered watch
To enable clustered watch for a file system, click Enable Clustered Watch. The Create Clustered Watch
Folder wizard appears. Configure the following details in the Create Clustered Watch Folder wizard:
• File system for which clustered watch needs to be configured.
• Type of watch. Available options are: File System, Fileset, and Inode Space.
• Events to be watched.
• Sink details.
To disable a clustered watch, select the watch from the list and then select Disable watch from the
Actions menu.
You can also view the details of each watch folder and the health events that are raised for each watch
folder from the detailed view. To access the detailed view of a watch folder, select the watch folder and
click View Details.
The Files > File Systems provides details such as whether the clustered watch is enabled at the file
system level and the number of watched filesets under a file system. Similarly, the Files > Filesets page
shows whether the clustered watch is enabled at the fileset level and the number of watched inode
spaces.

Chapter 15. Monitoring local read-only cache
The following topics describe various ways to monitor local read-only cache (LROC) in IBM Storage Scale.
For more information about LROC devices, see Local read-only cache in IBM Storage Scale: Administration
Guide.

Monitoring health and events with mmhealth commands


You can use the mmhealth commands to monitor the health and events of LROC devices.
The following mmhealth commands can be used to display the events and health status of all monitored
LROC components in the cluster.
Use the following command to display the status of LROC:

# mmhealth node show localcache

Node name: nodexyz.example.net


Component Status Status Change Reasons
------------------------------------------------------------------------------------
LOCALCACHE TIPS 4 days ago lroc_set_buffer_desc_tip
lroc_1 HEALTHY 6 days ago -
lroc_2 HEALTHY 6 days ago -
Event Parameter Severity Active Since Event Message
--------------------------------------------------------------------------------------------------------------
-----------------
lroc_set_buffer LOCALCACHE TIP 4 days ago This node has LROC devices with a total capacity of
12345 GB.
_desc_tip Optimal LROC performance requires setting the
maxBufferDescs config option.
The value of desired buffer descriptors for this node is
'12345'
based on assumed 4 MB data block size.

There are no active error events for the LOCALCACHE component on this node (nodexyz.example.net).
Use the following command to display the status of LROC by using --verbose:

# mmhealth node show localcache -v

Node name: nodexyz.example.net


Component Status Status Change Reasons
-------------------------------------------------------------------------------------------
LOCALCACHE TIPS 2021-03-19 10:17:16 lroc_set_buffer_desc_tip
lroc_1 HEALTHY 2021-03-17 10:20:13 -
lroc_2 HEALTHY 2021-03-17 10:43:43 -
Event Parameter Severity Active Since Event Message
--------------------------------------------------------------------------------------------------------------
------------------------------------
lroc_daemon_running LOCALCACHE INFO 2021-03-17 10:20:12 lroc is normal
lroc_set_buffer_desc_tip LOCALCACHE TIP 2021-03-19 10:17:15 This node has LROC devices with a
total capacity of 12345 GB.
Optimal LROC performance requires
setting the maxBufferDescs config option.
The value of desired buffer
descriptors for this node is '12345' based
on assumed 4 MB data block size.
lroc_disk_normal lroc_1 INFO 2021-03-17 10:20:12 lroc lroc_1 device is normal
lroc_disk_normal lroc_2 INFO 2021-03-17 10:43:43 lroc lroc_2 device is normal

Use the following command to display the health status of all monitored LROC components in the cluster:

# mmhealth cluster show localcache

Component Node Status Reasons


------------------------------------------------------------------------------------------
LOCALCACHE nodexyz.example.net TIPS lroc_set_buffer_desc_tip

Use the following command to display the health status of all monitored LROC components in the cluster
by using --verbose:

# mmhealth cluster show localcache --verbose

Component Node Status Reasons


------------------------------------------------------------------------------------------
LOCALCACHE nodexyz.example.net TIPS lroc_set_buffer_desc_tip
lroc_1 HEALTHY -
lroc_2 HEALTHY -
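
If the lroc_set_buffer_desc_tip event is reported, the maxBufferDescs configuration option that the event
message refers to can be set with the mmchconfig command. The following invocation is an illustrative
sketch; substitute the value and node name that your own event message reports:

mmchconfig maxBufferDescs=12345 -N nodexyz.example.net
mmhealth node show localcache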

Monitoring LROC status using mmdiag command


You can use the mmdiag command to monitor the LROC component and its I/O history.
The following mmdiag commands can be used to monitor the LROC component and I/O history in the cluster.
Use the following command to display LROC statistics.

# mmdiag --lroc

LROC Device(s): '090BD5CD603456D2#/dev/nvme1n1;090BD5CD60354659#/dev/nvme0n1;' status Running


Cache inodes 1 dirs 1 data 1 Config: maxFile -1 stubFile -1
Max capacity: 3051313 MB, currently in use: 3559 MB
Statistics starting from: Tue Feb 23 11:42:32 2021

Inode objects stored 312454 (1220 MB) recalled 157366 (614 MB) = 50.36 %
Inode objects queried 0 (0 MB) = 0.00 % invalidated 157460 (615 MB)
Inode objects failed to store 6 failed to recall 0 failed to query 0 failed to inval 0

Directory objects stored 84 (2 MB) recalled 979 (226 MB) = 1165.48 %


Directory objects queried 0 (0 MB) = 0.00 % invalidated 80 (6 MB)
Directory objects failed to store 0 failed to recall 0 failed to query 0 failed to inval
0 inval no recall 0

Data objects stored 57412 (188807 MB) recalled 10150641 (40597918 MB) = 17680.35 %
Data objects queried 1 (0 MB) = 100.00 % invalidated 57612 (228070 MB)
Data objects failed to store 30 failed to recall 407 failed to query 0 failed to inval 0
inval no recall 54307

agent inserts=1074934, reads=162501285


response times (usec):
insert min/max/avg=3/8030/121
read min/max/avg=1/66717/2919

ssd writeIOs=380060, writePages=48671744


readIOs=10345039194, readPages=10124700092
response times (usec):
write min/max/avg=192/11985/233
read min/max/avg=13/49152/225

Use the following command to monitor the I/O history of the LROC component.

# mmdiag --iohist

=== mmdiag: iohist ===


I/O history:
I/O start time RW Buf type disk:sectorNum nSec time ms Type Device/NSD ID
NSD node
--------------- -- ----------- ----------------- ----- ------- ---- ------------------
---------------
10:24:37.612888 LR data -1:307454520 8192 2163.291 lrc Maj/Min 000000000
10:24:36.392477 LR data -1:307430856 8192 3543.937 lrc Maj/Min 000000000
10:24:36.283254 LR data -1:307429464 8192 3758.281 lrc Maj/Min 000000000
10:24:36.288204 LR data -1:307430160 8192 3912.547 lrc Maj/Min 000000000
10:24:36.407451 LR data -1:307433640 8192 3805.185 lrc Maj/Min 000000000
10:24:36.398778 LR data -1:307431552 8192 3841.871 lrc Maj/Min 000000000
10:24:36.401041 LR data -1:307432248 8192 3842.304 lrc Maj/Min 000000000
10:24:36.404512 LR data -1:307432944 8192 4045.038 lrc Maj/Min 000000000

For more information, see mmdiag command topic in IBM Storage Scale: Command and Programming
Reference Guide.

Monitoring file contents in LROC with mmcachectl command
You can use the mmcachectl command with the --show-lroc option to monitor the file contents in LROC,
as shown in the following example.

# mmcachectl show --device fs1 --inode-num 4038 --show-lroc

FSname  Fileset  Inode  SnapID  FileType  NumOpen    NumDirect  Size        Cached        Cached         Cached
        ID                                Instances  IO         (Total)     (InPagePool)  (InFileCache)  (InLroc)
------------------------------------------------------------------------------------------------------------------
fs1     0        4038   0       file      1          0          2147483648  297795584     F              1526202368
File count: 1

For more information, see mmcachectl command topic in IBM Storage Scale: Command and Programming
Reference Guide.

Chapter 16. Best practices for troubleshooting
Following certain best practices makes the troubleshooting process easier.

How to get started with troubleshooting


Troubleshooting the issues that are reported in the system is easier when you follow the process step-by-
step.
When you experience some issues with the system, go through the following steps to get started with the
troubleshooting:
1. Check the events that are reported on the various nodes of the cluster by using the mmhealth
cluster show and mmhealth node show commands, as shown in the example after this list.
2. Check the user action corresponding to the active events and take the appropriate action. For more
information on the events and corresponding user action, see “Events” on page 559.
3. If you are facing a deadlock issue, see Chapter 19, “Managing deadlocks,” on page 333 to know how to
resolve the issue.
4. Check for events that happened before the event that you are investigating. They might point to
the root cause of the problem. For example, if a node_resumed event appears a minute before an
nfs_in_grace event, the likely root cause is that NFS entered the grace period because the node
resumed after a suspend.
5. Collect the details of the issues through logs, dumps, and traces. You can use various CLI commands
and the Settings > Diagnostic Data GUI page to collect the details of the issues reported in the
system. For more information, see Chapter 18, “Collecting details of the issues,” on page 255.
6. Based on the type of issue, browse through the various topics that are listed in the troubleshooting
section and try to resolve the issue.
7. If you cannot resolve the issue by yourself, contact IBM Support. For more information on how to
contact IBM Support, see Chapter 41, “Support for troubleshooting,” on page 555.
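
For example, a quick first pass over the cluster might look like the following sketch; the component
name is only an illustration, and the command output is omitted here:

# mmhealth cluster show
# mmhealth node show
# mmhealth node show localcache --verbose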

Back up your data


You need to back up data regularly to avoid data loss. It is also recommended that you take backups
before you start troubleshooting. IBM Storage Scale provides various options for creating data backups.
Follow the guidelines in the following sections to avoid issues while creating backups:
• GPFS backup data in IBM Storage Scale: Concepts, Planning, and Installation Guide
• Backup considerations for using IBM Storage Protect in IBM Storage Scale: Concepts, Planning, and
Installation Guide
• Configuration reference for using IBM Storage Protect with IBM Storage Scale in IBM Storage Scale:
Administration Guide
• Protecting data in a file system using backup in IBM Storage Scale: Administration Guide
• Backup procedure with SOBAR in IBM Storage Scale: Administration Guide
The following best practices help you to troubleshoot the issues that might arise in the data backup
process:
1. Enable the most useful messages from the mmbackup command by setting the
MMBACKUP_PROGRESS_CONTENT and MMBACKUP_PROGRESS_INTERVAL environment variables
in the command environment before you issue the mmbackup command. Setting
MMBACKUP_PROGRESS_CONTENT=7 provides the most useful messages; see the example after this
list. For more information on these variables, see the mmbackup command in the IBM Storage Scale:
Command and Programming Reference Guide.

2. If the mmbackup process is failing regularly, enable debug options in the backup process:
Use the DEBUGmmbackup environment variable or the -d option that is available in the mmbackup
command to enable debugging features. This variable controls what debugging features are enabled.
It is interpreted as a bitmask with the following bit meanings:
0x001
Specifies that basic debug messages are printed to STDOUT. There are multiple components that
comprise mmbackup, so the debug message prefixes can vary. Some examples include:

mmbackup:mbackup.sh
DEBUGtsbackup33:

0x002
Specifies that temporary files are to be preserved for later analysis.
0x004
Specifies that all dsmc command output is to be mirrored to STDOUT.
The -d option in the mmbackup command line is equivalent to DEBUGmmbackup=1.
3. To troubleshoot problems with backup subtask execution, enable debugging in the tsbuhelper
program.
Use the DEBUGtsbuhelper environment variable to enable debugging features in the mmbackup
helper program tsbuhelper.
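
The following sketch shows how these variables might be set for a debug run. The file system device
name and the variable values are assumptions for illustration only:

export MMBACKUP_PROGRESS_CONTENT=7      # most detailed progress messages
export MMBACKUP_PROGRESS_INTERVAL=60    # example reporting interval (assumed to be in seconds)
export DEBUGmmbackup=3                  # 0x001 + 0x002: basic debug messages and keep temporary files
mmbackup fs1 -t incremental             # 'fs1' is a hypothetical file system device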

Resolve events in a timely manner


Resolving issues in a timely manner keeps attention on the new and most critical events. If many
alerts are left unfixed, fixing any one event might become more difficult because of the effects of
the other events. You can use either the CLI or the GUI to view the list of issues that are reported
in the system.
You can use the mmhealth node eventlog command to list the events that are reported in the system.
The Monitoring > Events GUI page lists all events reported in the system. You can also mark certain
events as read to change the status of the event in the events view. The status icons become gray in
case an error or warning is fixed or if it is marked as read. Some issues can be resolved by running a fix
procedure. Use the action Run Fix Procedure to do so. The Events page provides a recommendation for
which fix procedure to run next.

Keep your software up to date


Check for new code releases and update your code on a regular basis.
Check the IBM Storage Scale support website to see whether new code releases are available. The
release notes describe the new functions in a release and the issues that are resolved by the new
release. Update your code regularly if the release notes indicate a potential issue.
Note: If a critical problem is detected in the field, IBM might send a flash that advises you to contact
IBM for an efix. Applying the efix might resolve the issue.

Subscribe to the support notification


Subscribe to support notifications so that you are aware of best practices and issues that might affect
your system.
Subscribe to support notifications by visiting the IBM support page on the following IBM website: http://
www.ibm.com/support/mynotifications.
By subscribing, you are informed of new and updated support site information, such as publications, hints
and tips, technical notes, product flashes (alerts), and downloads.

Know your IBM warranty and maintenance agreement details
If you have a warranty or maintenance agreement with IBM, know the details that must be supplied when
you call for support.
For more information on the IBM Warranty and maintenance details, see Warranties, licenses and
maintenance.

Know how to report a problem


If you need help, service, technical assistance, or want more information about IBM products, then you
can find a wide variety of sources available from IBM to assist you.
IBM maintains pages on the web where you can get information about IBM products and fee services,
product implementation and usage assistance, break and fix service support, and the latest technical
information. The following table provides the URLs of the IBM websites where you can find the support
information.

Table 54. IBM websites for help, services, and information


Website Address
IBM home page http://www.ibm.com
Directory of worldwide contacts http://www.ibm.com/planetwide
Support for IBM Storage Scale IBM Storage Scale support website
Support for IBM System Storage® and IBM Total http://www.ibm.com/support/entry/portal/
Storage products product/system_storage/

Note: Available services, telephone numbers, and web links are subject to change without notice.
Before you call
Make sure that you have taken steps to try to solve the problem yourself before you call. Some
suggestions for resolving the problem before calling IBM Support include:
• Check all hardware for issues beforehand.
• Use the troubleshooting information in your system documentation. The troubleshooting section of the
IBM Documentation contains procedures to help you diagnose problems.
To check for technical information, hints, tips, and new device drivers or to submit a request for
information, go to the IBM Storage Scale support website.
Using the documentation
Information about your IBM storage system is available in the documentation that comes with the
product. That documentation includes printed documents, online documents, readme files, and help files
in addition to the IBM Documentation. See the troubleshooting information for diagnostic instructions. To
access this information, go to IBM Storage Scale support website, and follow the instructions. The entire
product documentation is available at IBM Storage Scale documentation.

Other problem determination hints and tips


These hints and tips might be helpful when investigating problems related to logical volumes, quorum
nodes, or system performance that can be encountered while using GPFS.
See these topics for more information:
• “Which physical disk is associated with a logical volume in AIX systems?” on page 248
• “Which nodes in my cluster are quorum nodes?” on page 248
• “What is stored in the /tmp/mmfs directory and why does it sometimes disappear?” on page 249

• “Why does my system load increase significantly during the night?” on page 249
• “What do I do if I receive message 6027-648?” on page 249
• “Why can't I see my newly mounted Windows file system?” on page 249
• “Why is the file system mounted on the wrong drive letter?” on page 250
• “Why does the offline mmfsck command fail with "Error creating internal storage"?” on page 250
• “Questions related to active file management” on page 250

Which physical disk is associated with a logical volume in AIX systems?


Earlier releases of GPFS allowed AIX logical volumes to be used in GPFS file systems. Their use is now
discouraged because they are limited with regard to their clustering ability and cross platform support.
Existing file systems using AIX logical volumes are, however, still supported. This information might be of
use when working with those configurations.
If an error report contains a reference to a logical volume pertaining to GPFS, you can use the lslv -l
command to list the physical volume name. For example, if you want to find the physical disk that is
associated with logical volume gpfs44lv, issue:

lslv -l gpfs44lv

Output is similar to this, with the physical volume name in column one.

gpfs44lv:N/A
PV COPIES IN BAND DISTRIBUTION
hdisk8 537:000:000 100% 108:107:107:107:108

Which nodes in my cluster are quorum nodes?


Use the mmlscluster command to determine which nodes in your cluster are quorum nodes.
Output is similar to this:

GPFS cluster information


========================
GPFS cluster name: cluster2.kgn.ibm.com
GPFS cluster id: 13882489265947478002
GPFS UID domain: cluster2.kgn.ibm.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

Node Daemon node name IP address Admin node name Designation


------------------------------------------------------------------------------
1 k164n04.kgn.ibm.com 198.117.68.68 k164n04.kgn.ibm.com quorum
2 k164n05.kgn.ibm.com 198.117.68.71 k164n05.kgn.ibm.com quorum
3 k164n06.kgn.ibm.com 198.117.68.70 k164n06.kgn.ibm.com

In this example, k164n04 and k164n05 are quorum nodes and k164n06 is a non-quorum node.
To change the quorum status of a node, use the mmchnode command. To change one quorum node to
nonquorum, GPFS does not have to be stopped. If you are changing more than one node at the same
time, GPFS needs to be down on all the affected nodes. GPFS does not have to be stopped when changing
nonquorum nodes to quorum nodes, nor does it need to be stopped on nodes that are not affected.
For example, to make k164n05 a non-quorum node, and k164n06 a quorum node, issue these
commands:

mmchnode --nonquorum -N k164n05


mmchnode --quorum -N k164n06

To set a node's quorum designation at the time that it is added to the cluster, see mmcrcluster or
mmaddnode command in IBM Storage Scale: Command and Programming Reference Guide.

What is stored in the /tmp/mmfs directory and why does it sometimes
disappear?
When GPFS encounters an internal problem, certain state information is saved in the GPFS dump
directory for later analysis by the IBM Support Center.
The default dump directory for GPFS is /tmp/mmfs. This directory might disappear on Linux if cron is
set to run the /etc/cron.daily/tmpwatch script. The tmpwatch script removes files and directories
in /tmp that have not been accessed recently. Administrators who want to use a different directory for
GPFS dumps can change the directory by issuing this command:

mmchconfig dataStructureDump=/name_of_some_other_big_file_system

Note: This state information (possibly large amounts of data in the form of GPFS dumps and traces) can
be dumped automatically as part of the first failure data capture mechanisms of GPFS, and can accumulate
in the directory (by default /tmp/mmfs) that is defined by the dataStructureDump configuration
parameter. It is recommended that a cron job (such as /etc/cron.daily/tmpwatch) be used to
remove dataStructureDump directory data that is older than two weeks, and that such data is collected
(for example, by using gpfs.snap) within two weeks of encountering any problem that requires investigation.
Note: You must not remove the contents of the callhome subdirectory in dataStructureDump. For
example, /tmp/mmfs/callhome. Call Home automatically ensures that it does not take up too much
space in the dataStructureDump directory. If you remove the call home files, the copies of the
latest uploads are prematurely removed, which reduces the usability of the mmcallhome command. For
example, mmcallhome status diff.
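
A minimal sketch of such a cleanup job follows. It assumes the default dataStructureDump location,
a two-week retention period, and a daily run at 03:00, and it deliberately skips the callhome
subdirectory; adjust the path if you changed dataStructureDump:

# Example crontab entry; the schedule and path are assumptions
0 3 * * * find /tmp/mmfs -mindepth 1 -maxdepth 1 ! -name callhome -mtime +14 -exec rm -rf {} +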

Why does my system load increase significantly during the night?


On some Linux distributions, cron runs the /etc/cron.daily/slocate.cron or /etc/cron.daily/
mlocate job every night. These jobs try to index all the files in GPFS, which adds a very large load on the
GPFS token manager.
You can exclude all GPFS file systems by adding gpfs to the excludeFileSystemType list in this script,
or exclude specific GPFS file systems by adding them to the excludeFileSystem list.

/usr/bin/updatedb -f "excludeFileSystemType" -e "excludeFileSystem"

If indexing GPFS file systems is desired, only one node should run the updatedb command and build the
database in a GPFS file system. If the database is built within a GPFS file system, then it is visible on all
nodes after one node finishes building it.
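
On distributions that use mlocate, a similar effect can be achieved through /etc/updatedb.conf. The
following line is a sketch only; the exact variable name and the default list of pruned file system
types depend on the distribution and the locate package in use:

# /etc/updatedb.conf (mlocate): add gpfs to the list of file system types that updatedb skips
PRUNEFS="gpfs nfs nfs4 proc tmpfs"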

What do I do if I receive message 6027-648?


The mmedquota or mmdefedquota commands can fail with message 6027-648: EDITOR environment
variable must be full path name.
To resolve this error, do the following:
1. Change the value of the EDITOR environment variable to an absolute path name, as shown in the
example after this list.
2. Check to see if the EDITOR variable is set in the login profiles such as $HOME/.bashrc and
$HOME/.kshrc. If it is set, check to see if it is an absolute path name because the mmedquota or
mmdefedquota command could retrieve the EDITOR environment variable from that file.
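
For example, the following sketch sets EDITOR to an absolute path before the command is rerun; the
editor path and user name are assumptions:

export EDITOR=/usr/bin/vi
mmedquota -u someuser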

Why can't I see my newly mounted Windows file system?


On Windows, a newly mounted file system might not be visible to you if you are currently logged on to a
system. This can happen if you have mapped a network share to the same drive letter as GPFS.
Once you start a new session (by logging out and logging back in), the use of the GPFS drive letter
supersedes any of your settings for the same drive letter. This is standard behavior for all local file
systems on Windows.

Why is the file system mounted on the wrong drive letter?
Before mounting a GPFS file system, you must be certain that the drive letter required for GPFS is freely
available and is not being used by a local disk or a network-mounted file system on all nodes where the
GPFS file system is mounted.

Why does the offline mmfsck command fail with "Error creating internal
storage"?
The mmfsck command uses temporary storage on the file system manager to hold internal data during a
file system scan. The command fails if GPFS cannot create a temporary file of the required size.
The mmfsck command requires some temporary space on the file system manager for storing internal
data during a file system scan. The internal data is placed in the directory specified by the mmfsck
-t command line parameter (/tmp by default). The amount of temporary space that is needed is
proportional to the number of inodes (used and unused) in the file system that is being scanned. If
GPFS is unable to create a temporary file of the required size, then the mmfsck command fails with the
following error message:

Error creating internal storage

This failure could be caused by:


• The lack of sufficient disk space in the temporary directory on the file system manager
• The lack of sufficient page pool space on the file system manager as shown in mmlsconfig pagepool
output
• A file size limit for the root user that is set too low by the operating system
• The lack of support for large files in the file system that is being used for temporary storage. Some
file systems limit the maximum file size because of architectural constraints. For example, JFS on AIX
does not support files larger than 2 GB, unless the Large file support option has been specified
when the file system was created. Check local operating system documentation for maximum file size
limitations.
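
If /tmp on the file system manager is too small, you can direct the temporary data to a directory in a
larger file system with the -t option. The device name and path in this sketch are hypothetical:

mmfsck fs1 -t /gpfs/bigfs/mmfsck-tmp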

Why do I get timeout executing function error message?


If a command fails with a timeout while executing mmccr, rerun the command to fix the issue. This
timeout is likely related to an increased workload on the system.

Questions related to active file management


Issues and explanations pertaining to active file management.
The following questions are related to active file management (AFM).

How can I change the mode of a fileset?


The mode of an AFM client cache fileset cannot be changed from local-update mode to any other mode;
however, it can be changed from read-only to single-writer (and vice versa), and from either read-only or
single-writer to local-update.
To change the mode, do the following:
1. Ensure that fileset status is active and that the gateway is available.
2. Unmount the file system.
3. Unlink the fileset.
4. Run the mmchfileset command to change the mode (see the example after this list).
5. Mount the file system again.
6. Link the fileset again.
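
The mode change itself (step 4) might look like the following sketch for a hypothetical file system fs1
and fileset cachefset, where the afmmode value sw selects single-writer; check the mmchfileset command
documentation for the exact attribute syntax in your release:

mmchfileset fs1 cachefset -p afmmode=sw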

Why are setuid/setgid bits in a single-writer cache reset at home after data is
appended?
The setuid/setgid bits in a single-writer cache are reset at home after data is appended to files on which
those bits were previously set and synced. This is because over NFS, a write operation to a setuid file
resets the setuid bit.

How can I traverse a directory that has not been cached?


On a fileset whose metadata is not cached in all subdirectories, an application that optimizes directory
traversal by assuming that a directory contains two fewer subdirectories than its hard link count does
not traverse the last subdirectory. One such application is find; on Linux, a workaround is to use
find -noleaf to correctly traverse a directory that has not been cached.

What extended attribute size is supported?


If the gateway node runs an operating system with a Linux kernel version earlier than 2.6.32, the NFS
maximum rsize is 32 KB, so AFM does not support an extended attribute size of more than 32 KB on that
gateway.

What should I do when my file system or fileset is getting full?


The .ptrash directory is present in cache and home. In some cases, where there is a conflict that AFM
cannot resolve automatically, the file is moved to .ptrash at cache or home. In the cache, .ptrash is
cleaned up when eviction is triggered. At home, it is not cleared automatically. When the administrator
needs to free some space, clean up the .ptrash directory first.

Chapter 17. Understanding the system limitations
It is important to understand the system limitations to analyze whether you are facing a real issue in the
IBM Storage Scale system.
The following topics list the IBM Storage Scale system limitations:
AFM limitations
See AFM limitations in IBM Storage Scale: Concepts, Planning, and Installation Guide.
AFM-based DR limitations
See AFM-based DR limitations in IBM Storage Scale: Concepts, Planning, and Installation Guide.
Authentication limitations
See Authentication limitations in IBM Storage Scale: Administration Guide.
Cluster configuration repository (CCR) limitations
See Limitations of CCR in IBM Storage Scale: Administration Guide.
File authorization limitations
See Authorization limitations in IBM Storage Scale: Administration Guide.
File compression limitations
See File compression in IBM Storage Scale: Administration Guide.
FPO limitations
See Restrictions in IBM Storage Scale: Administration Guide.
General NFS V4 Linux Exceptions and Limitations
See General NFS V4 Linux exceptions and limitations in IBM Storage Scale: Administration Guide.
GPFS exceptions and limitations to NFSv4 ACLs
See GPFS exceptions and limitations to NFS V4 ACLs in IBM Storage Scale: Administration Guide.
GUI limitations
See GUI limitations. in IBM Storage Scale: Administration Guide.
HDFS transparency limitations
See Configuration that differs from native HDFS in IBM Storage Scale in IBM Storage Scale: Big Data
and Analytics Guide.
HDFS transparency federation limitations
See Known limitations in IBM Storage Scale: Big Data and Analytics Guide.
Installation toolkit limitations
See Limitations of the IBM Storage Scale installation toolkit in IBM Storage Scale: Concepts, Planning,
and Installation Guide.
mmuserauth service create command limitations
See Limitations of the mmuserauth service create command while configuring AD with RFC2307 in IBM
Storage Scale: Administration Guide.
Multiprotocol export limitations
See Multiprotocol export considerations in IBM Storage Scale: Administration Guide.
Performance monitoring limitations
See Performance monitoring limitations in IBM Storage Scale: Administration Guide.
Protocol cluster disaster recovery limitations
See Protocols cluster disaster recovery limitations in IBM Storage Scale: Administration Guide.
Protocol data security limitations
See Data security limitations in IBM Storage Scale: Administration Guide.
S3 API support limitations
See Managing OpenStack access control lists using S3 API in IBM Storage Scale: Administration Guide.

SMB limitations
See SMB limitations topic in IBM Storage Scale: Administration Guide.
Transparent cloud tiering limitations
See Known limitations of Transparent cloud tiering in IBM Storage Scale: Administration Guide.
Unified file and object access limitations
See Limitations of unified file and object access in IBM Storage Scale: Administration Guide.

Chapter 18. Collecting details of the issues
To start the troubleshooting process, collect details of the issues that are reported in the system.
IBM Storage Scale provides the following options for collecting details:
• Logs
• Dumps
• Traces
• Diagnostic data collection through CLI
• Diagnostic data collection through GUI
Note: It is a good idea not to store collected data in a directory of a GPFS file system that is managed by
the cluster against which you are running the data collection.
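
For example, when you collect diagnostic data with the gpfs.snap command, you can direct the output to
a local, non-GPFS file system with the -d option; the path in this sketch is hypothetical:

gpfs.snap -d /var/tmp/scale-debug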

Collecting details of issues by using logs, dumps, and traces


The problem determination tools that are provided with IBM Storage Scale are intended to be used by
experienced system administrators who know how to collect data and run debugging routines.
You can collect various types of logs such as GPFS logs, protocol service logs, operating system logs, and
transparent cloud tiering logs. The GPFS™ log is a repository of error conditions that are detected on each
node, as well as operational events such as file system mounts. The operating system error log is also
useful because it contains information about hardware failures and operating system or other software
failures that can affect the IBM Storage Scale system.
Note: The GPFS error logs and messages contain the MMFS prefix to distinguish them from the components
of the IBM Multi-Media LAN Server, a related licensed program.
The IBM Storage Scale system also provides a system snapshot dump, trace, and other utilities that can
be used to obtain detailed information about specific problems.
The information is organized as follows:
• “GPFS logs” on page 256
• “Operating system error logs” on page 274
• “Using the gpfs.snap command” on page 295
• “mmdumpperfdata command” on page 310
• “mmfsadm command” on page 312
• “Trace facility” on page 283

Time stamp in GPFS log entries


The timestamp in a GPFS log entry indicates the time of an event.

ISO 8601 timestamp format


By default, the timestamp in logs and traces follows a format similar to the ISO 8601 standard:

YYYY-MM-DD_hh:mm:ss.sss±hhmm

where
YYYY-MM-DD
Is the year, month, and day.
_
Is a separator character.

hh:mm:ss.sss
Is the hours (24-hour clock), minutes, seconds, and milliseconds.
±hhmm
Is the time zone designator, in hours and minutes offset from UTC.
The following examples show the ISO 8601 format:

2016-05-09_15:12:20.603-0500
2016-08-15_07:04:33.078+0200

Logs
This topic describes various logs that are generated in the IBM Storage Scale.

GPFS logs
The GPFS log is a repository of error conditions that are detected on each node, as well as operational
events such as file system mounts. The GPFS log is the first place to look when you start debugging the
abnormal events. As GPFS is a cluster file system, events that occur on one node might affect system
behavior on other nodes and all GPFS logs can have relevant data.
The GPFS log can be found in the /var/adm/ras directory on each node. The GPFS log file is named
mmfs.log.date.nodeName, where date is the time stamp when the instance of GPFS started on the
node and nodeName is the name of the node. The latest GPFS log file can be found by using the symbolic
file name /var/adm/ras/mmfs.log.latest.
The GPFS log from the prior startup of GPFS can be found by using the symbolic file
name /var/adm/ras/mmfs.log.previous. All other files have a time stamp and node name
appended to the file name.
At GPFS startup, log files that have not been accessed during the last 10 days are deleted. If you want to
save old log files, copy them elsewhere.
Many GPFS log messages can be sent to syslog on Linux. The systemLogLevel attribute of the
mmchconfig command determines the GPFS log messages to be sent to the syslog. For more
information, see the mmchconfig command in the IBM Storage Scale: Command and Programming
Reference Guide.
This example shows normal operational messages that appear in the GPFS log file on Linux node:

2022-07-26_13:59:55.090-0400: runmmfs starting (1135)


2022-07-26_13:59:55.096-0400: [I] Removing old /var/adm/ras/mmfs.log.* files:
2022-07-26_13:59:55.212-0400: runmmfs: [I] Unloading modules from /lib/modules/
3.10.0-514.el7.x86_64/extra
2022-07-26_13:59:55.312-0400: runmmfs: [I] Loading modules from /lib/modules/
3.10.0-514.el7.x86_64/extra
Module Size Used by
mmfs26 2842309 0
mmfslinux 824803 1 mmfs26
tracedev 48529 2 mmfs26,mmfslinux
2022-07-26_13:59:56.321-0400: runmmfs: [I] Invoking /usr/lpp/mmfs/bin/mmfsd
2022-07-26_13:59:56.331-0400: [I] This node has a valid advanced license
2022-07-26_13:59:56.330-0400: [I] Initializing the fast condition variables at
0x7F5821FA5E00 ...
2022-07-26_13:59:56.331-0400: [I] mmfsd initializing. {Version: 5.1.4.0 Built: May 20 2022
16:49:25} ...
2022-07-26_13:59:56.353-0400: [I] Tracing in overwrite mode
2022-07-26_13:59:56.353-0400: [I] Cleaning old shared memory ...
2022-07-26_13:59:56.353-0400: [I] First pass parsing mmfs.cfg ...
2022-07-26_13:59:56.353-0400: [I] Enabled automated long waiter detection.
2022-07-26_13:59:56.353-0400: [I] Enabled automated long waiter debug data collection.
2022-07-26_13:59:56.353-0400: [I] Enabled automated expel debug data collection.
2022-07-26_13:59:56.353-0400: [I] Verifying minimum system memory configurations.
2022-07-26_13:59:56.353-0400: [I] The system memory configuration is 64235 MiB
2022-07-26_13:59:56.353-0400: [I] The daemon memory configuration hard floor is 1536 MiB
2022-07-26_13:59:56.353-0400: [I] Initializing the main process ...
2022-07-26_13:59:56.358-0400: [I] CreateLocalConnTab: err 0
2022-07-26_13:59:56.359-0400: [I] Second pass parsing mmfs.cfg ...
2022-07-26_13:59:56.386-0400: [I] Calling parsing /var/mmfs/gen/local.cfg ...

2022-07-26_13:59:56.386-0400: [I] Initializing NUMA support ...
2022-07-26_13:59:56.387-0400: [I] NUMA loaded platform library libnuma.
2022-07-26_13:59:56.387-0400: [I] NUMA BIOS/platform support for NUMA is enabled.
2022-07-26_13:59:56.387-0400: [I] NUMA Cgroup Version V1; is default cgroup? yes; is system
cgroup? yes; Cgroup path: /sys/fs/cgroup/systemd/system.slice/gpfs.service
2022-07-26_13:59:56.387-0400: [I] NUMA mmfsd running on CPUs 0-31 with NUMA policy MPOL_DEFAULT
2022-07-26_13:59:56.387-0400: [I] NUMA discover num numa nodes 2 num numa mem nodes 2
numa_num_configured_cpus 32 get_nprocs_conf 32 numa_max_node 1
2022-07-26_13:59:56.387-0400: [I] NUMA discover system CPUs present 0-31
2022-07-26_13:59:56.387-0400: [I] NUMA discover system CPUs online 0-31
2022-07-26_13:59:56.387-0400: [I] NUMA discover system NUMA nodes has_memory 0-1
2022-07-26_13:59:56.387-0400: [I] NUMA discover system NUMA nodes has_normal_memory 0-1
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset cpus count 32
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset node count 2
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset nodes 0-1
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset node 0 with CPUs 0-7,16-23
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset node 1 with CPUs 8-15,24-31
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset nodes max NUMA distances of 11 for nodes
with vCPUs and 11 for all nodes.
2022-07-26_13:59:56.387-0400: [I] GPFS vCPU limits include all vCPUs that Linux sees as online
or possibly online via hot add, ht/smt changes, etc.
2022-07-26_13:59:56.387-0400: [I] GPFS detected 32 vCPUs.
2022-07-26_13:59:56.387-0400: [I] GPFS detected NUMA Complexity Metric values of 2 for nodes
with vCPUs and 2 for all nodes.
2022-07-26_13:59:56.387-0400: [I] NUMA discover node 0 online CPUs: 0-7,16-23
2022-07-26_13:59:56.387-0400: [I] NUMA discover node 1 online CPUs: 8-15,24-31
2022-07-26_13:59:56.387-0400: [I] NUMA discover system online CPUs: 0-31
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset node 0 (normal) with memory: 32739 MiB,
RDMA devices: No, CPUs: 0-7,16-23.
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset node 1 (normal) with memory: 32768 MiB,
RDMA devices: No, CPUs: 8-15,24-31.
2022-07-26_13:59:56.387-0400: [I] NUMA discover cpuset node -1 (system) with memory: 65507 MiB,
RDMA devices: No, CPUs: 0-31.
2022-07-26_13:59:56.387-0400: [I] Initializing User Counter support ...
2022-07-26_13:59:56.388-0400: [I] User counters CPU limit 2048 CPUs; found 32 system CPUs
2022-07-26_13:59:56.388-0400: [I] Initializing the page pool ...
2022-07-26_14:00:05.983-0400: [I] Initializing the mailbox message system ...
2022-07-26_14:00:05.984-0400: [I] Initializing encryption ...
2022-07-26_14:00:06.020-0400: [I] Encryption: loaded crypto library: GSKit FIPS context (ver:
8.6.0.0).
2022-07-26_14:00:06.020-0400: [I] Encryption key cache expiration time = 0 (cache does not
expire).
2022-07-26_14:00:06.020-0400: [I] Initializing the thread system ...
2022-07-26_14:00:06.020-0400: [I] Creating threads ...
2022-07-26_14:00:06.026-0400: [I] Initializing inter-node communication ...
2022-07-26_14:00:06.027-0400: [I] Creating the main SDR server object ...
2022-07-26_14:00:06.027-0400: [I] Initializing the sdrServ library ...
2022-07-26_14:00:06.029-0400: [I] Initializing the ccrServ library ...allowRemoteConnections=1,
noAuthentication=0
2022-07-26_14:00:06.045-0400: [I] proactiveReconnect is enabled by default
2022-07-26_14:00:06.045-0400: [I] Initializing the cluster manager ...
2022-07-26_14:00:06.458-0400: [I] Initializing the token manager ...
2022-07-26_14:00:06.641-0400: [I] Initializing network shared disks ...
2022-07-26_14:00:06.830-0400: [I] Register client command RPC handlers (15 handlers)
2022-07-26_14:00:07.176-0400: [I] VERBS RDMA not starting because configuration option
verbsRdma is not enabled
2022-07-26_14:00:07.178-0400: [I] Starting CCR server (mmfsd) ...
2022-07-26_14:00:07.189-0400: [D] PFD load: mostRecent: 0 seq: 464621 (8192)
2022-07-26_14:00:07.189-0400: [D] PFD load: nextToWriteIdx: 1 seq: 464620 (8192)
2022-07-26_14:00:07.223-0400: [I] Initializing compression libraries handlers...
2022-07-26_14:00:07.225-0400: [I] Listening for local client connections on fd 6 in pid 1753
2022-07-26_14:00:07.725-0400: [N] Connecting to 192.168.118.154 node0 <c0p0>:[0]
2022-07-26_14:00:07.742-0400: [I] Connected to 192.168.118.154 node0 <c0p0>:[0]
2022-07-26_14:00:07.765-0400: [I] Accepted and connected to 192.168.118.154 node0 <c0p0>:[1]
2022-07-26_14:00:07.858-0400: [I] Node 192.168.118.154 (node0) is now the Group Leader.
2022-07-26_14:00:07.869-0400: [I] Calling user exit script mmClusterManagerRoleChange: event
clusterManagerTakeOver, Async command /usr/lpp/mmfs/bin/mmsysmonc.
2022-07-26_14:00:07.879-0400: [N] mmfsd ready
2022-07-26_14:00:07.882-0400: [I] Calling user exit script mmMountFs: event mount, Async
command /usr/lpp/mmfs/lib/mmsysmon/sendRasEventToMonitor.
2022-07-26_14:00:07.981-0400: mmcommon mmfsup invoked. Parameters: 192.168.118.153
192.168.118.154 all
2022-07-26_14:00:07.995-0400: [I] sendRasEventToMonitor: Successfully sent a file system event
to the monitor. Event code=999306
2022-07-26_14:00:08.126-0400: [I] Accepted and connected to 192.168.118.155 node2 <c0n2>:[0]
2022-07-26_14:00:08.152-0400: [N] Connecting to 192.168.118.155 node2 <c0n2>:[1]
2022-07-26_14:00:08.162-0400: [I] Connected to 192.168.118.155 node2 <c0n2>:[1]
2022-07-26_14:00:08.172-0400: [I] Calling user exit script mmSysMonGpfsStartup: event startup,
Async command /usr/lpp/mmfs/bin/mmsysmoncontrol.
2022-07-26_14:00:08.174-0400: [I] Calling user exit script mmSinceShutdownRoleChange:

event startup, Async command /usr/lpp/mmfs/bin/mmsysmonc.

The mmcommon logRotate command can be used to rotate the GPFS log without shutting down
and restarting the daemon. After the mmcommon logRotate command is issued, /var/adm/ras/
mmfs.log.previous contains the messages that occurred since the previous startup of GPFS or the
last run of the mmcommon logRotate command. The /var/adm/ras/mmfs.log.latest file starts
over at the point in time that the mmcommon logRotate command was run.
Depending on the size and complexity of your system configuration, the amount of time to start GPFS
varies. If you cannot access a file system that is mounted, then examine the log file for error messages.

Creating a master GPFS log file


The GPFS log frequently shows problems on one node that actually originated on another node.
GPFS is a file system that runs on multiple nodes of a cluster. This means that problems originating on one
node of a cluster often have effects that are visible on other nodes. It is often valuable to merge the GPFS
logs in pursuit of a problem. Having accurate time stamps aids the analysis of the sequence of events.
Before following any of the debug steps, IBM suggests that you:
1. Synchronize all clocks of all nodes in the GPFS cluster. If this is not done, and clocks on different nodes
are out of sync, there is no way to establish the real timeline of events occurring on multiple nodes.
Therefore, a merged error log is less useful for determining the origin of a problem and tracking its
effects.
2. Merge and chronologically sort all of the GPFS log entries from each node in the cluster. The --
gather-logs option of the gpfs.snap command can be used to achieve this:

gpfs.snap --gather-logs -d /tmp/logs -N all

The system displays information similar to:

gpfs.snap: Gathering mmfs logs ...


gpfs.snap: The sorted and unsorted mmfs.log files are in /tmp/logs

If the --gather-logs option is not available on your system, you can create your own script to
achieve the same task; use /usr/lpp/mmfs/samples/gatherlogs.samples.sh as an example.
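
A minimal sketch of such a script follows. It assumes passwordless ssh to the nodes, the default log
location, and hypothetical node names; because the timestamps at the start of each line follow the
ISO 8601-like format, a plain lexical sort produces a chronological merge:

#!/bin/sh
# Merge the latest GPFS log from each node and sort the result chronologically (sketch only)
for n in node1 node2 node3; do
    ssh "$n" cat /var/adm/ras/mmfs.log.latest | awk -v node="$n" '{print $0, "[" node "]"}'
done | sort > /tmp/logs/mmfs.log.merged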

Audit messages for cluster configuration changes


As an aid to troubleshooting and to improve cluster security, IBM Storage Scale can send an audit
message to syslog and the GPFS log whenever a GPFS command changes the configuration of the cluster.
You can use the features of syslog to mine, process, or redirect the audit messages.
Restriction: Audit messages are not available on Windows operating systems.

Configuring syslog
On Linux operating systems, syslog typically is enabled by default. On AIX, syslog must be set up and
configured. See the corresponding operating system documentation for details.

Configuring audit messages


By default, audit messages are enabled and messages are sent to syslog but not to the GPFS log. You
can control audit messages with the commandAudit attribute of the mmchconfig command. For more
information, see the topic mmchconfig command in the IBM Storage Scale: Command and Programming
Reference Guide.
Audit messages are not affected by the systemLogLevel attribute of the mmchconfig command.
If audit logs are enabled, the GUI receives the updates on configuration changes that you made through
CLI and updates its configuration cache to reflect the changes in the GUI. You can also disable audit
logging with the mmchconfig command. If the audit logs are disabled, the GUI does not show the
configuration changes immediately. It might be as much as an hour late in reflecting configuration
changes that are made through the CLI.

Message format
For security, sensitive information such as a password is replaced with asterisks (*) in the audit message.
Audit messages are sent to syslog with an identity of mmfs, a facility code of user, and a severity level of
informational. For more information about the meaning of these terms, see the syslog documentation.
The format of the message depends on the source of the GPFS command:
• Messages about GPFS commands that are entered at the command line have the following format:

CLI user_name user_name [AUDIT_TYPE1,AUDIT_TYPE2] 'command' RC=return_code

where:
CLI
The source of the command. Indicates that the command was entered from the command line.
user_name user_name
The name of the user who entered the command, such as root. The same name appears twice.
AUDIT_TYPE1
The point in the command when the message was sent to syslog. Always EXIT.
AUDIT_TYPE2
The action taken by the command. Always CHANGE.
command
The text of the command.
return_code
The return code of the GPFS command.
• Messages about GPFS commands that are issued by GUI commands have a similar format:

GUI-CLI user_name GUI_user_name [AUDIT_TYPE1,AUDIT_TYPE2] 'command' RC=return_code

where:
GUI-CLI
The source of the command. Indicates that the command was called by a GUI command.
user_name
The name of the user, such as root.
GUI_user_name
The name of the user who logged on to the GUI.
The remaining fields are the same as in the CLI message.
The following lines are examples from a syslog:

Apr 24 13:56:26 c12c3apv12 mmfs[63655]: CLI root root [EXIT, CHANGE] 'mmchconfig
autoload=yes' RC=0
Apr 24 13:58:42 c12c3apv12 mmfs[65315]: CLI root root [EXIT, CHANGE] 'mmchconfig
deadlockBreakupDelay=300' RC=0
Apr 24 14:04:47 c12c3apv12 mmfs[67384]: CLI root root [EXIT, CHANGE] 'mmchconfig
FIPS1402mode=no' RC=0

The following lines are examples from a syslog where GUI is the originator:

Apr 24 13:56:26 c12c3apv12 mmfs[63655]: GUI-CLI root admin [EXIT, CHANGE] 'mmchconfig
autoload=yes' RC=0
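
For example, on a Linux node you might extract the configuration-change audit entries from syslog with
something like the following sketch; the log file path is an assumption, and journald-based systems use
journalctl instead:

grep 'mmfs\[' /var/log/messages | grep 'EXIT, CHANGE'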

Commands
IBM Storage Scale sends audit messages to syslog for the following commands and options:
mmaddcallback
mmadddisk
mmaddnode
mmafmconfig add
mmafmconfig delete
mmafmconfig disable
mmafmconfig enable
mmafmconfig update
mmafmcosaccess
mmafmcosconfig
mmafmcosctl
mmafmcoskeys
mmafmctl
mmapplypolicy
mmaudit
mmauth add
mmauth delete
mmauth deny
mmauth gencert
mmauth genkey
mmauth grant
mmauth update
mmbackup
mmbackupconfig
mmces address add
mmces address change
mmces address move
mmces address remove
mmces log
mmces node resume
mmces node suspend
mmces service disable
mmces service enable
mmces service start
mmces service stop
mmcesmonitor
mmchcluster
mmchconfig
mmchdisk
mmchfileset
mmchfs
mmchlicense
mmchmgr
mmchnode
mmchnodeclass
mmchnsd
mmchpolicy
mmchpool

mmchqos
mmcloudgateway account create
mmcloudgateway account delete
mmcloudgateway account update
mmcloudgateway config set
mmcloudgateway config unset
mmcloudgateway files delete
mmcloudgateway files migrate
mmcloudgateway files recall
mmcloudgateway files reconcile
mmcloudgateway files restore
mmcloudgateway filesystem create
mmcloudgateway filesystem delete
mmcloudgateway service start
mmcloudgateway service stop
mmcrcluster
mmcrfileset
mmcrfs
mmcrnodeclass
mmcrnsd
mmcrsnapshot
mmdefedquota
mmdefquotaoff
mmdefquotaon
mmdefragfs
mmdelcallback
mmdeldisk
mmdelfileset
mmdelfs
mmdelnode
mmdelnodeclass
mmdelnsd
mmdelsnapshot
mmedquota
mmexpelnode
mmexportfs
mmfsctl
mmimgbackup
mmimgrestore
mmimportfs
mmkeyserv
mmlinkfileset
mmmigratefs
mmnfs config change
mmnfs export add
mmnfs export change
mmnfs export load
mmnfs export remove
mmnsddiscover
mmobj config change
mmobj file access

mmobj multiregion enable
mmobj multiregion export
mmobj multiregion import
mmobj multiregion remove
mmobj policy change
mmobj policy create
mmobj policy deprecate
mmobj swift base
mmperfmon config add
mmperfmon config delete
mmperfmon config generate
mmperfmon config update
mmpsnap create
mmpsnap delete
mmquotaoff
mmquotaon
mmremotecluster add
mmremotecluster delete
mmremotecluster update
mmremotefs add
mmremotefs delete
mmremotefs update
mmrestoreconfig
mmrestorefs
mmrestripefile
mmrestripefs
mmrpldisk
mmsdrrestore
mmsetquota
mmshutdown
mmsmb config change
mmsmb export add
mmsmb export change
mmsmb export remove
mmsmb exportacl add
mmsmb exportacl change
mmsmb exportacl delete
mmsmb exportacl remove
mmsmb exportacl replace
mmsnapdir
mmstartup
mmumount
mmunlinkfileset
mmuserauth service create
mmuserauth service remove
mmwinservctl
mmwatch

Protocol services logs
The protocol service logs contain the information that helps you to troubleshoot the issues related to the
NFS, SMB, and Object services.
By default, the NFS protocol logs are stored at /var/log. The SMB protocol logs are
stored at: /var/log/messages. The Object protocol logs are stored in the directories for the
service: /var/log/swift, /var/log/keystone/, and /var/log/httpd.
For more information on logs of the installation toolkit, see Logging and debugging for installation toolkit in
IBM Storage Scale: Concepts, Planning, and Installation Guide.

SMB logs
The SMB services write the most important messages to syslog.
The SMB service in IBM Storage Scale writes its log messages to the syslog of the CES nodes. Therefore, it
needs a working syslog daemon and configuration. An SMB snap expects the syslog on CES nodes to be found
in a file in the distribution's default paths. If syslog is redirected to another location, provide those
logs when you contact IBM Support.
With the standard syslog configuration, you can search for the terms such as ctdbd or smbd in
the /var/log/messages file to see the relevant logs. For example:
grep ctdbd /var/log/messages
The system displays output similar to the following example:
May 31 09:11:23 prt002st001 ctdbd: Updated hot key database=locking.tdb key=0x2795c3b1 id=0 hop_count=1
May 31 09:27:33 prt002st001 ctdbd: Updated hot key database=smbXsrv_open_global.tdb key=0x0d0d4abe id=0 hop_count=1
May 31 09:37:17 prt002st001 ctdbd: Updated hot key database=brlock.tdb key=0xc37fe57c id=0 hop_count=1

grep smbd /var/log/messages


The system displays output similar to the following example:
May 31 09:40:58 prt002st001 smbd[19614]: [2015/05/31 09:40:58.357418, 0] ../source3
/lib/dbwrap/dbwrap_ctdb.c:962(db_ctdb_record_destr)
May 31 09:40:58 prt002st001 smbd[19614]: tdb_chainunlock on db /var/lib/ctdb/locking.tdb.2,
key FF5B87B2A3FF862E96EFB400000000000000000000000000 took 5.261000 milliseconds
May 31 09:55:26 prt002st001 smbd[1431]: [2015/05/31 09:55:26.703422, 0] ../source3
/lib/dbwrap/dbwrap_ctdb.c:962(db_ctdb_record_destr)
May 31 09:55:26 prt002st001 smbd[1431]: tdb_chainunlock on db /var/lib/ctdb/locking.tdb.2,
key FF5B87B2A3FF862EE5073801000000000000000000000000 took 17.844000 milliseconds

Additional SMB service logs are available in following folders:


• /var/adm/ras/log.smbd
• /var/adm/ras/log.smbd.old
When the log.smbd file reaches 100 MB in size, the system renames it to log.smbd.old. To capture
more detailed traces for problem determination, use the mmprotocoltrace command, as shown in the
following example.
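
For example, a trace that is limited to a single client might look like the following sketch; the client
IP address is hypothetical, and the exact options are described in the mmprotocoltrace command
documentation:

mmprotocoltrace start smb -c 203.0.113.15
# reproduce the problem from that client, then stop the trace
mmprotocoltrace stop smb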
Some of the issues with SMB services are related to winbind service also. For more information about
winbind tracing, see “Winbind logs” on page 269.
Related concepts
“Determining the health of integrated SMB server” on page 452
There are some IBM Storage Scale commands to determine the health of the SMB server.

NFS logs
The clustered export services (CES) NFS server writes log messages in the /var/log/ganesha.log file
at runtime.
The operating system's log rotation facility is used to manage the NFS logs. The NFS logs are configured
and enabled during the installation of the NFS server packages.
The following example shows a sample log file:
# tail -f /var/log/ganesha.log
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs4_start_grace
:STATE :EVENT :NFS Server Now IN GRACE,
duration 59
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT
:Callback creds directory (/var/run/ganesha) already exists
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_rpc_cb_init_ccache
:NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328378:0)
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[reaper]
nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_start :NFS STARTUP :EVENT :-------------------------------------------------

Log levels can be displayed by using the mmnfs config list | grep LOG_LEVEL command. For
example:

mmnfs config list | grep LOG_LEVEL

The system displays output similar to the following example:

LOG_LEVEL: EVENT

By default, the log level is EVENT. Additionally, the following NFS log levels can also be used, listed from
lowest to highest verbosity:
• FATAL
• MAJ
• CRIT
• WARN
• EVENT
• INFO
• DEBUG
• MID_DEBUG
• FULL_DEBUG
Note: The FULL_DEBUG level increases the size of the log file. Use it in production only if instructed by
IBM Support.
Increasing the verbosity of the NFS server log impacts the overall NFS I/O performance.
To change the logging to the verbose log level INFO, use the following command:
mmnfs config change LOG_LEVEL=INFO
The system displays output similar to the following example:

NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS
server is running.

This change is cluster-wide and restarts all NFS instances to activate this setting. The log file now displays
more informational messages, for example:
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_rpc_dispatch_threads
:THREAD :INFO :5 rpc dispatcher threads were started successfully
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[disp] rpc_dispatcher_thread

:DISP :INFO :Entering nfs/rpc dispatcher
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[disp] rpc_dispatcher_thread
:DISP :INFO :Entering nfs/rpc dispatcher
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[disp] rpc_dispatcher_thread
:DISP :INFO :Entering nfs/rpc dispatcher
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[disp] rpc_dispatcher_thread
:DISP :INFO :Entering nfs/rpc dispatcher
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_Start_threads
:THREAD :EVENT :gsh_dbusthread was started successfully
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_Start_threads
:THREAD :EVENT :admin thread was started successfully
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_Start_threads
:THREAD :EVENT :reaper thread was started successfully
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_Start_threads
:THREAD :EVENT :General fridge was started successfully
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[reaper] nfs_in_grace
:STATE :EVENT :NFS Server Now IN GRACE
2015-06-03 12:49:32 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_start
:NFS STARTUP :EVENT :-------------------------------------------------
2015-06-03 12:49:32 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_start
:NFS STARTUP :EVENT : NFS SERVER INITIALIZED
2015-06-03 12:49:32 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_start
:NFS STARTUP :EVENT :-------------------------------------------------
2015-06-03 12:50:32 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[reaper] nfs_in_grace
:STATE :EVENT :NFS Server Now NOT IN GRACE

Enabling logrotate for NFS


To enable logrotate for NFS based on the log file size, complete the following steps:
1. Modify the default NFS logrotate configuration located at /etc/logrotate.d/ganesha with the
following changes:
a. Add the size option with an appropriate size value, such as 100M or 1G, that indicates the
maximum size for log rotation.
b. Remove the weekly or daily option, as it is not required for the size-based rotation.
c. Replace the dateext option with dateformat -%Y%m%d%H%M%S to specify the desired date
format for rotated log files.
For example,

# cat /etc/logrotate.d/ganesha
/var/log/ganesha.log {
size 10M
rotate 52
copytruncate
dateformat -%Y%m%d%H%M%S
compress
missingok
}

2. Add the logrotate service to the crontab by issuing the following command:

# crontab -u <user> -e

For example, to add the logrotate service for the root user and set it to run every 5 minutes, open the
root crontab and add the following entry:

# crontab -u root -e

*/5 * * * * /etc/cron.daily/logrotate

After these steps, logrotate is configured to rotate the nfs-ganesha log file based on the specified
size, and the logrotate service is scheduled to run at a desired frequency.
To display the currently configured CES log level, use the following command:
mmces log level
The system displays output similar to the following example:

CES log level is currently set to 0

The log file is /var/adm/ras/mmfs.log.latest. By default, the log level is 0 and other possible values
are 1, 2, and 3. To increase the log level, use the following command:
mmces log level 1

NFS-related log information is written to the standard GPFS log files as part of the overall CES
infrastructure. This information relates to the NFS service management and recovery orchestration within
CES.

Object logs
There are a number of locations where messages are logged with the object protocol.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
The core object services (proxy, account, container, and object server) have their own logging levels set
in their respective configuration files. By default, unified file and object access logging is set to show
messages at or beyond the ERROR level, but it can be changed to the INFO or DEBUG level if more detailed
logging information is needed.
By default, the messages logged by these services are saved in the /var/log/swift directory.
You can also configure these services to use separate syslog facilities by using the log_facility
parameter in one or all of the object service configuration files and by updating the rsyslog
configuration. These parameters are described in the Swift Deployment Guide (docs.openstack.org/
developer/swift/deployment_guide.html) that is available in the OpenStack documentation.
An example of how to set up this configuration can be found in the SAIO - Swift All In
One documentation (docs.openstack.org/developer/swift/development_saio.html#optional-setting-up-
rsyslog-for-individual-logging) that is available in the OpenStack documentation.
Note: To configure rsyslog for unique log facilities in the protocol nodes, the administrator needs to make
sure that the manual steps mentioned in the preceding link are carried out on each of those protocol
nodes.
The Keystone authentication service writes its logging messages to /var/log/keystone/
keystone.log file. By default, Keystone logging is set to show messages at or beyond the WARNING
level.
For information on how to view or change log levels on any of the object-related services, see the “CES
tracing and debug data collection” on page 287 section.
The following commands can be used to determine the health of object services:
• To see whether there are any nodes in an active (failed) state, run the following command:
mmces state cluster OBJ
The system displays output similar to the following output:

NODE COMPONENT STATE EVENTS


prt001st001 OBJECT HEALTHY
prt002st001 OBJECT HEALTHY
prt003st001 OBJECT HEALTHY
prt004st001 OBJECT HEALTHY
prt005st001 OBJECT HEALTHY

prt006st001 OBJECT HEALTHY
prt007st001 OBJECT HEALTHY

No active events are shown because the nodes are healthy.


• Run the following command to display the history of events generated by the monitoring framework:
mmces events list OBJ
The system displays output similar to the following output:
Node Timestamp Event Name Severity Details
node1 2015-06-03 13:30:27.478725+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-03 14:26:30.567245+08:08PDT object-server_ok INFO object process as expected
node1 2015-06-03 14:26:30.720534+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-03 14:28:30.689257+08:08PDT account-server_ok INFO account process as expected
node1 2015-06-03 14:28:30.853518+08:08PDT container-server_ok INFO container process as expected
node1 2015-06-03 14:28:31.015307+08:08PDT object-server_ok INFO object process as expected
node1 2015-06-03 14:28:31.177589+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-03 14:28:49.025021+08:08PDT postIpChange_info INFO IP addresses modified 192.167.12.21_0-_1.
node1 2015-06-03 14:28:49.194499+08:08PDT enable_Address_database_node INFO Enable Address Database Node
node1 2015-06-03 14:29:16.483623+08:08PDT postIpChange_info INFO IP addresses modified 192.167.12.22_0-_2.
node1 2015-06-03 14:29:25.274924+08:08PDT postIpChange_info INFO IP addresses modified 192.167.12.23_0-_3.
node1 2015-06-03 14:29:30.844626+08:08PDT postIpChange_info INFO IP addresses modified 192.167.12.24_0-_4.

• To retrieve the OBJ-related event entries, query the monitor client and grep for the name of the
component that you want to filter on. The component is object, proxy, account, container, Keystone, or
postgres. To see proxy-server related events, run the following command:

mmces events list | grep proxy

The system displays output similar to the following output:


node1 2015-06-01 14:39:49.120912+08:08PDT proxy-server_failed ERROR proxy process should be started but is
stopped
node1 2015-06-01 14:44:49.277940+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-01 16:27:37.923696+08:08PDT proxy-server_failed ERROR proxy process should be started but is
stopped
node1 2015-06-01 16:40:39.789920+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-03 13:28:18.875566+08:08PDT proxy-server_failed ERROR proxy process should be started but is
stopped
node1 2015-06-03 13:30:27.478725+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-03 13:30:57.482977+08:08PDT proxy-server_failed ERROR proxy process should be started but is
stopped
node1 2015-06-03 14:26:30.720534+08:08PDT proxy-server_ok INFO proxy process as expected
node1 2015-06-03 14:27:00.759696+08:08PDT proxy-server_failed ERROR proxy process should be started but is
stopped
node1 2015-06-03 14:28:31.177589+08:08PDT proxy-server_ok INFO proxy process as expected

• To check the monitor log, grep for the component you want to filter on, either object, proxy, account,
container, keystone or postgres. For example, to see object-server related log messages:
grep object /var/adm/ras/mmsysmonitor.log | head -n 10
The system displays output similar to the following output:
2015-06-03T13:59:28.805-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ running command
'systemctl status openstack-swift-proxy'
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ command result
ret:3 sout:openstack-swift-proxy.service - OpenStack Object Storage (swift) - Proxy Server
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com I:522632:Thread-9:object:OBJ openstack-swift-proxy is not
started, ret3
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJProcessMonitor openstack-swift-proxy
failed:
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJProcessMonitor memcached started
2015-06-03T13:59:28.917-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ running command
'systemctl status memcached'
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ command result
ret:0 sout:memcached.service - Memcached
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com I:522632:Thread-9:object:OBJ memcached is started and active
running
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJProcessMonitor memcached succeeded
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com I:522632:Thread-9:object:OBJ service started checks
after monitor loop, event count:6

The following tables list the IBM Storage Scale for object storage log files.

Table 55. Core object log files in /var/log/swift

Log file                                               Component                             Configuration file
account-auditor.log, account-auditor.error             Account auditor Swift service         account-server.conf
account-reaper.log, account-reaper.error               Account reaper Swift service          account-server.conf
account-replicator.log, account-replicator.error       Account replicator Swift service      account-server.conf
account-server.log, account-server.error               Account server Swift service          account-server.conf
container-auditor.log, container-auditor.error         Container auditor Swift service       container-server.conf
container-replicator.log, container-replicator.error   Container replicator Swift service    container-server.conf
container-server.log, container-server.error           Container server Swift service        container-server.conf
container-updater.log, container-updater.error         Container updater Swift service       container-server.conf
object-auditor.log, object-auditor.error               Object auditor Swift service          object-server.conf
object-expirer.log, object-expirer.error               Object expiring Swift service         object-expirer.conf
object-replicator.log, object-replicator.error         Object replicator Swift service       object-server.conf
object-server.log, object-server.error                 Object server Swift service           object-server.conf, object-server-sof.conf
object-updater.log, object-updater.error               Object updater Swift service          object-server.conf
proxy-server.log, proxy-server.error                   Proxy server Swift service            proxy-server.conf

Table 56. Extra object log files in /var/log/swift

Log file                                      Component                                           Configuration file
ibmobjectizer.log, ibmobjectizer.error        Unified file and object access objectizer service   spectrum-scale-objectizer.conf, spectrum-scale-object.conf
policyscheduler.log, policyscheduler.error    Object storage policies                             spectrum-scale-object-policies.conf
swift.log, swift.error                        Performance metric collector (pmswift)

Table 57. General system log files in /var/adm/ras

Log file            Component
mmsysmonitor.log    Use the log for everything that is monitored in the monitoring framework.
mmfs.log            Use the log for various command loggings.

Winbind logs
The winbind services write the most important messages to syslog.
When using Active Directory, the most important messages are written to syslog, similar to the logs in
SMB protocol. For example:
grep winbindd /var/log/messages
The system displays output similar to the following example:
Jun 3 12:04:34 prt001st001 winbindd[14656]: [2015/06/03 12:04:34.271459, 0] ../lib/util/become_daemon.c:124(daemon_ready)
Jun 3 12:04:34 prt001st001 winbindd[14656]: STATUS=daemon 'winbindd' finished starting up and ready to serve connections

Additional logs are available in /var/adm/ras/log.winbindd* and /var/adm/ras/log.wb*. There
are multiple files that are rotated with the "old" suffix when the size reaches 100 MB.

To capture debug traces for Active Directory authentication, use mmprotocoltrace command for the
winbind component. To start the tracing of winbind component, issue this command:
mmprotocoltrace start winbind
After performing all steps, relevant for the trace, issue this command to stop tracing winbind component
and collect tracing data from all participating nodes:
mmprotocoltrace stop winbind
Note: There must be only one active trace. If you start multiple traces, you may need to remove the
previous data by using the mmprotocoltrace clear winbind command.
Related concepts
“Determining the health of integrated SMB server” on page 452

There are some IBM Storage Scale commands to determine the health of the SMB server.

The IBM Storage Scale HDFS transparency log


In IBM Storage Scale HDFS transparency, all logs are recorded using log4j. The log4j.properties
file is under /var/mmfs/hadoop/etc/hadoop for HDFS Transparency version 3.x and
under /usr/lpp/mmfs/hadoop/etc/hadoop for HDFS Transparency version 2.7.x.
By default, the logs are written under:
• /var/log/hadoop/root for HDFS Transparency version 2.6.x and later for HDP distribution
• /var/log/transparency for HDFS Transparency version 3.x for Open Source Apache distribution
• /usr/lpp/mmfs/hadoop/logs for HDFS Transparency version 2.7.x for Open Source Apache
distribution
The following entries can be added into the log4j.properties file to turn on the debugging
information:

log4j.logger.org.apache.hadoop.yarn=DEBUG
log4j.logger.org.apache.hadoop.hdfs=DEBUG
log4j.logger.org.apache.hadoop.gpfs=DEBUG
log4j.logger.org.apache.hadoop.security=DEBUG

Protocol authentication log files


The log files pertaining to protocol authentication are described here.

Table 58. Authentication log files

Keystone
   Log configuration files: /etc/keystone/keystone.conf and /etc/keystone/logging.conf
   Log files: /var/log/keystone/keystone.log, /var/log/keystone/httpd-error.log, and /var/log/keystone/httpd-access.log
   Logging levels: In keystone.conf, change:
   1. debug = true - to get debugging information in the log file.
   2. verbose = true - to get Info messages in the log file.
   By default, these values are false and only warning messages are logged.
   Finer-grained control of Keystone logging levels can be specified by updating the Keystone
   logging.conf file. For information on the logging levels in the logging.conf file, see the
   OpenStack logging.conf documentation (docs.openstack.org/kilo/config-reference/content/
   section_keystone-logging.conf.html).

SSSD
   Log configuration file: /etc/sssd/sssd.conf
   Log files: /var/log/sssd/sssd.log, /var/log/sssd/sssd_nss.log, /var/log/sssd/sssd_LDAPDOMAIN.log
   (depends upon configuration), and /var/log/sssd/sssd_NISDOMAIN.log (depends upon configuration)
   Note: For more information on SSSD log files, see Red Hat® Linux documentation.
   Logging levels:
   0x0010: Fatal failures. Issue with invoking or running SSSD.
   0x0020: Critical failures. SSSD does not stop functioning. However, this error indicates that at
   least one major feature of SSSD is not working properly.
   0x0040: Serious failures. A particular request or operation has failed.
   0x0080: Minor failures. These are the errors that would percolate down to cause the operation
   failure of 2.
   0x0100: Configuration settings.
   0x0200: Function data.
   0x0400: Trace messages for operation functions.
   0x1000: Trace messages for internal control functions.
   0x2000: Contents of function-internal variables that might be interesting.
   0x4000: Extremely low-level tracing information.
   Note: For more information on SSSD log levels, see Troubleshooting SSSD in Red Hat Enterprise
   Linux documentation.

Winbind
   Log configuration file: /var/mmfs/ces/smb.conf
   Log files: /var/adm/ras/log.wb-<DOMAIN> (depends upon available domains),
   /var/adm/ras/log.winbindd-dc-connect, /var/adm/ras/log.winbindd-idmap, and /var/adm/ras/log.winbindd
   Logging levels: The log level is an integer with a value from 0 to 10. The default log level is 1.

Note: Some of the authentication modules, such as the Keystone services, also log information
in /var/log/messages.

If you change the log levels, the respective authentication service must be restarted manually on each
protocol node. Restarting authentication services might result in disruption of protocol I/O.
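For example, a minimal sketch for an environment that uses SSSD-based authentication (this assumes SSSD is the configured method; adapt the service name to your setup), run on each protocol node after the log level change:

# systemctl restart sssd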

CES monitoring and troubleshooting


You can monitor system health, query events, and perform maintenance and troubleshooting tasks in
Cluster Export Services (CES).

System health monitoring


Each CES node runs a separate GPFS process that monitors the network address configuration of the
node. If a conflict between the network interface configuration of the node and the current assignments of
the CES address pool is found, corrective action is taken. If the node is unable to detect an address that is
assigned to it, the address is reassigned to another node.
Additional monitors check the state of the services that are implementing the enabled protocols on the
node. These monitors cover NFS, SMB, Object, and Authentication services that monitor, for example,
daemon liveliness and port responsiveness. If it is determined that any enabled service is not functioning
correctly, the node is marked as failed and its CES addresses are reassigned. When the node returns
to normal operation, it returns to the normal (healthy) state and is available to host addresses in the CES
address pool.
An additional monitor runs on each protocol node if Microsoft Active Directory (AD), Lightweight Directory
Access Protocol (LDAP), or Network Information Service (NIS) user authentication is configured. If a
configured authentication server does not respond to test requests, GPFS marks the affected node as
failed.

Querying state and events


Aside from the automatic failover and recovery of CES addresses, two additional outputs are provided by
the monitoring that can be queried: events and state.
State can be queried by entering the mmces state show command, which shows you the state of each
of the CES components. The possible states for a component follow:
HEALTHY
The component is working as expected.
DISABLED
The component has not been enabled.
SUSPENDED
When a CES node is in the suspended state, most components also report suspended.
STARTING
The component (or monitor) recently started. This state is a transient state that is updated after the
startup is complete.
UNKNOWN
Something is preventing the monitoring from determining the state of the component.
STOPPED
The component was intentionally stopped. This situation might happen briefly if a service is being
restarted due to a configuration change. It might also happen because a user ran the mmces
service stop protocol command for a node.
DEGRADED
There is a problem with the component but not a complete failure. This state does not cause the CES
addresses to be reassigned.
FAILED
The monitoring detected a significant problem with the component that means it is unable to function
correctly. This state causes the CES addresses of the node to be reassigned.

DEPENDENCY_FAILED
This state implies that a component has a dependency that is in a failed state. An example would be
NFS or SMB reporting DEPENDENCY_FAILED because the authentication failed.
Looking at the states themselves can be useful to find out which component is causing a node to fail and
have its CES addresses reassigned. To find out why the component is being reported as failed, you can
look at events.
The mmces events command can be used to show you either events that are currently causing a
component to be unhealthy or a list of historical events for the node. If you want to know why a
component on a node is in a failed state, use the mmces events active invocation. This command
gives you a list of any currently active events that are affecting the state of a component, along with a
message that describes the problem. This information should provide a place to start when you are trying
to find and fix the problem that is causing the failure.
If you want to get a complete idea of what is happening with a node over a longer time period, use the
mmces events list invocation. By default, this command prints a list of all events that occurred on
this node, with a time stamp. This information can be narrowed down by component, time period, and
severity. As well as being viewable with the command, all events are also pushed to the syslog.
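For example, the following illustrative sequence combines these queries when investigating a failed component on a CES node:

# Show the state of each CES component on this node
mmces state show
# List the currently active events that explain a FAILED component
mmces events active
# Review the full event history for the node
mmces events list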

Maintenance and troubleshooting


A CES node can be marked as unavailable by the monitoring process. The command mmces node list
can be used to show the nodes and the current state flags that are associated with it. When unavailable
(one of the following node flags are set), the node does not accept CES address assignments. The
following possible node states can be displayed:
Suspended
Indicates that the node is suspended with the mmces node suspend command. When suspended,
health monitoring on the node is discontinued. The node remains in the suspended state until it is
resumed with the mmces node resume command.
Network-down
Indicates that monitoring found a problem that prevents the node from bringing up the CES addresses
in the address pool. The state reverts to normal when the problem is corrected. Possible causes
for this state are missing or non-functioning network interfaces and network interfaces that are
reconfigured so that the node can no longer host the addresses in the CES address pool.
No-shared-root
Indicates that the CES shared root directory cannot be accessed by the node. The state reverts to
normal when the shared root directory becomes available. Possible cause for this state is that the file
system that contains the CES shared root directory is not mounted.
Failed
Indicates that monitoring found a problem with one of the enabled protocol servers. The state reverts
to normal when the server returns to normal operation or when the service is disabled.
Starting up
Indicates that the node is starting the processes that are required to implement the CES services that
are enabled in the cluster. The state reverts to normal when the protocol servers are functioning.
Additionally, events that affect the availability and configuration of CES nodes are logged in the GPFS
log file /var/adm/ras/mmfs.log.latest. The verbosity of the CES logging can be changed with the
mmces log level n command, where n is a number from 0 (less logging) to 4 (more logging). The
current log level can be viewed with the mmlscluster --ces command.
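For example, the following illustrative sequence uses the commands described above to inspect node flags, temporarily take a node out of CES address assignment, and adjust logging verbosity (the log level value 2 is only an example):

# Show CES nodes and their current state flags
mmces node list
# Suspend the local node, then resume it after maintenance
mmces node suspend
mmces node resume
# Increase CES log verbosity and confirm the current level
mmces log level 2
mmlscluster --ces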

Error displayed when an IP address is removed from the ces_group


When the user upgrades to IBM Storage Scale version 5.0.5 or higher, the system displays an error
message when the user removes an IP from the CES group.
The following error message is displayed by the system:

[root@st1clscale101p ~]# mmces address change --ces-ip <ip address> --remove-group


No cidr objects in cidr pool.

mmces address change: Command failed. Examine previous error messages to determine cause.
mmces address change: Command failed. Examine previous error messages to determine cause.

You can resolve this issue by removing the CES IP addresses and groups and adding them again after the
upgrade, as follows:
1. Remove all the CES IP addresses and then add them again.
Then, the cidrPool entry is created.
2. Run the mmlsconfig command to verify whether the IP address entry exists or not.
3. Run the following command to remove the group:

mmces address change --ces-ip 192.168.0.100 --remove-group External

If a customer wants to use IPv6, the following steps must be taken:

1. The Interface mode NIC must be defined.
2. All CES IP addresses must be removed.
3. The mode must be switched to Interface mode, and then the IP addresses must be added again. This
is done because CIDR is not an IP address.

Operating system error logs


GPFS records file system or disk failures by using the error logging facility provided by the operating
system: syslog facility on Linux, errpt facility on AIX, and Event Viewer on Windows.
The error logging facility is referred to as the error log regardless of operating system specific error log
facility naming conventions.
Note: Most logs use the UNIX command logrotate to tidy up older logs. Not all options of the command
are supported on some older operating systems. This could lead to unnecessary log entries. However, it
does not interfere with the script. While using logrotate you might come across the following errors:
• error opening /var/adm/ras/mmsysmonitor.log:Too many levels of symbolic
links.
• unknown option 'maxsize' -- ignoring line.
This is the expected behavior and the error can be ignored.
Failures in the error log can be viewed by issuing this command on an AIX node:

errpt -a

and this command on a Linux node:

grep "mmfs:" /var/log/messages

You can also grep the appropriate filename where syslog messages are redirected to. For example, in
Ubuntu, after the Natty release, this file is at /var/log/syslog.
On Windows, use the Event Viewer and look for events with a source label of GPFS in the Application
event category.
On Linux, syslog might include GPFS log messages and the error logs described in this section. The
systemLogLevel attribute of the mmchconfig command controls which GPFS log messages are sent
to syslog. It is recommended that some kind of monitoring for GPFS log messages be implemented,
particularly MMFS_FSSTRUCT errors. For more information, see the mmchconfig command in the IBM
Storage Scale: Command and Programming Reference Guide.
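For example, the following is a hedged illustration of adjusting that attribute; the value error is only an example, so choose the level that matches your monitoring policy:

mmchconfig systemLogLevel=error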
The error log contains information about several classes of events or errors. These classes are:
• “MMFS_ABNORMAL_SHUTDOWN” on page 275
• “MMFS_DISKFAIL” on page 275

• “MMFS_ENVIRON” on page 275
• “MMFS_FSSTRUCT” on page 275
• “MMFS_GENERIC” on page 275
• “MMFS_LONGDISKIO” on page 276
• “MMFS_QUOTA” on page 276
• “MMFS_SYSTEM_UNMOUNT” on page 277
• “MMFS_SYSTEM_WARNING” on page 277

MMFS_ABNORMAL_SHUTDOWN
The MMFS_ABNORMAL_SHUTDOWN error log entry means that GPFS has determined that it must shut down
all operations on this node because of a problem. Insufficient memory on the node to handle critical
recovery situations can cause this error. In general, there are other error log entries from GPFS or some
other component associated with this error log entry.

MMFS_DISKFAIL
This topic describes the MMFS_DISKFAIL error log available in IBM Storage Scale.
The MMFS_DISKFAIL error log entry indicates that GPFS has detected the failure of a disk and forced the
disk to the stopped state. This is ordinarily not a GPFS error but a failure in the disk subsystem or the path
to the disk subsystem.

MMFS_ENVIRON
This topic describes the MMFS_ENVIRON error log available in IBM Storage Scale.
MMFS_ENVIRON error log entry records are associated with other records of the MMFS_GENERIC or
MMFS_SYSTEM_UNMOUNT types. They indicate that the root cause of the error is external to GPFS and
usually in the network that supports GPFS. Check the network and its physical connections. The data
portion of this record supplies the return code provided by the communications code.

MMFS_FSSTRUCT
This topic describes the MMFS_FSSTRUCT error log available in IBM Storage Scale.
The MMFS_FSSTRUCT error log entry indicates that GPFS has detected a problem with the on-disk
structure of the file system. The severity of these errors depends on the exact nature of the inconsistent
data structure. If it is limited to a single file, then EIO errors are reported to the application and operation
continues. If the inconsistency affects vital metadata structures, then operation ceases on this file
system. These errors are often associated with an MMFS_SYSTEM_UNMOUNT error log entry and probably
occur on all nodes. If the error occurs on all nodes, some critical piece of the file system is inconsistent.
This can occur as a result of a GPFS error or an error in the disk system.
Note: When the mmhealth command displays an fsstruct error, the command prompts you to run a file
system check. When the problem is resolved, issue the following command to clear the fsstruct error
from the mmhealth command. You must specify the file system name twice:

mmsysmonc event filesystem fsstruct_fixed <filesystem_name> <filesystem_name>
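
For example, assuming a hypothetical file system named gpfs0, the invocation would look like this:

mmsysmonc event filesystem fsstruct_fixed gpfs0 gpfs0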

If the file system is severely damaged, then the best course of action is to follow the procedures in
“Additional information to collect for file system corruption or MMFS_FSSTRUCT errors” on page 556, and
then contact the IBM Support Center.

MMFS_GENERIC
This topic describes MMFS_GENERIC error logs available in IBM Storage Scale.
The MMFS_GENERIC error log entry means that GPFS self diagnostics have detected an internal error,
or that additional information is being provided with an MMFS_SYSTEM_UNMOUNT report. If the record is
associated with an MMFS_SYSTEM_UNMOUNT report, the event code fields in the records are the same.
The error code and return code fields might describe the error. See “Messages” on page 728 for a listing
of codes generated by GPFS.

If the error is generated by the self diagnostic routines, then service personnel should interpret the return
and error code fields since the use of these fields varies by the specific error. Errors caused by the
self-checking logic result in the shutdown of GPFS on this node.
MMFS_GENERIC errors can result from an inability to reach a critical disk resource. These errors might
look different depending on the specific disk resource that has become unavailable, like logs and
allocation maps. This type of error is usually associated with other error indications. Other errors
generated by disk subsystems, high availability components, and communications components at the
same time as, or immediately preceding, the GPFS error should be pursued first because they might be
the cause of these errors. MMFS_GENERIC error indications without an associated error of those types
represent a GPFS problem that requires the IBM Support Center.
Before you contact IBM support center, see “Information to be collected before contacting the IBM
Support Center” on page 555.

MMFS_LONGDISKIO
This topic describes the MMFS_LONGDISKIO error log available in IBM Storage Scale.
The MMFS_LONGDISKIO error log entry indicates that GPFS is experiencing very long response time for
disk requests. This is a warning message and can indicate that your disk system is overloaded or that a
failing disk is requiring many I/O retries. Follow your operating system's instructions for monitoring the
performance of your I/O subsystem on this node and on any disk server nodes that might be involved. The
data portion of this error record specifies the disk involved. There might be related error log entries from
the disk subsystems that point to the actual cause of the problem. If the disk is attached to an AIX node,
refer to AIX in IBM Documentation and search for performance management. To enable or disable, use the
mmchfs -w command. For more details, contact the IBM Support Center.
The mmpmon command can be used to analyze I/O performance on a per-node basis. For more
information, see “Monitoring I/O performance with the mmpmon command” on page 59 and “Failures
using the mmpmon command” on page 498.

MMFS_QUOTA
This topic describes the MMFS_QUOTA error log available in IBM Storage Scale.
The MMFS_QUOTA error log entry is used when GPFS detects a problem in the handling of quota
information. This entry is created when the quota manager has a problem reading or writing the quota file.
If the quota manager cannot read all entries in the quota file when mounting a file system with quotas
enabled, the quota manager shuts down but file system manager initialization continues. Mounts do not
succeed and return an appropriate error message (see “File system forced unmount” on page 382).
Quota accounting depends on a consistent mapping between user names and their numeric identifiers.
This means that a single user accessing a quota enabled file system from different nodes should map
to the same numeric user identifier from each node. Within a local cluster this is usually achieved by
ensuring that /etc/passwd and /etc/group are identical across the cluster.
When accessing quota enabled file systems from other clusters, you need to either ensure individual
accessing users have equivalent entries in /etc/passwd and /etc/group, or use the user identity mapping
facility as outlined in the IBM white paper UID Mapping for GPFS in a Multi-cluster Environment (https://
www.ibm.com/docs/en/storage-scale?topic=STXKQY/uid_gpfs.pdf).
It might be necessary to run an offline quota check (mmcheckquota command) to repair or recreate the
quota file. If the quota file is corrupted, then the mmcheckquota command does not restore it. The file
must be restored from the backup copy. If there is no backup copy, an empty file can be set as the new
quota file. This is equivalent to recreating the quota file. To set an empty file or use the backup file, issue
the mmcheckquota command with the appropriate operand:
• -u UserQuotaFilename for the user quota file
• -g GroupQuotaFilename for the group quota file
• -j FilesetQuotaFilename for the fileset quota file
After replacing the appropriate quota file, reissue the mmcheckquota command to check the file system
inode and space usage.

For information about running the mmcheckquota command, see “The mmcheckquota command” on
page 325.
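For example, the following hedged sketch restores the user quota file from a backup copy and then re-checks usage; the file system name fs1 and the backup path are hypothetical placeholders:

# Replace the user quota file from a backup copy
mmcheckquota -u /fs1/backup/user.quota fs1
# Re-check inode and space usage for the file system
mmcheckquota fs1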

MMFS_SYSTEM_UNMOUNT
This topic describes the MMFS_SYSTEM_UNMOUNT error log available in IBM Storage Scale.
The MMFS_SYSTEM_UNMOUNT error log entry means that GPFS has discovered a condition that might
result in data corruption if operation with this file system continues from this node. GPFS has marked
the file system as disconnected and applications accessing files within the file system receive ESTALE
errors. This can be the result of:
• The loss of a path to all disks containing a critical data structure.
If you are using SAN attachment of your storage, consult the problem determination guides provided by
your SAN switch vendor and your storage subsystem vendor.
• An internal processing error within the file system.
See “File system forced unmount” on page 382. Follow the problem determination and repair actions
specified.

MMFS_SYSTEM_WARNING
This topic describes the MMFS_SYSTEM_WARNING error log available in IBM Storage Scale.
The MMFS_SYSTEM_WARNING error log entry means that GPFS has detected a system level value
approaching its maximum limit. This might occur as a result of the number of inodes (files) reaching
its limit. If so, issue the mmchfs command to increase the number of inodes for the file system so there is
at least a minimum of 5% free.
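For example, a hedged illustration of raising the inode limit; the file system name fs1 and the limit value are hypothetical, so size the limit for your workload:

mmchfs fs1 --inode-limit 3000000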

Error log entry example


This topic describes an example of an error log entry in IBM Storage Scale.
This is an example of an error log entry that indicates a failure in either the storage subsystem or
communication subsystem:

LABEL: MMFS_SYSTEM_UNMOUNT
IDENTIFIER: C954F85D

Date/Time: Thu Jul 8 10:17:10 CDT


Sequence Number: 25426
Machine Id: 000024994C00
Node Id: nos6
Class: S
Type: PERM
Resource Name: mmfs

Description
STORAGE SUBSYSTEM FAILURE

Probable Causes
STORAGE SUBSYSTEM
COMMUNICATIONS SUBSYSTEM

Failure Causes
STORAGE SUBSYSTEM
COMMUNICATIONS SUBSYSTEM

Recommended Actions
CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
EVENT CODE
15558007
STATUS CODE
212
VOLUME
gpfsd

Transparent cloud tiering logs
This topic describes how to collect logs that are associated with Transparent cloud tiering.
To collect details of issues specific to Transparent cloud tiering, issue this command:

gpfs.snap [--cloud-gateway {BASIC | FULL}]

With the BASIC option, Transparent cloud tiering service debug information, such as logs, traces, and
Java™ cores, along with minimal system and IBM Storage Scale cluster information, is collected. No
customer-sensitive information is collected.
With the FULL option, extra details such as Java Heap dump are collected, along with the information
captured with the BASIC option.
Successful invocation of this command generates a new .tar file at a specified location, and the file can be
shared with the IBM support team to debug a field issue.
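For example, to collect the basic set of Transparent cloud tiering debug data:

gpfs.snap --cloud-gateway BASIC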

Performance monitoring tool logs


The performance monitoring tool logs can be found in the /var/log/zimon directory on each node
configured for performance monitoring.
The nodes that are configured as Collector have two files in this directory: ZIMonCollector.log
and ZIMonSensors.log. For nodes configured as Sensor, only the ZIMonSensors.log file is present.
These log files contain information, warning, and error messages for the collector service pmcollector,
and the sensor service pmsensors.
Both log files are rotated every day. The previous logs are compressed and saved in the same /var/log/
zimon directory.
During installation, the log level is set to info. Issue the mmperfmon config show command to see the
current log level as shown in the following sample output:
# mmperfmon config show

cephMon = "/opt/IBM/zimon/CephMonProxy"
cephRados = "/opt/IBM/zimon/CephRadosProxy"
colCandidates = "nsd003st001", "nsd004st001"
colRedundancy = 2
collectors = {
host =""
port = "4739"
}
config = "/opt/IBM/zimon/ZIMonSensors.cfg"
ctdbstat = ""
daemonize = T
hostname = ""
ipfixinterface = "0.0.0.0"
logfile = "/var/log/zimon/ZIMonSensors.log"
loglevel = "info"
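For example, an illustrative way to check a sensor node for recent problems by using the log file path shown above:

# Show the most recent warning and error messages from the sensor service
grep -iE "warn|error" /var/log/zimon/ZIMonSensors.log | tail -n 20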

File audit logging logs


All major actions performed in configuring, starting, and stopping file audit logging produce messages in
multiple logs at varying degrees of granularity.
The primary log associated with the mmaudit command is the primary log to look at for issues specific to
file audit logging. There are some other logs that might contain useful information as well. Because some
logs might grow quickly, log rotation is used for all logs. Therefore, it is important to gather logs as soon as
an issue is found and to look for logs that are not by default captured with gpfs.snap (no gzip versions
of old logs are gathered by default by gpfs.snap). The following list describes the types of logs that are
useful for problem determination:
• mmaudit log: This log contains information regarding the setup and configuration operations that affect
file audit logging. Information is put into this log on any node running the file audit logging command
or location where the subcommand might be run. This log is located at /var/adm/ras/mmaudit.log.
This log is collected by gpfs.snap.

• mmfs.log.latest: This is the most current version of the IBM Storage Scale daemon log. It contains
entries from when major file audit logging activity occurs. Types of activity within this log file are enable
or disable of file audit logging for a given device, or if an error is encountered when attempting to enable
or disable file audit logging for a given device. This log file is collected by gpfs.snap.
The gpfs.snap command gathers log files from multiple components, including file audit logging. For file
audit logging, the following file is collected: /var/adm/ras/mmaudit.log. In addition, one file is held
in the CCR and saved when the gpfs.snap command is run: spectrum-scale-file-audit.conf.
This CCR file contains the file audit logging configuration for all devices in the local cluster that is being
audited.
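For example, an illustrative first check when investigating a file audit logging problem is to inspect the primary log directly by using the path listed above:

tail -n 50 /var/adm/ras/mmaudit.log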

Active File Management error logs


Some errors that occur during replication impact the replication. Information about these errors is logged
in the /var/adm/ras/mmfs.log.latest file.

Table 59. AFM logged errors

E_PERM (1)
   Description: Operation not permitted. This error occurred when an operation that is limited to
   processes with appropriate privileges or to the owner of a file or other resources is performed.
   User action: Check the appropriate privileges for the complete path. Running operation can be
   requeued. Report this issue.

E_NOENT (2)
   Description: No file or directory found. A component of a specified named path did not exist.
   User action: A file or a directory is deleted. Check the path. Running operation can be requeued.
   Report this issue.

E_INTR (4)
   Description: Interrupted system call error. An asynchronous signal was caught by the process during
   the execution of an interruptible function. This error occurs when the recovery is in progress and
   the gateway node is interrupted.
   User action: You can reset running operations. Report this error if it occurs regularly.

E_IO (5)
   Description: Input/output error. Some physical or network input or output error occurred. The same
   file descriptor is lost (overwritten or by any subsequent errors).
   User action: Check the file and its valid attribute. It might be a temporary error, and it might not
   occur in a subsequent operation.

E_ACCES (13)
   Description: Permission denied error. An attempt was made to access a file that is forbidden by its
   file access permissions.
   User action: Check the permission for the complete path. Running operation can be requeued. Report
   this issue.

E_EXIST (17)
   Description: File exists error. This error is logged when a file or a directory exists at the other
   site and fails to be replaced due to compatibility.
   User action: Check the file and its valid attribute. Running operation can be requeued. Report this
   issue.

E_ISDIR (20)
   Description: Not a directory error. A component of a specified file path existed, but it was not a
   directory, when a directory was expected.
   User action: Running operation can be requeued. Report this issue.

E_INVAL (22)
   Description: Invalid argument error. Some invalid argument was supplied. This error occurs when
   input is invalid for processing.
   User action: Running operation can be requeued. Report this issue.

E_ROFS (30)
   Description: Read-only file system. An attempt was made to modify a file or a directory on a file
   system that was read-only at the time.
   User action: Check the path. Running operation can be requeued. Report this issue.

E_RANGE (34)
   Description: Result too large. A numerical result of the function was too large to fit in the
   available space (might have exceeded precision).
   User action: Running operation can be requeued. Report this issue.

E_STALE (52)
   Description: File system or mount path is not accessible or the connection is reset.
   User action: Check the home path from the cache. It is a temporary failure and can be resolved when
   the home or home fileset is accessible.

E_OPNOTSUPP (64)
   Description: Operation not supported. The attempted operation is not supported for the type of
   object referenced. Usually, this error occurs when a file descriptor refers to a file or socket that
   cannot support this operation.
   User action: Running operation can be requeued. Report this issue.

E_TIMEDOUT (78)
   Description: Operation timed out. A connect or send request failed because the user did not properly
   respond after a time. The timeout period depends on the communication protocol; this error also
   occurs when the network is overloaded and the replication does not get enough bandwidth.
   User action: It is a temporary error and can succeed in the next operation.

E_DQUOT (88)
   Description: Disc quota exceeded error. It occurs when the disc quota is exhausted.
   User action: Check and increase the disc quota. It is a temporary failure and can be resolved in a
   next access.

E_NOATTR (112)
   Description: No attribute found. A file or directory does not have an attribute to access the file.
   The file is no longer valid and not accessible.
   User action: Check the file and its valid attribute. Running operation can be requeued. Report this
   issue.

E_DAEMON_DEATH (307)
   Description: This error is logged when the daemon is interrupted and fails to service requests.
   User action: All running operations can be interrupted. Report this issue.

E_NODE_FAILED (734)
   Description: This error occurs because of an RPC failure. The sender was notified by the config
   manager that the RPC destination failed.
   User action: Running operation can be requeued. Report this issue.

E_NOT_MOUNTED (753)
   Description: File system was not mounted. This error can be logged because of this issue.
   User action: Check the path. Running operation can be requeued. Report this issue. This error is
   resolved when the file system is mounted back.

E_CACHE_CONFLICT (755)
   Description: Cache file and original are not synchronized. Handling depends on the current conflict
   semantics.
   User action: Check the file and its valid attribute. Running operation can be requeued. Report this
   issue.

E_REQUEUED (765)
   Description: Conflict error in the replication. It occurs when a replication operation does not find
   the correct action and fails to replicate.
   User action: Running operation can be requeued. Report this issue.

E_PANIC (666)
   Description: This error occurs because of an unrecoverable error.
   User action: Running operation can be requeued. Report this issue.

E_REMOTEIO (1121)
   Description: Remote I/O error. Home or home fileset is not accessible. This error occurs because the
   special ctl operation failed.
   User action: Check the home path from the cache. It is a temporary failure and can be resolved when
   the home or home fileset is accessible.

Setting up core dumps on a client RHEL or SLES system


No core dump configuration is set up by IBM Storage Scale by default. Core dumps can be configured
in a few ways. For information about SLES core dumps, see SUSE Knowledge base article: How to obtain
application core dumps: https://www.suse.com/support/kb/doc/?id=3054866
core_pattern + ulimit
The simplest way is to change the core_pattern file at /proc/sys/kernel/core_pattern
and to enable core dumps by using the ulimit -c unlimited command. Setting it
to something like /var/log/cores/core.%e.%t.%h.%p produces core dumps similar to
core.bash.1236975953.node01.2344 in /var/log/cores. This action creates core dumps for
Linux binaries but does not produce information for Java or Python exceptions.
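For example, a minimal sketch of this approach; the target directory is an example only, and the ulimit setting applies to the current shell, so make it persistent through your distribution's limits configuration if needed:

# Create the target directory and route core dumps to it
mkdir -p /var/log/cores
echo '/var/log/cores/core.%e.%t.%h.%p' > /proc/sys/kernel/core_pattern
# Allow core files of unlimited size in this shell
ulimit -c unlimited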
ABRT
ABRT can be used to produce more detailed output as well as output for Java and Python exceptions.
The following packages should be installed to configure ABRT:
• abrt (Core package)
• abrt-cli (CLI tools)
• abrt-libs (Libraries)
• abrt-addon-ccpp (C/C++ crash handler)
• abrt-addon-python (Python unhandled exception handler)
• abrt-java-connector (Java crash handler)
This overwrites the values stored in core_pattern to pass core dumps to abrt. It then writes this
information to the abrt directory configured in /etc/abrt/abrt.conf. Python exceptions are caught by
the Python interpreter automatically importing the abrt.pth file installed in /usr/lib64/python2.7/
site-packages/. If some custom configuration has changed this behavior, Python dumps might not be
created.

To get Java runtimes to report unhandled exceptions through abrt, they must be executed with the
command line argument -agentpath=/usr/lib64/libabrt-java-connector.so.
Note: Passing exception information to ABRT by using the ABRT library causes a decrease in the
performance of the application.
ABRT Config files
The ability to collect core dumps has been added to gpfs.snap using the '--protocol core' option.
This attempts to gather core dumps from a number of locations:
• If core_pattern is set to dump to a file it attempts to get dumps from the absolute path or from the root
directory (the CWD for all IBM Storage Scale processes)
• If core_pattern is set to redirect to abrt it tries to read the /etc/abrt/abrt.conf file and read the
DumpLocation variable. All files and folders under this directory are gathered.
• If the DumpLocation value cannot be read, then a default of /var/tmp/abrt is used.
• If core_pattern is set to use something other than abrt or a file path, then core dumps are not collected
for the OS.
Samba can dump to the directory /var/adm/ras/cores/. Any files in this directory are gathered.
Verification steps for RHEL: After the setup is complete, check whether the contents of the /proc/sys/
kernel/core_pattern file are starting with |/usr/libexec/abrt-hook-ccpp.
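For example, the verification can be done as follows; the expected output on a correctly configured RHEL node begins with |/usr/libexec/abrt-hook-ccpp:

cat /proc/sys/kernel/core_pattern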

Configuration changes required on protocol nodes to collect core dump data


To collect core dumps for debugging programs in provided packages, these system configuration changes
need to be made on all protocol nodes in the cluster.
1. Install the abrt-cli RPM if not already installed. For example, run rpm -qa | grep abrt-cli to
check if it is already installed, or yum install abrt-cli to install the RPM.
2. Set OpenGPGCheck=no in the /etc/abrt/abrt-action-save-package-data.conf file.
3. Set MaxCrashReportsSize = 0 in the /etc/abrt/abrt.conf file.
4. Start (or restart) the abrt daemon (for example, run systemctl start abrtd to start the abrt
daemon after a new install, or systemctl restart abrtd if the daemon was already running and
the values in steps 2 and 3 were changed).
For additional details about RHEL 6 and RHEL 7, see the relevant documentation at Red Hat.
Additional setup steps applicable for NFS
A core dump might not be generated for code areas where the CES NFS process has changed credentials.
To avoid this, do the following steps:
1. Insert the following entry into the /etc/sysctl.conf file:

fs.suid_dumpable = 2

2. Issue the following command to refresh with the new configuration:

sysctl -p

3. Verify that /proc/sys/fs/suid_dumpable is correctly set:

cat /proc/sys/fs/suid_dumpable

Note: The system displays the following output if it is correctly set:

2
Setting up an Ubuntu system to capture crash files
This is the procedure for setting up an Ubuntu system for capturing crash files and debugging CES NFS
core dump.
This setup is IBM Storage Scale version independent and applies to Ubuntu 16.04.1 and 16.04.2.
1. Install apport. For more information, see https://wiki.ubuntu.com/Apport.
2. Modify the /etc/apport/crashdb.conf file and comment this line 'problem_types': ['Bug',
'Package'], as follows:
# 'problem_types': ['Bug', 'Package'],
Note: After these steps are performed, crash files are saved to the /var/crash/ folder.
Verification steps for Ubuntu: After the setup is completed, verify the following:
a. Run the systemctl status apport.service command to check if Apport service is running.
b. If Apport service is not running, then start it with the systemctl start apport.service
command and again verify whether it has started successfully by running the systemctl status
apport.service command.
Note: Apport modifies the /proc/sys/kernel/core_pattern file. Verify the core_pattern
file content starts with |/usr/share/apport/apport.

Trace facility
The IBM Storage Scale system includes many different trace points to facilitate rapid problem
determination of failures.
IBM Storage Scale tracing is based on the kernel trace facility on AIX, embedded GPFS trace subsystem
on Linux, and the Windows ETL subsystem on Windows. The level of detail that is gathered by the trace
facility is controlled by setting the trace levels using the mmtracectl command.
The mmtracectl command sets up and enables tracing using default settings for various common
problem situations. Using this command improves the probability of gathering accurate and reliable
problem determination information. For more information about the mmtracectl command, see the IBM
Storage Scale: Command and Programming Reference Guide.

Generating GPFS trace reports


Use the mmtracectl command to configure trace-related configuration variables and to start and stop
the trace facility on any range of nodes in the GPFS cluster.
To configure and use the trace properly:
1. Issue the mmlsconfig dataStructureDump command to verify that a directory for dumps was
created when the cluster was configured. The default location for trace and problem determination
data is /tmp/mmfs. Use mmtracectl, as instructed by the IBM Support Center, to set trace
configuration parameters as required if the default parameters are insufficient. For example, if the
problem results in GPFS shutting down, set the traceRecycle variable with --trace-recycle as
described in the mmtracectl command in order to ensure that GPFS traces are performed at the time
the error occurs.
If desired, specify another location for trace and problem determination data by issuing this command:

mmchconfig dataStructureDump=path_for_storage_of_dumps

2. To start the tracing facility on all nodes, issue this command:

mmtracectl --start

3. Re-create the problem.


4. When the event to be captured occurs, stop the trace as soon as possible by issuing this command:

mmtracectl --stop

5. The output of the GPFS trace facility is stored in /tmp/mmfs, unless the location was changed using
the mmchconfig command in Step “1” on page 283. Save this output.
6. If the problem results in a shutdown and restart of the GPFS daemon, set the traceRecycle variable
as necessary to start tracing automatically on daemon startup and stop the trace automatically on
daemon shutdown.
If the problem requires more detailed tracing, the IBM Support Center might ask you to modify the GPFS
trace levels. Use the mmtracectl command to establish the required trace classes and levels of tracing.
The syntax to modify trace classes and levels is as follows:

mmtracectl --set --trace={io | all | def | "Class Level [Class Level ...]"}

For example, to tailor the trace level for I/O, issue the following command:

mmtracectl --set --trace=io

Once the trace levels are established, start the tracing by issuing:

mmtracectl --start

After the trace data has been gathered, stop the tracing by issuing:

mmtracectl --stop

To clear the trace settings and make sure tracing is turned off, issue:

mmtracectl --off

Other possible values that can be specified for the trace Class include:
afm
active file management
alloc
disk space allocation
allocmgr
allocation manager
basic
'basic' classes
brl
byte range locks
ccr
cluster configuration repository
cksum
checksum services
cleanup
cleanup routines
cmd
ts commands
defrag
defragmentation
dentry
dentry operations
dentryexit
daemon routine entry/exit

disk
physical disk I/O
disklease
disk lease
dmapi
Data Management API
ds
data shipping
errlog
error logging
eventsExporter
events exporter
file
file operations
fs
file system
fsck
online multinode fsck
ialloc
inode allocation
io
physical I/O
kentryexit
kernel routine entry/exit
kernel
kernel operations
klockl
low-level vfs locking
ksvfs
generic kernel vfs information
lock
interprocess locking
log
recovery log
malloc
malloc and free in shared segment
mb
mailbox message handling
mmpmon
mmpmon command
mnode
mnode operations
msg
call to routines in SharkMsg.h
mutex
mutexes and condition variables
nsd
network shared disk
perfmon
performance monitors

pgalloc
page allocator tracing
pin
pinning to real memory
pit
parallel inode tracing
quota
quota management
rdma
rdma
sanergy
SANergy
scsi
scsi services
sec
cluster security
shared
shared segments
smb
SMB locks
sp
SP message handling
super
super_operations
tasking
tasking system but not Thread operations
thread
operations in Thread class
tm
token manager
ts
daemon specific code
user1
miscellaneous tracing and debugging
user2
miscellaneous tracing and debugging
vbhvl
behaviorals
vnode
vnode layer of VFS kernel support
vnop
one line per VNOP with all important information
Values that can be specified for the trace Class, relating to vdisks, include:
vdb
vdisk debugger
vdisk
vdisk
vhosp
vdisk hospital

For more information about vdisks and IBM Storage Scale RAID, see IBM Storage Scale RAID:
Administration.
The trace Level can be set to a value from 0 through 14, which represents an increasing level of detail. A
value of 0 turns tracing off. To display the trace level in use, issue the mmfsadm showtrace command.
On AIX, the –aix-trace-buffer-size option can be used to control the size of the trace buffer in
memory.
On Linux nodes only, use the mmtracectl command to change the following:
• The trace buffer size in blocking mode.
For example, to set the trace buffer size in blocking mode to 8K, issue:

mmtracectl --set --tracedev-buffer-size=8K

• The raw data compression level.


For example, to set the trace raw data compression level to the best ratio, issue:

mmtracectl --set --tracedev-compression-level=9

• The trace buffer size in overwrite mode.


For example, to set the trace buffer size in overwrite mode to 500M, issue:

mmtracectl --set --tracedev-overwrite-buffer-size=500M

• When to overwrite the old data.


For example, to wait to overwrite the data until the trace data is written to the local disk and the buffer
is available again, issue:

mmtracectl --set --tracedev-write-mode=blocking

--tracedev-write-mode=blocking specifies that if the trace buffer is full, wait until the trace data
is written to the local disk and the buffer becomes available again to overwrite the old data. This is the
default. --tracedev-write-mode=overwrite specifies that if the trace buffer is full, overwrite the
old data.
Note: Before switching between --tracedev-write-mode=overwrite and --tracedev-write-
mode=blocking, or vice versa, run the mmtracectl --stop command first. Next, run the
mmtracectl --set --tracedev-write-mode command to switch to the desired mode. Finally,
restart tracing with the mmtracectl --start command.
For more information about the mmtracectl command, see the IBM Storage Scale: Command and
Programming Reference Guide.

CES tracing and debug data collection


You can collect debugging information in Cluster Export Services.

Data collection
To diagnose the cause of an issue, it might be necessary to gather some extra information from the
cluster. This information can then be used to determine the root cause of an issue.
Collection of debugging information, such as configuration files and logs, can be gathered by using
the gpfs.snap command. This command gathers data about GPFS, operating system information, and
information for each of the protocols. Following services can be traced by gpfs.snap command:
GPFS + OS
GPFS configuration and logs plus operating system information such as network configuration or
connected drives.

CES
Generic protocol information such as configured CES nodes.

NFS
CES NFS configuration and logs.
SMB
SMB and CTDB configuration and logs.
OBJECT
Openstack Swift and Keystone configuration and logs.
AUTHENTICATION
Authentication configuration and logs.
PERFORMANCE
Dump of the performance monitor database.
Information for each of the enabled protocols is gathered automatically when the gpfs.snap command
is run. If any protocol is enabled, then information for CES and authentication is gathered.
To gather performance data, add the --performance option. The --performance option causes
gpfs.snap to try to collect performance information.
Note: Gather the performance data only if necessary, as this process can take up to 30 minutes to run.
If data is required for only one protocol or area, the automatic collection can be bypassed. Provide one
or more of the following values to the --protocol argument: smb,nfs,object,ces,auth,none
If the --protocol argument is provided, automatic data collection is disabled. If --protocol
smb,nfs is provided to gpfs.snap, only NFS and SMB information is gathered and no CES or
authentication data is collected. To disable all protocol data collection, use the argument --protocol
none.
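For example, the following invocations illustrate the options that are described above (a sketch; adjust
the protocol list to the services that are relevant to your problem):

gpfs.snap --protocol smb,nfs
gpfs.snap --protocol none
gpfs.snap --performance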

Types of tracing
Tracing is logging at a high level. The mmprotocoltrace command that is used for starting and stopping
tracing supports SMB, Winbind, Network, and Object tracing.
NFS tracing can be done with a combination of commands.
NFS
NFS tracing is achieved by increasing the log level, repeating the issue, capturing the log file, and then
restoring the log level.
To increase the log level, use the command mmnfs config change LOG_LEVEL=FULL_DEBUG.
The mmnfs config change command restarts the NFS server on all nodes. Alternatively, you can
increase the log level by using ganesha_mgr; this change takes effect without a restart, but only on the
node on which the command is run.
You can set the log level to the following values: NULL, FATAL, MAJ, CRIT, WARN, EVENT, INFO,
DEBUG, MID_DEBUG, and FULL_DEBUG.
FULL_DEBUG is the most useful level for debugging purposes, but it produces a large amount of data,
which can strain disk usage and affect performance.
After the issue is re-created by running the gpfs.snap command either with no arguments or with
the --protocol nfs argument, the NFS logs are captured. The logs can then be used to diagnose
any issues.
To return the log level to normal, use the same command but with a lower logging level. The default
value is EVENT. A minimal example of this workflow is shown after this list.
HDFS
CES also supports HDFS protocols. For more information, see CES HDFS troubleshooting under IBM
Storage Scale support for Hadoop in Big data and analytics support documentation.
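The following sequence is a minimal sketch of the NFS tracing workflow that is described above; it uses
only the commands that are already mentioned in this topic:

# raise the CES NFS log level (restarts the NFS server on all nodes)
mmnfs config change LOG_LEVEL=FULL_DEBUG
# reproduce the problem, then capture the NFS logs
gpfs.snap --protocol nfs
# restore the default log level
mmnfs config change LOG_LEVEL=EVENT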



Collecting trace information
Use the mmprotocoltrace command to collect trace information for debugging system problems or
performance issues. For more information, see the mmprotocoltrace command in the IBM Storage
Scale: Command and Programming Reference Guide.

Running a typical trace


The following steps describe how to run a typical trace.
It is assumed that the trace system is reset for the type of trace that you want to run: SMB, Network, or
Object. The examples use the SMB trace.
1. Before you start the trace, you can check the configuration settings for the type of trace that you plan
to run:

mmprotocoltrace config smb

The response to this command displays the current settings from the trace configuration file. For more
information about this file, see “Trace configuration file” on page 291.
2. Clear the trace records from the previous trace of the same type:

mmprotocoltrace clear smb

This command responds with an error message if the previous state of a trace node is something other
than DONE or FAILED. If this error occurs, follow the instructions in the “Resetting the trace system”
on page 292 section.
3. Start the new trace:

mmprotocoltrace start smb -c <clientIP>

The following response is typical:

Setting up traces
Trace '5d3f0138-9655-4970-b757-52355ce146ef' created successfully for 'smb'
Waiting for all participating nodes...
Trace ID: 5d3f0138-9655-4970-b757-52355ce146ef
State: ACTIVE
Protocol: smb
Start Time: 09:22:46 24/04/20
End Time: 09:32:46 24/04/20
Trace Location: /tmp/mmfs/smb.20200424_092246.trc
Origin Node: ces5050-41.localnet.com
Client IPs: 10.0.100.42, 10.0.100.43
Syscall: False
Syscall Only: False
Nodes:
Node Name: ces5050-41.localnet.com
State: ACTIVE
Node Name: ces5050-42.localnet.com
State: ACTIVE
Node Name: ces5050-43.localnet.com
State: ACTIVE

To display more status information, add the -v (verbose) option:

mmprotocoltrace -v status smb

If the status of a node is FAILED, the node did not start successfully. Look at the logs for the node
to determine the problem. After you fix the problem, reset the trace system by following the steps in
“Resetting the trace system” on page 292.
4. If all the nodes started successfully, perform the actions that you want to trace. For example, if you are
tracing a client IP address, enter commands that create traffic on that client.
5. Stop the trace:

mmprotocoltrace stop smb



The following response is typical. The last line gives the location of the trace log file:

Stopping traces
Trace '01239483-be84-wev9-a2d390i9ow02' stopped for smb
Waiting for traces to complete
Waiting for node 'node1'
Waiting for node 'node2'
Finishing trace '01239483-be84-wev9-a2d390i9ow02'
Trace tar file has been written to '/tmp/mmfs/smb.20150513_162322.trc/
smb.trace.20150513_162542.tar.gz'

If you do not stop the trace, it continues until the trace duration expires. For more information, see
“Trace timeout” on page 290.
6. Look in the trace log files for the results of the trace. For more information, see “Trace log files” on
page 290.

Trace timeout
If you do not stop a trace manually, the trace runs until its trace duration expires. The default trace
duration is 10 minutes, but you can set a different value in the mmprotocoltrace command.
Each node that participates in a trace starts a timeout process that is set to the trace duration. When a
timeout occurs, the process checks the trace status. If the trace is active, the process stops the trace,
writes the file location to the log file, and exits. If the trace is not active, the timeout process exits.
If a trace stops because of a timeout, you can use the mmprotocoltrace status command to find the
location of the trace log file. The command gives an output similar to the following:

[root@ces5040-41 ~]# mmprotocoltrace status


Trace ID: f5d75a67-621e-4f09-8d00-3f9efc4093f2
State: DONE
Protocol: smb
Start Time: 15:24:30 25/09/19
End Time: 15:34:30 25/09/19
Trace Location: /tmp/mmfs/smb.20190925_152430.trc
Origin Node: ces5040-41.localnet.com
Client IPs: 10.0.100.42, 10.0.100.43
Trace results file: ces5040-41.localnet.com:/tmp/mmfs/smb.trace.20190925_152552.tar.gz
Syscall: False
Syscall Only: False
Nodes:
Node Name: ces5040-41.localnet.com
State: DONE

Node Name: ces5040-42.localnet.com


State: DONE

Node Name: ces5040-43.localnet.com


State: DONE

Trace log files


Trace log files are compressed files in the /tmp/mmfs directory. The contents of a trace log file depend
on the type of trace.
The product supports four types of tracing: SMB, Network, Object, and Winbind.
SMB
SMB tracing captures Server Message Block information. The resulting trace log file contains an
smbd.log file for each node for which information is collected and for each client that is connected to
this node. A trace captures information for all clients with the specified IP address.
Network
Network tracing uses the tcpdump utility to capture network packets. The resulting trace log file contains
a pcapng file that is readable by Wireshark and other programs. The file name is similar to
bfn22-10g_all_00001_20150907125015.pcap.
If the mmprotocoltrace command specifies a client IP address, the trace captures traffic between
that client and the server. If no IP address is specified, the trace captures traffic across all network
interfaces of each participating node.



Object
The trace log file contains log files for each node, one for each of the object services.
Object tracing sets the log location in the rsyslog configuration file. For more information about this
file, see the description of the rsyslogconflocation configuration parameter in “Trace configuration file”
on page 291.
It is not possible to filter an Object trace by client, so information for all connections is
recorded.
Winbind
Winbind tracing collects detailed logging information (level 10) for the winbind component when it is
used for protocol authentication.

Trace configuration file


Each node in the cluster has its own trace configuration file, which is stored in the /var/mmfs/ces
directory.
The configuration file contains settings for logging and for each type of tracing:
[logging]
filename
The name of the log file.
level
The current logging level, which can be debug, info, warning, error, or critical.
[smb]
defaultloglocation
The default log location that is used by the reset command or when current information is not
retrievable.
defaultloglevel
The default log level that is used by the reset command or when current information is not
retrievable.
traceloglevel
The log level for tracing.
maxlogsize
The maximum size of the log file in kilobytes.
esttracesize
The estimated trace size in kilobytes.
[network]
numoflogfiles
The maximum number of log files.
logfilesize
The maximum size of the log file in kilobytes.
esttracesize
The estimated trace size in kilobytes.
[object]
defaultloglocation
The default log location that is used by the reset command or when current information is not
retrievable.
defaultloglevel
The default log level that is used by the reset command or when current information is not
retrievable.
traceloglevel
The log level for tracing.



rsyslogconflocation
The location of the rsyslog configuration file. Rsyslog is a service that is provided by Red Hat, Inc.
that redirects log output. The default location is /etc/rsyslog.d/00-swift.conf.
esttracesize
The estimated trace size in kilobytes.
[winbind]
defaultlogfiles
The location of the winbind log files. The default location is /var/adm/ras/log.w*.
defaultloglevel
The default log level that is used by the reset command or when current information is not
retrievable. The value of defaultloglevel is set to 1.
traceloglevel
The log level for tracing. The value for traceloglevel is set to 10.
esttracesize
The estimated trace size in kilobytes. The value of esttracesize is set to 500000.
[syscalls]
args
The CLI arguments that are used when executing the strace_executable. The default is -T -tt
-C.

Resetting the trace system


Before you run a new trace, verify that the trace system is reset for the type of trace that you want to run:
SMB, Network, or Object.
The mmprotocoltrace command allows only one simultaneous trace per component to avoid
unexpected side effects. Before starting a new trace for a component, you need to stop the previous
trace if it is still running. After the trace is stopped, its results are saved for later analysis. The trace
results can be analyzed by using the mmprotocoltrace status command. The trace results can
be removed explicitly by using the mmprotocoltrace clear command. If the trace results are not
removed explicitly, the system displays the following prompt:

[root@ces5050-41 MC_ces5050-41]# mmprotocoltrace start smb -c 1.1.1.1


For the following protocols traces are still running: smb
Starting new traces requires that all previous traces for the corresponding protocols are
cleared.
Do you want to clear these traces? (yes/no - default no):

You cannot proceed with the new trace unless you have removed the old trace results.
If a trace cannot be stopped and cleared as described, you must perform the following recovery
procedure:
1. Run the mmprotocoltrace clear command in the force mode to clear the trace as shown:

mmprotocoltrace --force clear smb

Note: After a forced clear, the trace system might still be in an invalid state.
2. Run the mmprotocoltrace reset command as shown:

mmprotocoltrace reset smb



Using advanced options
The status command with the -v (verbose) option provides more trace information, including the values
of trace variables. The following command returns verbose trace information for the SMB trace:

mmprotocoltrace -v status smb

The reset command restores the trace system to the default values that are set in the trace
configuration file. The command also performs special actions for each type of trace:
• For an SMB trace, the reset removes any IP-specific configuration files and sets the log level and log
location to the default values.
• For a Network trace, the reset stops all dumpcap processes.
• For an Object trace, the reset sets the log level to the default value. It then sets the log location to the
default location in the rsyslog configuration file, and restarts the rsyslog service.
The following command resets the SMB trace:

mmprotocoltrace reset smb

Tips for using mmprotocoltrace


Follow these tips for mmprotocoltrace.

Specifying nodes with the -N and -c parameters.


It is important to understand the difference between the -N and -c parameters of the
mmprotocoltrace command:
• The -N parameter specifies the CES nodes where you want tracing to be done. The default value is all
CES nodes.
• The -c parameter specifies the IP addresses of clients whose incoming connections are to be traced.
Where these clients are connected to the CES nodes that are specified in the -N parameter, those CES
nodes trace the connections with the clients.
For example, in the SMB trace started by the following command, the CES node 10.40.72.105 traces
incoming connections from clients 192.168.4.1, 192.168.4.26, and 192.168.4.22. The command
is all on one line:

mmprotocoltrace start smb -c 192.168.4.1,192.168.4.26,192.168.4.22 -N 10.40.72.105

Discovering client IP addresses for an smb trace


If you have only a few clients that you want to trace, you can list their IP addresses by running the
smbstatus system command on a CES node. This command lists the IP addresses of all smb clients that
are connected to the node.
However, if many clients are connected to the CES node, running the smbstatus command on the node
to discover client IP addresses might not be practical. The command sets a global lock on the node for the
entire duration of the command, which might be a long time if many clients are connected.
Instead, run the system command ip on each client that you are interested in and filter the results
according to the type of device that you are looking for. In the following example, the command is run on
client ch-41 and lists the IP address 10.0.100.41 for that client:

[root@ch-41 ~]# ip a | grep "inet "


inet 127.0.0.1/8 scope host lo



inet 10.0.100.41/24 brd 10.255.255.255 scope global eth0

A client might have more than one IP address, as in the following example where the command ip is run
on client ch-44:

[root@ch-44 ~]# ip a | grep "inet "


inet 127.0.0.1/8 scope host lo
inet 10.0.100.44/24 brd 10.255.255.255 scope global eth0
inet 192.168.4.1/16 brd 192.168.255.255 scope global eth1
inet 192.168.4.26/16 brd 192.168.255.255 scope global secondary eth1:0
inet 192.168.4.22/16 brd 192.168.255.255 scope global secondary eth1:1

In such a case, specify all the possible IP addresses in the mmprotocoltrace command because you
cannot be sure which IP address the client uses. The following example specifies all the IP addresses that
the previous example listed for client ch-44, and by default all CES nodes trace incoming connections
from any of these IP addresses:

mmprotocoltrace start smb -c 10.0.100.44,192.168.4.1,192.168.4.26,192.168.4.22

Collecting diagnostic data through GUI


IBM Support might ask you to collect logs, trace files, and dump files from the system to help them
resolve a problem. You can perform this task from the management GUI or by using the gpfs.snap
command. Use the Support > Diagnostic Data page in the IBM Storage Scale GUI to collect details of the
issues reported in the system.
The entire set of diagnostic data available in the system helps to analyze all kinds of IBM Storage Scale
issues. Depending on the data selection criteria, these files can be large (gigabytes) and might take an
hour to download. The diagnostic data is collected from each individual node in a cluster. In a cluster with
hundreds of nodes, downloading the diagnostic data might take a long time and the downloaded file can
be large in size.
It is always better to reduce the size of the log file as you might need to send it to IBM Support to help fix
the issues. You can reduce the size of the diagnostic data file by reducing the scope. The following options
are available to reduce the scope of the diagnostic data:
• Include only affected functional areas
• Include only affected nodes
• Reduce the number of days for which the diagnostic data needs to be collected
The following three modes are available in the GUI to select the functional areas of the diagnostic data:
1. Standard diagnostics
The data that is collected in the standard diagnostics consists of the configuration, status, log files,
dumps, and traces in the following functional areas:
• Core IBM Storage Scale
• Network
• GUI
• NFS
• SMB
• Object
• Authentication
• Cluster export services (CES)
• Crash dumps
You can download the diagnostic data for the above functional areas at the following levels:



• All nodes
• Specific nodes
• All nodes within one or more node classes
2. Deadlock diagnostics
The data that is collected in this category consists of the minimum amount of data that is needed to
investigate a deadlock problem.
3. Performance diagnostics
The data that is collected in this category consists of the system performance details collected from
performance monitoring tools. You can only use this option if it is requested by the IBM Support.
The GUI log files contain only the issues that are related to the GUI and are smaller in size. The GUI log
consists of the following types of information:
• Traces from the GUI that contain information about errors that occurred inside the GUI code
• Several configuration files of the GUI and PostgreSQL
• A dump of the PostgreSQL database that contains IBM Storage Scale configuration data and events
• Output of most mmls* commands
• Logs from the performance collector
Note: Instead of collecting the diagnostic data again, you can also use diagnostic data that was
collected in the past. You can assess the relevance of the historic data based on the date on which the
issue was reported in the system. Delete diagnostic data that is no longer needed to save disk
space.

Sharing the diagnostic data with the IBM Support using call home
The call home feature shares support information and your contact information with IBM on a scheduled basis.
IBM Support monitors the details that are shared by call home and takes the necessary action in case
of any issues or potential issues. Enabling call home reduces the response time for IBM Support to
address the issues.
You can also manually upload the diagnostic data that is collected through the Support > Diagnostic Data
page in the GUI to share the diagnostic data to resolve a Problem Management Record (PMR). To upload
data manually, perform the following steps:
1. Go to Support > Diagnostic Data.
2. Collect diagnostic data based on the requirement. You can also use the previously collected data for
the upload.
3. Select the relevant data set from the Previously Collected Diagnostic Data section and then right-
click and select Upload to PMR.
4. Select the PMR to which the data must be uploaded and then click Upload.

CLI commands for collecting issue details


You can issue several CLI commands to collect details of the issues that you might encounter while using
IBM Storage Scale.

Using the gpfs.snap command


This topic describes the usage of gpfs.snap command in IBM Storage Scale.
Running the gpfs.snap command with no options is similar to running the gpfs.snap -a command.
It collects data from all nodes in the cluster. This invocation creates a file that is made up of multiple
gpfs.snap snapshots. The file that is created includes a master snapshot of the node from which the
gpfs.snap command was invoked, and non-master snapshots of each of the other nodes in the cluster.



If the node on which the gpfs.snap command is run is not a file system manager node, the gpfs.snap
command creates a non-master snapshot on the file system manager nodes.
The difference between a master snapshot and a non-master snapshot is the data that is gathered. A
master snapshot gathers information from nodes in the cluster. A master snapshot contains all data that a
non-master snapshot has. There are two categories of data that is collected:
1. Data that is always gathered by the gpfs.snap command for master snapshots and non-master
snapshots:
• “Data gathered by gpfs.snap on all platforms” on page 296
• “Data gathered by gpfs.snap on AIX” on page 297
• “Data gathered by gpfs.snap on Linux” on page 298
• “Data gathered by gpfs.snap on Windows” on page 301
2. Data that is gathered by the gpfs.snap command in the case of only a master snapshot. For more
information, see “Data gathered by gpfs.snap for a controller snapshot” on page 302.
When the gpfs.snap command runs with no options, data is collected for each of the enabled protocols.
You can turn off the collection of all protocol data and specify the type of protocol information to be
collected using the --protocol option. For more information, see gpfs.snap command in IBM Storage
Scale: Command and Programming Reference Guide.
The following data is gathered by gpfs.snap on Linux for protocols:
• “Data gathered for SMB on Linux” on page 303
• “Data gathered for NFS on Linux” on page 304
• “Data gathered for Object on Linux” on page 304
• “Data gathered for CES on Linux” on page 306
• “Data gathered for authentication on Linux” on page 307
• “Data gathered for performance on Linux” on page 309
• “Data gathered by gpfs.snap for File audit logging and Watchfolder components” on page 308

Data gathered by gpfs.snap on all platforms


These items are always obtained by the gpfs.snap command when gathering data for an AIX, Linux, or
Windows node:
1. The output of these commands:
• arp, arp -a
• ethtool with -i, -g, -a, -k, -c, -S per networking interface
• df -k
• ifconfig interface
• ip neigh
• ip link
• ip maddr
• ip route show table all
• ipcs -a
• iptables -L -v
• ls -l /dev
• ls -l /usr/lpp/mmfs/bin
• mmdevdiscover
• mmfsadm dump alloc hist



• mmfsadm dump alloc stats
• mmfsadm dump allocmgr
• mmfsadm dump allocmgr hist
• mmfsadm dump allocmgr stats
• mmfsadm dump cfgmgr
• mmfsadm dump config
• mmfsadm dump dealloc stats
• mmfsadm dump disk
• mmfsadm dump fs
• mmfsadm dump malloc
• mmfsadm dump mmap
• mmfsadm dump mutex
• mmfsadm dump nsd
• mmfsadm dump rpc
• mmfsadm dump sgmgr
• mmfsadm dump stripe
• mmfsadm dump tscomm
• mmfsadm dump version
• mmfsadm dump waiters
• netstat with the -i, -r, -rn, -s, and -v options
• ps -edf
• tspreparedisk -S
• gdscheck -p
• vmstat
2. The contents of these files:
• /etc/cufile.json
• /etc/syslog.conf or /etc/syslog-ng.conf
• /proc/interrupts
• /proc/net/softnet_stat
• /proc/softirqs
• /tmp/mmfs/internal*
• /tmp/mmfs/trcrpt*
• /var/adm/ras/cufile.log
• /var/adm/ras/mmfs.log.*
• /var/mmfs/gen/*
• /var/mmfs/etc/*
• /var/mmfs/tmp/*
• /var/mmfs/ssl/* except for complete.map and id_rsa files

Data gathered by gpfs.snap on AIX


This topic describes the type of data that is always gathered by the gpfs.snap command on the AIX
platform.
These items are always obtained by the gpfs.snap command when gathering data for an AIX node:



1. The output of these commands:
• errpt -a
• lssrc -a
• lslpp -hac
• no -a
2. The contents of these files:
• /etc/filesystems
• /etc/trcfmt

System Monitor data


The following data is collected to help analyze the monitored system data:
• The output of these commands is collected for each relevant node:
– mmhealth node eventlog
– mmhealth node show

• The contents of these files:


– /var/adm/ras/mmsysmonitor.*.log*
– /var/adm/ras/top_data*
– /var/mmfs/tmp/mmhealth.log
– /var/mmfs/tmp/debugmmhealth.log
– /var/mmfs/mmsysmon/mmsysmonitor.conf

• The output of these commands is collected once for the cluster:


– tsctl shownodes up
– mmhealth cluster show
– mmccr flist

• The contents of the mmsysmon.json CCR file.

Data gathered by gpfs.snap on Linux


This topic describes the type of data that is always gathered by the gpfs.snap command on the Linux
platform.
Note: The gpfs.snap command does not collect installation toolkit logs. You can collect these logs by
using the installer.snap.py script that is located in the same directory as the installation toolkit.
For more information, see Logging and debugging for installation toolkit in IBM Storage Scale: Concepts,
Planning, and Installation Guide.
These items are always obtained by the gpfs.snap command when gathering data for a Linux node:
1. The output of these commands:
• dmesg
• fdisk -l
• lsmod
• lspci
• rpm -qa
• rpm --verify gpfs.base
• rpm --verify gpfs.docs



• rpm --verify gpfs.gpl
• rpm --verify gpfs.msg.en_US
2. The content of these files:
• /etc/filesystems
• /etc/fstab
• /etc/*release
• /proc/cpuinfo
• /proc/version
• /usr/lpp/mmfs/src/config/site.mcr
• /var/log/messages*
The following data is also collected on Linux on Z:
1. The output of the dbginfo.sh tool.
If s390-tools are installed, then the output of dbginfo.sh is captured.
2. The content of these files:
• /boot/config-$(active-kernel). For example, /boot/
config-3.10.0-123.6.3.el7.s390x

Performance monitoring data


The following data is collected to enable performance monitoring diagnosis:
1. The output of these commands:
• mmperfmon config show
• ps auxw | grep ZIMon
• service pmsensors status
• service pmcollector status
• mmhealth node show perfmon -v
• du -h /opt/IBM/zimon
• ls -laR /opt/IBM/zimon/data
• mmdiag --waiters --iohist --threads --stats --memory
• mmfsadm eventsExporter mmpmon chms
• mmfsadm dump nsd
• mmfsadm dump mb
2. The content of these files:
• /var/log/zimon/*
• /opt/IBM/zimon/*.cfg
3. The outputs of these commands are collected once for the cluster:
• mmperfmon query --list=keys --raw

Call home configuration data


The following data is collected to enable call home diagnosis:
1. The content of these files is collected for each relevant node:
• /var/mmfs/tmp/mmcallhome.log
• /var/mmfs/tmp/callhome/log/callhomeutils.log



• /var/mmfs/callhome/*
2. The output of these commands is collected once for the cluster:
• mmcallhome capability list
• mmcallhome group list
• mmcallhome info list
• mmcallhome proxy list
• mmcallhome schedule list
• mmcallhome status list
3. The output of the mmcallhome test connection command is collected once for each relevant
node.

GUI data
The following data is collected to enable GUI diagnosis:
• The output of these commands:
– pg_dump -U postgres -h 127.0.0.1 -n fscc postgres
– /usr/lpp/mmfs/gui/bin/get_version
– getent passwd scalemgmt
– getent group scalemgmt
– iptables -L -n
– iptables -L -n -t nat
– systemctl kill gpfsgui --signal=3 --kill-who=main # trigger a core dump
– systemctl status gpfsgui
– journalctl _SYSTEMD_UNIT=gpfsgui.service --no-pager -l
• The content of these files:
– /etc/sudoers
– /etc/sysconfig/gpfsgui
– /opt/ibm/wlp/usr/servers/gpfsgui/*.xml
– /var/lib/pgsql/data/*.conf
– /var/lib/pgsql/data/pg_log/*
– /var/lib/mmfs/gui/*
– /var/log/cnlog/*
– /var/crash/scalemgmt/javacore*
– /var/crash/scalemgmt/heapdump*
– /var/crash/scalemgmt/Snap*
– /usr/lpp/mmfs/gui/conf/*
• The output of these commands is collected once for the cluster:
– /usr/lpp/mmfs/lib/ftdc/mmlssnap.sh
• The content of these CCR files is collected once for the cluster:
– _gui.settings
– _gui.user.repo
– _gui.dashboards
– _gui.snapshots
– key-value pair: gui_master_node



System monitor data
The following data is collected to help analyze the monitored system data:
• The output of these commands is collected for each relevant node:
– mmhealth node eventlog
– mmhealth node show

• The contents of these files:


– /var/adm/ras/mmsysmonitor.*.log*
– /var/adm/ras/top_data*
– /var/mmfs/tmp/mmhealth.log
– /var/mmfs/tmp/debugmmhealth.log
– /var/mmfs/mmsysmon/mmsysmonitor.conf

• The output of these commands is collected once for the cluster:


– tsctl shownodes up
– mmhealth cluster show
– mmccr flist

• The contents of the mmsysmon.json CCR file.

InfiniBand interface data


The output of the following commands is collected on Linux nodes with InfiniBand network interface in
case of extended network discovery:
• ibstat
• iblinkinfo
• ibdev2netdev
• ibnetdiscover
• ip a
• cat /proc/net/dev
• ls -l /sys/class/infiniband/
The output of the ibdiagnet command is collected once for the cluster. For more information, see the
“Data gathered by gpfs.snap for a controller snapshot” on page 302 section.

Data gathered by gpfs.snap on Windows


This topic describes the type of data that is always gathered by the gpfs.snap command on the
Windows platform.
These items are always obtained by the gpfs.snap command when gathering data for a Windows node:
1. The output from systeminfo.exe
2. Any raw trace files *.tmf and mmfs.trc*
3. The *.pdb symbols from /usr/lpp/mmfs/bin/symbols



Data gathered by gpfs.snap for a controller snapshot
This topic describes the type of data that is always gathered by the gpfs.snap command for a master
snapshot.
When the gpfs.snap command is specified with no options, a master snapshot is taken on the node
where the command was issued. All of the information from “Data gathered by gpfs.snap on all platforms”
on page 296, “Data gathered by gpfs.snap on AIX” on page 297, “Data gathered by gpfs.snap on Linux”
on page 298, and “Data gathered by gpfs.snap on Windows” on page 301 is obtained, as well as this data:
The output of these commands:
• mmauth
• mmgetstate -a
• mmlscluster
• mmlsconfig
• mmlsdisk
• mmlsfileset
• mmlsfs
• mmlspolicy
• mmlsmgr
• mmlsnode -a
• mmlsnsd
• mmlssnapshot
• mmremotecluster
• mmremotefs
• tsstatus
The contents of the /var/adm/ras/mmfs.log.* file on all nodes in the cluster.

Performance monitoring data


The master snapshot, when taken on a Linux node, collects the following data:
1. The output of these commands:
• mmlscluster
• mmdiag --waiters --iohist --threads --stats --memory
• mmfsadm eventsExporter mmpmon chms
• mmfsadm dump nsd
• mmfsadm dump mb
Note: The InfiniBand fabric and performance monitoring data are collected only when the master node is
a Linux node.

InfiniBand interface data


The output of the ibdiagnet command is collected on Linux nodes with InfiniBand network interfaces in
case of extended network discovery. It discovers the InfiniBand structure or fabric, and might take several
minutes to run.
The ibdiagnet command creates the following files:
• ibdiagnet2.debug
• ibdiagnet2.log
• ibdiagnet2.db_csv



• ibdiagnet2.lst
• ibdiagnet2.net_dump
• ibdiagnet2.sm
• ibdiagnet2.pm
• ibdiagnet2.nodes_info
• ibdiagnet2.pkey
• ibdiagnet2.aguid
• ibdiagnet2.fdbs
• ibdiagnet2.mcfdbs
• ibdmchk.sw_out_port_num_paths
• ibdmchk.sw_out_port_num_dlids

Data gathered by gpfs.snap on Linux for protocols


When the gpfs.snap command runs with no options, data is collected for each of the enabled protocols.
You can turn off the collection of all protocol data and specify the type of protocol information to be
collected using the --protocol option.

Data gathered for SMB on Linux


The following data is always obtained by the gpfs.snap command for the server message block (SMB).
1. The output of these commands:
• ctdb status
• ctdb scriptstatus
• ctdb ip
• ctdb statistics
• ctdb uptime
• wbinfo -P
• rpm -q gpfs.smb (or dpkg-query on Ubuntu)
• rpm -q samba (or dpkg-query on Ubuntu)
• rpm --verify gpfs.smb (or dpkg-query on Ubuntu)
• net conf list
• net idmap get ranges
• net idmap dump
• sharesec --view-all
• ps -ef
• ls -lR /var/lib/samba
• mmlsperfdata smb2Throughput -n 1440 -b 60
• mmlsperfdata smb2IOLatency -n 1440 -b 60
• mmlsperfdata smbConnections -n 1440 -b 60
• ls -l /var/ctdb/CTDB_DBDIR
• ls -l /var/ctdb/persistent
• mmlsperfdata op_count -n 1440 -b 60
• mmlsperfdata op_time -n 1440 -b 60
2. The content of these files:
• /var/adm/ras/log.smbd*



• /var/adm/ras/log.wb-*
• /var/adm/ras/log.winbindd*
• /var/adm/ras/cores/smbd/* (Only files from the last 60 days)
• /var/adm/ras/cores/winbindd/* (Only files from the last 60 days.)
• /var/lib/samba/*.tdb
• /var/lib/samba/msg/*
• /etc/sysconfig/gpfs-ctdb/* (or /etc/default/ctdb on Ubuntu)
• /var/mmfs/ces/smb.conf
• /var/mmfs/ces/smb.ctdb.nodes
• /var/lib/ctdb/persistent/*.tdb* # except of secrets.tdb
• /etc/sysconfig/ctdb

Data gathered for NFS on Linux


The following data is always obtained by the gpfs.snap command for NFS.
1. The output of these commands:
• mmnfs export list
• mmnfs config list
• rpm -qi - for all installed ganesha packages (or dpkg-query on Ubuntu)
• systemctl status nfs-ganesha
• systemctl status rpcbind
• rpcinfo -p
• ps -ef | grep '^UID\|[r]pc'
2. The content of these files:
• /proc/$(pidof gpfs.ganesha.nfsd)/limits
• /var/mmfs/ces/nfs-config/*
• /var/log/ganesha*
• /var/tmp/abrt/* for all sub-directories, not older than 60 days
• /etc/sysconfig/ganesha
Files stored in the CCR:
• gpfs.ganesha.exports.conf
• gpfs.ganesha.main.conf
• gpfs.ganesha.nfsd.conf
• gpfs.ganesha.log.conf
• gpfs.ganesha.statdargs.conf

Data gathered for Object on Linux


The following data is obtained by the gpfs.snap command for Object protocol.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.



- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
1. The output of these commands:
• curl -i http://localhost:8080/info -X GET
• rpm -qi - for all installed openstack rpms (or dpkg-query on Ubuntu)
• ps aux | grep keystone
2. The content of these files:
• /var/log/swift/account-reaper.log*
• /var/log/swift/account-reaper.error*
• /var/log/swift/account-replicator.log*
• /var/log/swift/account-replicator.error*
• /var/log/swift/account-server.log*
• /var/log/swift/account-server.error*
• /var/log/swift/container-replicator.log*
• /var/log/swift/container-replicator.error*
• /var/log/swift/container-server.log*
• /var/log/swift/container-server.error*
• /var/log/swift/container-updater.log*
• /var/log/swift/container-updater.error*
• /var/log/swift/ibmobjectizer.log*
• /var/log/swift/object-expirer.log*
• /var/log/swift/object-expirer.error*
• /var/log/swift/object-replicator.log*
• /var/log/swift/object-replicator.error*
• /var/log/swift/object-server.log*
• /var/log/swift/object-server.error*
• /var/log/swift/object-updater.log*
• /var/log/swift/object-updater.error*
• /var/log/swift/policyscheduler.log*
• /var/log/swift/proxy-server.log*
• /var/log/swift/proxy-server.error*
• /var/log/swift/swift.log*
• /var/log/swift/swift.error*
• /var/log/keystone/keystone.log*
• /var/log/keystone/httpd-error.log*
• /var/log/keystone/httpd-access.log*
• /var/log/secure/*
• /var/log/httpd/access_log*
• /var/log/httpd/error_log*
• /var/log/httpd/ssl_access_log*



• /var/log/httpd/ssl_error_log*
• /var/log/httpd/ssl_request_log*
• /var/log/messages
• /etc/httpd/conf/httpd.conf
• /etc/httpd/conf.d/ssl.conf
• /etc/keystone/keystone-paste.ini
• /etc/keystone/logging.conf
• /etc/keystone/policy.json
• /etc/keystone/ssl/certs/*
Any files that are stored in the directory that is specified by the objectization_tmp_dir parameter
in the spectrum-scale-objectizer.conf CCR file are also collected.
The following files are collected under /var/mmfs/tmp/object.snap while stripping any sensitive
information:
• /etc/swift/proxy-server.conf
• /etc/swift/swift.conf
• /etc/keystone/keystone.conf
Files stored in the CCR:
• account-server.conf
• account.builder
• account.ring.gz
• container-server.conf
• container.builder
• container.ring.gz
• object-server.conf
• object*.builder
• object*.ring.gz
• container-reconciler.conf
• spectrum-scale-compression-scheduler.conf
• spectrum-scale-object-policies.conf
• spectrum-scale-objectizer.conf
• spectrum-scale-object.conf
• object-server-sof.conf
• object-expirer.conf
• keystone-paste.ini
• policy*.json
• sso/certs/ldap_cacert.pem
• spectrum-scale-compression-status.stat
• wsgi-keystone.conf

Data gathered for CES on Linux


The following data is always obtained by the gpfs.snap command for any enabled protocols.
The following data is collected by the gpfs.snap command by default if any protocols are enabled:
• Information collected for each relevant node:



1. The output of these commands:
– mmces service list -Y
– mmces service list --verbose -Y
– mmces state show -Y
– mmces events active -Y
– mmhealth node eventlog -Y
2. The content of these files:
– /var/adm/ras/mmcesdr.log*
– /var/adm/ras/mmprotocoltrace.log*
• Information collected once for the cluster:
1. The output of these commands:
– mmces node list
– mmces address list
– ls -l <cesSharedRoot>/ces/addrs/*
– mmces service list -a
– mmlscluster --ces
2. The content of the following file:
<cesSharedRoot>/ces/connections/*

Data gathered for authentication on Linux


The following data is always obtained by the gpfs.snap command for any enabled protocol.
1. The output of these commands:
• mmcesuserauthlsservice
• mmcesuserauthckservice --data-access-method all --nodes cesNodes
• mmcesuserauthckservice --data-access-method all --nodes cesNodes --server-
reachability
• systemctl status ypbind
• systemctl status sssd
• lsof -i
• sestatus
• systemctl status firewalld
• systemctl status iptables
• net ads info
2. The content of these files:
• /etc/nsswitch.conf
• /etc/ypbind.conf
• /etc/idmapd.conf
• /etc/krb5.conf
• /etc/firewalld/*
• /etc/sssd/sssd.conf (with password removed)
• /var/log/sssd/*
• /var/log/secure/*
• /var/mmfs/etc/krb5_scale.keytab



Files stored in the CCR:
• NSSWITCH_CONF
• YP_CONF
• LDAP_TLS_CACERT
• authccr

Data gathered by gpfs.snap for File audit logging and Watchfolder components
These items are always obtained by the gpfs.snap command when data is gathered for File audit
logging and Watchfolder components:
1. The output of these commands:
• rpm -qi
For gpfs.librdkafka packages (or dpkg-query on Ubuntu)
• mmdiag --eventproducer -Y
• mmwatch all list -Y
• tslspolicy <dev> -L --ptn
2. The contents of these files:
• /var/adm/ras/mmmsgqueue.log
• /var/adm/ras/mmaudit.log
• /var/adm/ras/mmwf.log
• /var/adm/ras/mmwatch.log
• /var/adm/ras/tswatchmonitor.log
• /var/adm/ras/mmwfclient.log
• Watchfolder configuration file (/<Device>/.msgq/.audit/.config)
• File audit logging configuration file (/<Device>/.msgq/<watchID>/.config)

Files stored in CCR


The following file is stored in CCR:
• spectrum-scale-file-audit.conf

Data gathered for Hadoop on Linux


The following data is gathered when running gpfs.snap with the --hadoop core argument:
1. The output of these commands:
• ps -elf
• netstat --nap
2. The content of these files:
• /var/log/hadoop
• /var/log/flume
• /var/log/hadoop-hdfs
• /var/log/hadoop-httpfs
• /var/log/hadoop-mapreduce
• /var/log/hadoop-yarn
• /var/log/hbase
• /var/log/hive



• /var/log/hive-hcatalog
• /var/log/kafka
• /var/log/knox
• /var/log/oozie
• /var/log/ranger
• /var/log/solr
• /var/log/spark
• /var/log/sqoop
• /var/log/zookeeper
• /var/mmfs/hadoop/etc/hadoop
• /var/log/hadoop/root
Note: From IBM Storage Scale 5.0.5, gpfs.snap --hadoop is able to capture the HDFS
Transparency logs from the user configured directories.

Limitations of customizations when using sudo wrapper


If the sudo wrapper is in use, persistent environment variables that are saved in $HOME/.bashrc,
/root/.bashrc, $HOME/.kshrc, /root/.kshrc, and similar paths are not initialized when the current
non-root gpfsadmin user elevates their rights with the sudo command. Thus, gpfs.snap is not able to detect
any customization options for the Hadoop data collection. Starting from IBM Storage Scale 5.0, if you
want to apply your customization to the Hadoop debugging data with an active sudo wrapper, you
must create symbolic links from the actual log files to the corresponding locations mentioned in the list of
collected log files, as shown in the example that follows.
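For example, if the HDFS Transparency logs were redirected to a custom directory (the path that is shown
below is hypothetical), a symbolic link can make them visible to gpfs.snap at one of the collected
locations:

# /data/hadoop/logs is a hypothetical custom log directory
ln -s /data/hadoop/logs /var/log/hadoop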

Data gathered for core dumps on Linux


The following data is gathered when running gpfs.snap with the --protocol core argument:
• If core_pattern is set to dump to a file, then it gathers files matching that pattern.
• If core_pattern is set to redirect to abrt, then everything is gathered from the directory specified in the
abrt.conf file under DumpLocation. If this is not set, then /var/tmp/abrt is used.
• Other core dump mechanisms are not supported by the script.
• Any files in the directory /var/adm/ras/cores/ are also gathered.
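To check which core dump mechanism is in effect on a node before collecting data, you can inspect the
standard Linux kernel setting (generic Linux behavior, not specific to gpfs.snap):

# shows either a file pattern or a pipe to a handler such as abrt
cat /proc/sys/kernel/core_pattern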

Data gathered for performance on Linux


The following data is obtained by the gpfs.snap command, if the option --performance is provided.
• The output of the command top -n 1 -b on all nodes.
• The performance metrics data gathered on the node having the ACTIVE THRESHOLD MONITOR role.
The output contains the metrics data of the last 24 hours, collected from all the metrics sensors
enabled on the cluster.
Note: If the frequency of a sensor is greater than 60 seconds, then the data gathering is performed
with the original frequency of the metrics group. If the frequency of a sensor is less than or equal to 60
seconds, the frequency is set to 60 seconds.
• The performance data package also includes the performance monitoring report from the top metric for the
last 24 hours.
To gather performance data, it is recommended that you run the gpfs.snap command with the -a
option (all nodes), or at least on the pmcollector node that has the ACTIVE THRESHOLD MONITOR role.
For more information about the ACTIVE THRESHOLD MONITOR role, see “Active threshold monitor role”
on page 24.



mmdumpperfdata command
Collects and archives the performance metric information.

Synopsis
mmdumpperfdata [--remove-tree] [StartTime EndTime | Duration]

Availability
Available on all IBM Storage Scale editions.

Description
The mmdumpperfdata command runs all named queries and computed metrics used in the mmperfmon
query command for each cluster node, writes the output into CSV files, and archives all the files in a
single .tgz file. The file name is in the iss_perfdump_YYYYMMDD_hhmmss.tgz format.
The tar archive file contains a folder for each cluster node and within that folder there is a text file with the
output of each named query and computed metric.
If the start and end time, or duration are not given, then by default the last four hours of metrics
information is collected and archived.

Parameters
--remove-tree or -r
Removes the folder structure that was created for the TAR archive file.
StartTime
Specifies the start timestamp for query in the YYYY-MM-DD[-hh:mm:ss] format.
EndTime
Specifies the end timestamp for query in the YYYY-MM-DD[-hh:mm:ss] format.
Duration
Specifies the duration in seconds.

Exit status
0
Successful completion.
nonzero
A failure has occurred.

Security
You must have root authority to run the mmdumpperfdata command.
The node on which the command is issued must be able to execute remote shell commands on any other
node in the cluster without the use of a password and without producing any extraneous messages.
For more information, see Requirements for administering a GPFS file system in IBM Storage Scale:
Administration Guide.

Examples
1. To archive the performance metric information collected for the default time period of last four hours
and also delete the folder structure that the command creates, issue this command:

mmdumpperfdata --remove-tree



The system displays output similar to this:

Using the following options:


tstart :
tend :
duration: 14400
rem tree: True
Target folder: ./iss_perfdump_20150513_142420
[1/120] Dumping data for node=fscc-hs21-22 and query q=swiftAccThroughput
file: ./iss_perfdump_20150513_142420/fscc-hs21-22/swiftAccThroughput
[2/120] Dumping data for node=fscc-hs21-22 and query q=NetDetails
file: ./iss_perfdump_20150513_142420/fscc-hs21-22/NetDetails
[3/120] Dumping data for node=fscc-hs21-22 and query q=ctdbCallLatency
file: ./iss_perfdump_20150513_142420/fscc-hs21-22/ctdbCallLatency
[4/120] Dumping data for node=fscc-hs21-22 and query q=usage
file: ./iss_perfdump_20150513_142420/fscc-hs21-22/usage

2. To archive the performance metric information collected for a specific time period, issue this
command:

mmdumpperfdata --remove-tree 2015-01-25-04:04:04 2015-01-26-04:04:04

The system displays output similar to this:

Using the following options:


tstart : 2015-01-25 04:04:04
tend : 2015-01-26 04:04:04
duration:
rem tree: True
Target folder: ./iss_perfdump_20150513_144344
[1/120] Dumping data for node=fscc-hs21-22 and query q=swiftAccThroughput
file: ./iss_perfdump_20150513_144344/fscc-hs21-22/swiftAccThroughput
[2/120] Dumping data for node=fscc-hs21-22 and query q=NetDetails
file: ./iss_perfdump_20150513_144344/fscc-hs21-22/NetDetails

3. To archive the performance metric information collected in the last 200 seconds, issue this command:

mmdumpperfdata --remove-tree 200

The system displays output similar to this:

Using the following options:


tstart :
tend :
duration: 200
rem tree: True
Target folder: ./iss_perfdump_20150513_144426
[1/120] Dumping data for node=fscc-hs21-22 and query q=swiftAccThroughput
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/swiftAccThroughput
[2/120] Dumping data for node=fscc-hs21-22 and query q=NetDetails
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/NetDetails
[3/120] Dumping data for node=fscc-hs21-22 and query q=ctdbCallLatency
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/ctdbCallLatency
[4/120] Dumping data for node=fscc-hs21-22 and query q=usage
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/usage
[5/120] Dumping data for node=fscc-hs21-22 and query q=smb2IORate
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/smb2IORate
[6/120] Dumping data for node=fscc-hs21-22 and query q=swiftConLatency
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/swiftConLatency
[7/120] Dumping data for node=fscc-hs21-22 and query q=swiftCon
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/swiftCon
[8/120] Dumping data for node=fscc-hs21-22 and query q=gpfsNSDWaits
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/gpfsNSDWaits
[9/120] Dumping data for node=fscc-hs21-22 and query q=smb2Throughput
file: ./iss_perfdump_20150513_144426/fscc-hs21-22/smb2Throughput

See also
For more information, see mmperfmon command in the IBM Storage Scale: Command and Programming
Reference Guide.



Location
/usr/lpp/mmfs/bin

mmfsadm command
The mmfsadm command is intended for use by trained service personnel. IBM suggests you do not run
this command except under the direction of such personnel.
Note: The contents of the mmfsadm command output might vary from release to release, which could
invalidate any user programs that depend on that output. Therefore, we suggest that you do not create
user programs that invoke the mmfsadm command.
The mmfsadm command extracts data from GPFS without using locking, so that it can collect the data in
the event of locking errors. In certain rare cases, this can cause GPFS or the node to fail. Several options
of this command exist and might be required for use:
cleanup
Delete shared segments left by a previously failed GPFS daemon without actually restarting the
daemon.
dump what
Dumps the state of a large number of internal state values that might be useful in determining the
sequence of events. The what parameter can be set to all, indicating that all available data should be
collected, or to another value, indicating more restricted collection of data. The output is presented to
STDOUT and should be collected by redirecting STDOUT, as shown in the example after this list. For more
information about internal GPFS™ states, see the mmdiag command in IBM Storage Scale: Command and
Programming Reference Guide.
showtrace
Shows the current level for each subclass of tracing available in GPFS. Trace level 14 provides the
highest level of tracing for the class and trace level 0 provides no tracing. Intermediate values exist for
most classes. More tracing requires more storage and results in a higher probability of overlaying the
required event.
trace class n
Sets the trace class to the value specified by n. Actual trace gathering only occurs when the
mmtracectl command has been issued.
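For example, to capture a full internal dump on the local node as described for the dump option, redirect
STDOUT to a file (the file name that is shown is only an illustration):

mmfsadm dump all > /tmp/mmfs/mmfsadm.dump.all.out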

Other options provide interactive GPFS debugging, but are not described here. Output from the mmfsadm
command is required in almost all cases where a GPFS problem is being reported. The mmfsadm
command collects data only on the node where it is issued. Depending on the nature of the problem,
the mmfsadm command output might be required from several or all nodes. The mmfsadm command
output from the file system manager is often required.
To determine where the file system manager is, issue the mmlsmgr command:

mmlsmgr

Output similar to this example is displayed:

file system manager node


---------------- ------------------
fs3 9.114.94.65 (c154n01)
fs2 9.114.94.73 (c154n09)
fs1 9.114.94.81 (c155n01)

Cluster manager node: 9.114.94.65 (c154n01)

Commands for GPFS cluster state information


There are a number of GPFS commands used to obtain cluster state information.
The information is organized as follows:
• “The mmafmctl Device getstate command” on page 313
• “The mmdiag command” on page 313



• “The mmgetstate command” on page 313
• “The mmlscluster command” on page 314
• “The mmlsconfig command” on page 315
• “The mmrefresh command” on page 315
• “The mmsdrrestore command” on page 316
• “The mmexpelnode command” on page 316

The mmafmctl Device getstate command


The mmafmctl Device getstate command displays the status of active file management cache filesets
and gateway nodes.
When this command displays a NeedsResync target/fileset state, inconsistencies between home and
cache are being fixed automatically; however, unmount and mount operations are required to return the
state to Active.
The mmafmctl Device getstate command is fully described in the Command reference section in the
IBM Storage Scale: Command and Programming Reference Guide.

The mmhealth command


The mmhealth command monitors and displays the health status of services hosted on nodes and the
health status of complete cluster in a single view.
Use the mmhealth command to monitor the health of the node and services hosted on the node in
IBM Storage Scale. If the status of a service hosted on any node is failed, the mmhealth command
allows the user to view the event log to analyze and determine the problem. The mmhealth command
provides a list of events that are responsible for the failure of any service. On detailed analysis of these
events, a set of troubleshooting steps might be followed to resume the failed service. For more details on
troubleshooting, see “How to get started with troubleshooting” on page 245.
The mmhealth command is fully described in the mmhealth command section in the IBM Storage Scale:
Command and Programming Reference Guide and Chapter 2, “Monitoring system health by using the
mmhealth command,” on page 13.
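For example, the following commands, which also appear in the data collection lists earlier in this
chapter, show the health state of the local node, the whole cluster, and the local event log:

mmhealth node show
mmhealth cluster show
mmhealth node eventlog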

The mmdiag command


The mmdiag command displays diagnostic information about the internal GPFS state on the current node.
Use the mmdiag command to query various aspects of the GPFS internal state for troubleshooting and
tuning purposes. The mmdiag command displays information about the state of GPFS on the node where
it is executed. The command obtains the required information by querying the GPFS daemon process
(mmfsd), and thus, functions only when the GPFS daemon is running.
The mmdiag command is fully described in the Command reference section in IBM Storage Scale:
Command and Programming Reference Guide.
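For example, the following invocation, the same combination that gpfs.snap collects as listed earlier in
this chapter, displays waiter, I/O history, thread, statistics, and memory information for the local node:

mmdiag --waiters --iohist --threads --stats --memory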

The mmgetstate command


The mmgetstate command displays the state of the GPFS daemon on one or more nodes.
These flags are of interest for problem determination:
-a
List all nodes in the GPFS cluster. The option does not display information for nodes that cannot be
reached. You may obtain more information if you specify the -v option.
-L
Additionally display quorum, number of nodes up, and total number of nodes.



The total number of nodes may sometimes be larger than the actual number of nodes in the cluster.
This is the case when nodes from other clusters have established connections for the purposes of
mounting a file system that belongs to your cluster.
-s
Display summary information: number of local and remote nodes that have joined in the cluster,
number of quorum nodes, and so forth.
-v
Display intermediate error messages.
The remaining flags have the same meaning as in the mmshutdown command. They can be used to
specify the nodes on which to get the state of the GPFS daemon.
The GPFS states recognized and displayed by this command are:
active
GPFS is ready for operations.
arbitrating
A node is trying to form quorum with the other available nodes.
down
GPFS daemon is not running on the node or is recovering from an internal error.
unknown
Unknown value. Node cannot be reached or some other error occurred.
For example, to display the quorum, the number of nodes up, and the total number of nodes, issue:

mmgetstate -L -a

The system displays output similar to:

Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
--------------------------------------------------------------------
2 k154n06 1* 3 7 active quorum node
3 k155n05 1* 3 7 active quorum node
4 k155n06 1* 3 7 active quorum node
5 k155n07 1* 3 7 active
6 k155n08 1* 3 7 active
9 k156lnx02 1* 3 7 active
11 k155n09 1* 3 7 active

where *, if present, indicates that tiebreaker disks are being used.


The mmgetstate command is fully described in the Command reference section in the IBM Storage Scale:
Command and Programming Reference Guide.

The mmlscluster command


The mmlscluster command displays GPFS cluster configuration information.
The syntax of the mmlscluster command is:

mmlscluster

The system displays output similar to:

GPFS cluster information


========================
GPFS cluster name: cluster1.kgn.ibm.com
GPFS cluster id: 680681562214606028
GPFS UID domain: cluster1.kgn.ibm.com
Remote shell command: /usr/bin/rsh
Remote file copy command: /usr/bin/rcp
Repository type: CCR

GPFS cluster configuration servers:


-----------------------------------
Primary server: k164n06.kgn.ibm.com
Secondary server: k164n05.kgn.ibm.com



Node Daemon node name IP address Admin node name Designation
----------------------------------------------------------------------------------
1 k164n04.kgn.ibm.com 198.117.68.68 k164n04.kgn.ibm.com quorum
2 k164n05.kgn.ibm.com 198.117.68.71 k164n05.kgn.ibm.com quorum
3 k164n06.kgn.ibm.com 198.117.68.70 k164n06.kgn.ibm.com quorum-manager

The mmlscluster command is fully described in the Command reference section in the IBM Storage
Scale: Command and Programming Reference Guide.

The mmlsconfig command


The mmlsconfig command displays current configuration data for a GPFS cluster.
Depending on your configuration, additional information not documented in either the mmcrcluster
command or the mmchconfig command may be displayed to assist in problem determination.
If a configuration parameter is not shown in the output of this command, the default value for that
parameter, as documented in the mmchconfig command, is in effect.
The syntax of the mmlsconfig command is:

mmlsconfig

The system displays information similar to:

Configuration data for cluster cl1.cluster:


---------------------------------------------
clusterName cl1.cluster
clusterId 680752107138921233
autoload no
minReleaseLevel 5.1.6.0
pagepool 1G
maxblocksize 16m
[c5n97g]
pagepool 2G
[common]
cipherList AES256-SHA256

File systems in cluster cl1.cluster:


--------------------------------------
/dev/fs2

The mmlsconfig command is fully described in the Command reference section in the IBM Storage Scale:
Command and Programming Reference Guide.

The mmrefresh command


The mmrefresh command is intended for use by experienced system administrators who know how to
collect data and run debugging routines.
Use the mmrefresh command only when you suspect that something is not working as expected and
the reason for the malfunction is a problem with the GPFS configuration data. For example, a mount
command fails with a device not found error, and you know that the file system exists. Another
example is if any of the files in the /var/mmfs/gen directory were accidentally erased. Under normal
circumstances, the GPFS command infrastructure maintains the cluster data files automatically and there
is no need for user intervention.
The mmrefresh command places the most recent GPFS cluster configuration data files on the specified
nodes. The syntax of this command is:

mmrefresh [-f] [ -a | -N {Node[,Node...] | NodeFile | NodeClass}]

The -f flag can be used to force the GPFS cluster configuration data files to be rebuilt whether they
appear to be at the most current level or not. If no other option is specified, the command affects only the
node on which it is run. The remaining flags have the same meaning as in the mmshutdown command, and
are used to specify the nodes on which the refresh is to be performed.
For example, to place the GPFS cluster configuration data files at the latest level, on all nodes in the
cluster, issue:

mmrefresh -a

The mmsdrrestore command


The mmsdrrestore command is intended for use by experienced system administrators.
The mmsdrrestore command restores the latest GPFS system files on the specified nodes. If no nodes
are specified, the command restores the configuration information only on the node where it is invoked. If
the local GPFS configuration file is missing, the file specified with the -F option from the node specified
with the -p option is used instead.
This command works best when used in conjunction with the mmsdrbackup user exit, which is described
in the GPFS user exits topic in the IBM Storage Scale: Command and Programming Reference Guide.
For more information, see mmsdrrestore command in IBM Storage Scale: Command and Programming
Reference Guide.
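For example, to rebuild the missing configuration data on the local node from the copy that is held on a
healthy node, you might run a command similar to the following, where the node name and file path are
illustrative:

mmsdrrestore -p k164n06 -F /var/mmfs/gen/mmsdrfs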

The mmexpelnode command


The mmexpelnode command instructs the cluster manager to expel the target nodes and to run the
normal recovery protocol.
The cluster manager keeps a list of the expelled nodes. Expelled nodes are not allowed to rejoin the
cluster until they are removed from the list using the -r or --reset option on the mmexpelnode
command. The list of expelled nodes is also reset if the cluster manager node goes down or is
changed with mmchmgr -c.
The syntax of the mmexpelnode command is:

mmexpelnode [-o | --once] [-f | --is-fenced] [-w | --wait] -N Node[,Node...]

Or,

mmexpelnode {-l | --list}

Or,

mmexpelnode {-r | --reset} -N {all | Node[,Node...]}

The flags used by this command are:


-o | --once
Specifies that the nodes should not be prevented from rejoining. After the recovery protocol
completes, expelled nodes are allowed to rejoin the cluster immediately, without the need to first
invoke the mmexpelnode --reset command.
-f | --is-fenced
Specifies that the nodes are fenced out and precluded from accessing any GPFS disks without first
rejoining the cluster (for example, the nodes were forced to reboot by turning off power). Using this
flag allows GPFS to start log recovery immediately, skipping the normal 35-second wait.
Warning: The -f option should only be used when the administrator is sure that the node
being expelled can no longer write to any of the disks. This includes both the locally attached
disks and the remote NSDs. The node in question must be down, and it must not have any
disk I/O pending on any of the devices. Incorrect use of the -f option could lead to file system
corruption.

-w | --wait
Instructs the mmexpelnode command to wait until GPFS recovery for the failed node has completed
before it runs.
In version 5.1.3 and later, and when the tscCmdAllowRemoteConnections configuration
parameter is set to no, if the command is issued by one of the nodes specified with the -N option, the
command may return before the GPFS recovery for the node is completed. To ensure that the command
waits for GPFS recovery to complete, issue the command from a node that is not included in the list
provided to the -N option.
-l | --list
Lists all currently expelled nodes.
-r | --reset
Allows the specified nodes to rejoin the cluster (that is, resets the status of the nodes). To unexpel all
of the expelled nodes, issue: mmexpelnode -r -N all.
-N {all | Node[,Node...]}
Specifies a list of host names or IP addresses that represent the nodes to be expelled or unexpelled.
Specify the daemon interface host names or IP addresses as shown by the mmlscluster command.
The mmexpelnode command does not support administration node names or node classes.
Note: -N all can only be used to unexpel nodes.

Examples of the mmexpelnode command


1. To expel node c100c1rp3, issue the command:

mmexpelnode -N c100c1rp3

2. To show a list of expelled nodes, issue the command:

mmexpelnode --list

The system displays information similar to:

Node List
---------------------
192.168.100.35 (c100c1rp3.ppd.pok.ibm.com)

3. To allow node c100c1rp3 to rejoin the cluster, issue the command:

mmexpelnode -r -N c100c1rp3

GPFS file system and disk information commands


The problem determination tools provided with GPFS for file system, disk and NSD problem determination
are intended for use by experienced system administrators who know how to collect data and run
debugging routines.
The information is organized as follows:
• “Restricted mode mount” on page 318
• “Read-only mode mount” on page 318
• “The lsof command” on page 318
• “The mmlsmount command” on page 318
• “The mmapplypolicy -L command” on page 319
• “The mmcheckquota command” on page 325
• “The mmlsnsd command” on page 326
• “The mmwindisk command” on page 326
• “The mmfileid command” on page 327
• “The SHA digest” on page 329

Restricted mode mount


GPFS provides a capability to mount a file system in a restricted mode when significant data structures
have been destroyed by disk failures or other error conditions.
Restricted mode mount is not intended for normal operation, but it might allow the recovery of some user
data. Only data that is referenced by intact directories and metadata structures is available.
Attention:
1. Follow the procedures in “Information to be collected before contacting the IBM Support
Center” on page 555, and then contact the IBM Support Center before using this capability.
2. Attempt this only after you have tried to repair the file system with the mmfsck command. (See
“Why does the offline mmfsck command fail with "Error creating internal storage"?” on page
250.)
3. Use this procedure only if the failing disk is attached to an AIX or Linux node.
Some disk failures can result in the loss of enough metadata to render the entire file system unable to
mount. In that event it might be possible to preserve some user data through a restricted mode mount.
This facility should only be used if a normal mount does not succeed, and should be considered a last
resort to save some data after a fatal disk failure.
Restricted mode mount is invoked by using the mmmount command with the -o rs flags. After a
restricted mode mount is done, some data may be sufficiently accessible to allow copying to another
file system. The success of this technique depends on the actual disk structures damaged.
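For example, for an illustrative file system named fs1, a restricted mode mount might be attempted as
follows:

mmmount fs1 -o rs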

Read-only mode mount


Some disk failures can result in the loss of enough metadata to make the entire file system unable to
mount. In that event, it might be possible to preserve some user data through a read-only mode mount.
Attention: Attempt this only after you have tried to repair the file system with the mmfsck
command.
This facility should be used only if a normal mount does not succeed, and should be considered a last
resort to save some data after a fatal disk failure.
Read-only mode mount is invoked by using the mmmount command with the -o ro flags. After a read-
only mode mount is done, some data may be sufficiently accessible to allow copying to another file
system. The success of this technique depends on the actual disk structures damaged.
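For example, for the same illustrative file system fs1, a read-only mode mount might be attempted as
follows:

mmmount fs1 -o ro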

The lsof command


The lsof (list open files) command returns the user processes that are actively using a file system. It is
sometimes helpful in determining why a file system remains in use and cannot be unmounted.
The lsof command is available in Linux distributions or by using anonymous ftp from
lsof.itap.purdue.edu (cd to /pub/tools/unix/lsof). The inventor of the lsof command is
Victor A. Abell ([email protected]), Purdue University Computing Center.
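For example, to list the processes that still hold files open on a mounted file system, you might run a
command similar to the following, where the mount point is illustrative:

lsof /gpfs/fs1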

The mmlsmount command


The mmlsmount command lists the nodes that have a given GPFS file system mounted.
Use the -L option to see the node name and IP address of each node that has the file system in use. This
command can be used for all file systems, all remotely mounted file systems, or file systems mounted on
nodes of certain clusters.
While not specifically intended as a service aid, the mmlsmount command is useful in these situations:
1. When writing and debugging new file system administrative procedures, to determine which nodes
have a file system mounted and which do not.

2. When mounting a file system on multiple nodes, to determine which nodes have successfully
completed the mount and which have not.
3. When a file system is mounted, but appears to be inaccessible to some nodes but accessible to others,
to determine the extent of the problem.
4. When a normal (not force) unmount has not completed, to determine the affected nodes.
5. When a file system has force unmounted on some nodes but not others, to determine the affected
nodes.
For example, to list the nodes having all file systems mounted:

mmlsmount all -L

The system displays output similar to:

File system fs2 is mounted on 7 nodes:


192.168.3.53 c25m3n12 c34.cluster
192.168.110.73 c34f2n01 c34.cluster
192.168.110.74 c34f2n02 c34.cluster
192.168.148.77 c12c4apv7 c34.cluster
192.168.132.123 c20m2n03 c34.cluster (internal mount)
192.168.115.28 js21n92 c34.cluster (internal mount)
192.168.3.124 c3m3n14 c3.cluster

File system fs3 is not mounted.

File system fs3 (c3.cluster:fs3) is mounted on 7 nodes:


192.168.2.11 c2m3n01 c3.cluster
192.168.2.12 c2m3n02 c3.cluster
192.168.2.13 c2m3n03 c3.cluster
192.168.3.123 c3m3n13 c3.cluster
192.168.3.124 c3m3n14 c3.cluster
192.168.110.74 c34f2n02 c34.cluster
192.168.80.20 c21f1n10 c21.cluster
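
To limit the output to a single file system, you might run a command similar to the following, where the
file system name is illustrative:

mmlsmount fs2 -L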

The mmlsmount command is fully described in the Command reference section in the IBM Storage Scale:
Command and Programming Reference Guide.

The mmapplypolicy -L command


Use the -L flag of the mmapplypolicy command when you are using policy files to manage storage
resources and the data stored on those resources. This command has different levels of diagnostics to
help debug and interpret the actions of a policy file.
The -L flag, used in conjunction with the -I test flag, allows you to display the actions that would be
performed by a policy file without actually applying it. This way, potential errors and misunderstandings
can be detected and corrected without actually making these mistakes.
These are the trace levels for the mmapplypolicy -L flag:
Value
Description
0
Displays only serious errors.
1
Displays some information as the command runs, but not for each file.
2
Displays each chosen file and the scheduled action.
3
Displays the information for each of the preceding trace levels, plus each candidate file and the
applicable rule.
4
Displays the information for each of the preceding trace levels, plus each explicitly excluded file, and
the applicable rule.
5
Displays the information for each of the preceding trace levels, plus the attributes of candidate and
excluded files.
6
Displays the information for each of the preceding trace levels, plus files that are not candidate files,
and their attributes.
These terms are used:
candidate file
A file that matches a policy rule.
chosen file
A candidate file that has been scheduled for an action.
This policy file is used in the examples that follow:

/* Exclusion rule */
RULE 'exclude *.save files' EXCLUDE WHERE NAME LIKE '%.save'
/* Deletion rule */
RULE 'delete' DELETE FROM POOL 'sp1' WHERE NAME LIKE '%tmp%'
/* Migration rule */
RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WHERE NAME LIKE
'%file%'
/* Typo in rule : removed later */
RULE 'exclude 2' EXCULDE
/* List rule */
RULE EXTERNAL LIST 'tmpfiles' EXEC '/tmp/exec.list'
RULE 'all' LIST 'tmpfiles' where name like '%tmp%'

These are some of the files in file system /fs1:

. .. data1 file.tmp0 file.tmp1 file0 file1 file1.save file2.save

The mmapplypolicy command is fully described in the Command reference section in the IBM Storage
Scale: Command and Programming Reference Guide.

mmapplypolicy -L 0
Use this option to display only serious errors.
In this example, there is an error in the policy file. This command:

mmapplypolicy fs1 -P policyfile -I test -L 0

produces output similar to this:

[E:-1] Error while loading policy rules.


PCSQLERR: Unexpected SQL identifier token - 'EXCULDE'.
PCSQLCTX: at line 8 of 8: RULE 'exclude 2' {{{EXCULDE}}}
mmapplypolicy: Command failed. Examine previous error messages to determine cause.

The error in the policy file is corrected by removing these lines:

/* Typo in rule */
RULE 'exclude 2' EXCULDE

Now rerun the command:

mmapplypolicy fs1 -P policyfile -I test -L 0

No messages are produced because no serious errors were detected.

mmapplypolicy -L 1
Use this option to display all of the information (if any) from the previous level, plus some information
as the command runs, but not for each file. This option also displays total numbers for file migration and
deletion.
This command:

mmapplypolicy fs1 -P policyfile -I test -L 1

produces output similar to this:

[I] GPFS Current Data Pool Utilization in KB and %


sp1 5120 19531264 0.026214%
system 102400 19531264 0.524288%
[I] Loaded policy rules from policyfile.
Evaluating MIGRATE/DELETE/EXCLUDE rules with CURRENT_TIMESTAMP = 2009-03-04@02:40:12 UTC
parsed 0 Placement Rules, 0 Restore Rules, 3 Migrate/Delete/Exclude Rules,
1 List Rules, 1 External Pool/List Rules
/* Exclusion rule */
RULE 'exclude *.save files' EXCLUDE WHERE NAME LIKE '%.save'
/* Deletion rule */
RULE 'delete' DELETE FROM POOL 'sp1' WHERE NAME LIKE '%tmp%'
/* Migration rule */
RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WHERE NAME LIKE
'%file%'
/* List rule */
RULE EXTERNAL LIST 'tmpfiles' EXEC '/tmp/exec.list'
RULE 'all' LIST 'tmpfiles' where name like '%tmp%'
[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
[I] Inodes scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
[I] Summary of Rule Applicability and File Choices:
Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule
0 2 32 0 0 0 RULE 'exclude *.save files' EXCLUDE WHERE(.)
1 2 16 2 16 0 RULE 'delete' DELETE FROM POOL 'sp1' WHERE(.)
2 2 32 2 32 0 RULE 'migration to system pool' MIGRATE FROM
POOL \
'sp1' TO POOL 'system' WHERE(.)
3 2 16 2 16 0 RULE 'all' LIST 'tmpfiles' WHERE(.)

[I] Files with no applicable rules: 5.

[I] GPFS Policy Decisions and File Choice Totals:


Chose to migrate 32KB: 2 of 2 candidates;
Chose to premigrate 0KB: 0 candidates;
Already co-managed 0KB: 0 candidates;
Chose to delete 16KB: 2 of 2 candidates;
Chose to list 16KB: 2 of 2 candidates;
0KB of chosen data is illplaced or illreplicated;
Predicted Data Pool Utilization in KB and %:
sp1 5072 19531264 0.025969%
system 102432 19531264 0.524451%

mmapplypolicy -L 2
Use this option to display all of the information from the previous levels, plus each chosen file and the
scheduled migration or deletion action.
This command:

mmapplypolicy fs1 -P policyfile -I test -L 2

produces output similar to this:

[I] GPFS Current Data Pool Utilization in KB and %


sp1 5120 19531264 0.026214%
system 102400 19531264 0.524288%
[I] Loaded policy rules from policyfile.
Evaluating MIGRATE/DELETE/EXCLUDE rules with CURRENT_TIMESTAMP = 2009-03-04@02:43:10 UTC
parsed 0 Placement Rules, 0 Restore Rules, 3 Migrate/Delete/Exclude Rules,
1 List Rules, 1 External Pool/List Rules
/* Exclusion rule */
RULE 'exclude *.save files' EXCLUDE WHERE NAME LIKE '%.save'
/* Deletion rule */
RULE 'delete' DELETE FROM POOL 'sp1' WHERE NAME LIKE '%tmp%'
/* Migration rule */
RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WHERE NAME LIKE
'%file%'
/* List rule */
RULE EXTERNAL LIST 'tmpfiles' EXEC '/tmp/exec.list'
RULE 'all' LIST 'tmpfiles' where name like '%tmp%'
[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
[I] Inodes scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
WEIGHT(INF) LIST 'tmpfiles' /fs1/file.tmp1 SHOW()
WEIGHT(INF) LIST 'tmpfiles' /fs1/file.tmp0 SHOW()
WEIGHT(INF) DELETE /fs1/file.tmp1 SHOW()
WEIGHT(INF) DELETE /fs1/file.tmp0 SHOW()
WEIGHT(INF) MIGRATE /fs1/file1 TO POOL system SHOW()
WEIGHT(INF) MIGRATE /fs1/file0 TO POOL system SHOW()
[I] Summary of Rule Applicability and File Choices:
Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule
0 2 32 0 0 0 RULE 'exclude *.save files' EXCLUDE WHERE(.)
1 2 16 2 16 0 RULE 'delete' DELETE FROM POOL 'sp1' WHERE(.)
2 2 32 2 32 0 RULE 'migration to system pool' MIGRATE FROM
POOL \
'sp1' TO POOL 'system' WHERE(.)
3 2 16 2 16 0 RULE 'all' LIST 'tmpfiles' WHERE(.)

[I] Files with no applicable rules: 5.

[I] GPFS Policy Decisions and File Choice Totals:


Chose to migrate 32KB: 2 of 2 candidates;
Chose to premigrate 0KB: 0 candidates;
Already co-managed 0KB: 0 candidates;
Chose to delete 16KB: 2 of 2 candidates;
Chose to list 16KB: 2 of 2 candidates;
0KB of chosen data is illplaced or illreplicated;
Predicted Data Pool Utilization in KB and %:
sp1 5072 19531264 0.025969%
system 102432 19531264 0.524451%

where the lines:

WEIGHT(INF) LIST 'tmpfiles' /fs1/file.tmp1 SHOW()


WEIGHT(INF) LIST 'tmpfiles' /fs1/file.tmp0 SHOW()
WEIGHT(INF) DELETE /fs1/file.tmp1 SHOW()
WEIGHT(INF) DELETE /fs1/file.tmp0 SHOW()
WEIGHT(INF) MIGRATE /fs1/file1 TO POOL system SHOW()
WEIGHT(INF) MIGRATE /fs1/file0 TO POOL system SHOW()

show the chosen files and the scheduled action.

mmapplypolicy -L 3
Use this option to display all of the information from the previous levels, plus each candidate file and the
applicable rule.
This command:

mmapplypolicy fs1 -P policyfile -I test -L 3

produces output similar to this:

[I] GPFS Current Data Pool Utilization in KB and %


sp1 5120 19531264 0.026214%
system 102400 19531264 0.524288%
[I] Loaded policy rules from policyfile.
Evaluating MIGRATE/DELETE/EXCLUDE rules with CURRENT_TIMESTAMP = 2009-03-04@02:32:16 UTC
parsed 0 Placement Rules, 0 Restore Rules, 3 Migrate/Delete/Exclude Rules,
1 List Rules, 1 External Pool/List Rules
/* Exclusion rule */
RULE 'exclude *.save files' EXCLUDE WHERE NAME LIKE '%.save'
/* Deletion rule */
RULE 'delete' DELETE FROM POOL 'sp1' WHERE NAME LIKE '%tmp%'
/* Migration rule */
RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WHERE NAME LIKE
'%file%'
/* List rule */
RULE EXTERNAL LIST 'tmpfiles' EXEC '/tmp/exec.list'
RULE 'all' LIST 'tmpfiles' where name like '%tmp%'
[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
/fs1/file.tmp1 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp1 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file.tmp0 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file1 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)
/fs1/file0 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)
[I] Inodes scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
WEIGHT(INF) LIST 'tmpfiles' /fs1/file.tmp1 SHOW()
WEIGHT(INF) LIST 'tmpfiles' /fs1/file.tmp0 SHOW()
WEIGHT(INF) DELETE /fs1/file.tmp1 SHOW()
WEIGHT(INF) DELETE /fs1/file.tmp0 SHOW()
WEIGHT(INF) MIGRATE /fs1/file1 TO POOL system SHOW()
WEIGHT(INF) MIGRATE /fs1/file0 TO POOL system SHOW()
[I] Summary of Rule Applicability and File Choices:
Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule
0 2 32 0 0 0 RULE 'exclude *.save files' EXCLUDE WHERE(.)
1 2 16 2 16 0 RULE 'delete' DELETE FROM POOL 'sp1' WHERE(.)
2 2 32 2 32 0 RULE 'migration to system pool' MIGRATE FROM
POOL \
'sp1' TO POOL 'system' WHERE(.)
3 2 16 2 16 0 RULE 'all' LIST 'tmpfiles' WHERE(.)

[I] Files with no applicable rules: 5.

[I] GPFS Policy Decisions and File Choice Totals:


Chose to migrate 32KB: 2 of 2 candidates;
Chose to premigrate 0KB: 0 candidates;
Already co-managed 0KB: 0 candidates;
Chose to delete 16KB: 2 of 2 candidates;
Chose to list 16KB: 2 of 2 candidates;
0KB of chosen data is illplaced or illreplicated;
Predicted Data Pool Utilization in KB and %:
sp1 5072 19531264 0.025969%
system 102432 19531264 0.524451%

where the lines:

/fs1/file.tmp1 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)


/fs1/file.tmp1 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file.tmp0 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file1 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)
/fs1/file0 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)

show the candidate files and the applicable rules.

mmapplypolicy -L 4
Use this option to display all of the information from the previous levels, plus the name of each explicitly
excluded file, and the applicable rule.
This command:

mmapplypolicy fs1 -P policyfile -I test -L 4

produces the following additional information:

[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
/fs1/file1.save RULE 'exclude *.save files' EXCLUDE
/fs1/file2.save RULE 'exclude *.save files' EXCLUDE
/fs1/file.tmp1 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp1 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file.tmp0 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file1 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)
/fs1/file0 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)

where the lines:

/fs1/file1.save RULE 'exclude *.save files' EXCLUDE


/fs1/file2.save RULE 'exclude *.save files' EXCLUDE

indicate that there are two excluded files, /fs1/file1.save and /fs1/file2.save.

mmapplypolicy -L 5
Use this option to display all of the information from the previous levels, plus the attributes of candidate
and excluded files.
These attributes include:
• MODIFICATION_TIME
• USER_ID
• GROUP_ID
• FILE_SIZE
• POOL_NAME
• ACCESS_TIME
• KB_ALLOCATED
• FILESET_NAME
This command:

mmapplypolicy fs1 -P policyfile -I test -L 5

produces the following additional information:

[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
/fs1/file1.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'exclude \
*.save files' EXCLUDE
/fs1/file2.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-03@21:19:57 16 root] RULE 'exclude \
*.save files' EXCLUDE
/fs1/file.tmp1 [2022-03-04@02:09:31 0 0 0 sp1 2022-03-04@02:09:31 0 root] RULE 'delete' DELETE
\
FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp1 [2022-03-04@02:09:31 0 0 0 sp1 2022-03-04@02:09:31 0 root] RULE 'all' LIST \
'tmpfiles' WEIGHT(INF)
/fs1/file.tmp0 [2022-03-04@02:09:38 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'delete' \
DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 [2022-03-04@02:09:38 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'all' \
LIST 'tmpfiles' WEIGHT(INF)
/fs1/file1 [2022-03-03@21:32:41 0 0 16384 sp1 2022-03-03@21:32:41 16 root] RULE 'migration
\
to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WEIGHT(INF)
/fs1/file0 [2022-03-03@21:21:11 0 0 16384 sp1 2022-03-03@21:32:41 16 root] RULE 'migration
\
to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WEIGHT(INF)

where the lines:

/fs1/file1.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'exclude \


*.save files' EXCLUDE
/fs1/file2.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-03@21:19:57 16 root] RULE 'exclude \
*.save files' EXCLUDE

show the attributes of excluded files /fs1/file1.save and /fs1/file2.save.

mmapplypolicy -L 6
Use this option to display all of the information from the previous levels, plus files that are not candidate
files, and their attributes.
These attributes include:
• MODIFICATION_TIME
• USER_ID
• GROUP_ID
• FILE_SIZE
• POOL_NAME
• ACCESS_TIME
• KB_ALLOCATED
• FILESET_NAME
This command:

mmapplypolicy fs1 -P policyfile -I test -L 6

produces the following additional information:


[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
/fs1/. [2022-03-04@02:10:43 0 0 8192 system 2022-03-04@02:17:43 8 root] NO RULE APPLIES
/fs1/file1.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE \
'exclude *.save files' EXCLUDE
/fs1/file2.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-03@21:19:57 16 root] RULE \
'exclude *.save files' EXCLUDE
/fs1/file.tmp1 [2022-03-04@02:09:31 0 0 0 sp1 2022-03-04@02:09:31 0 root] RULE 'delete' \
DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp1 [2022-03-04@02:09:31 0 0 0 sp1 2022-03-04@02:09:31 0 root] RULE 'all' LIST \
'tmpfiles' WEIGHT(INF)
/fs1/data1 [2022-03-03@21:20:23 0 0 0 sp1 2022-03-04@02:09:31 0 root] NO RULE APPLIES
/fs1/file.tmp0 [2022-03-04@02:09:38 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'delete' \
DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 [2022-03-04@02:09:38 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'all' LIST \
'tmpfiles' WEIGHT(INF)
/fs1/file1 [2022-03-03@21:32:41 0 0 16384 sp1 2022-03-03@21:32:41 16 root] RULE 'migration \
to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WEIGHT(INF)
/fs1/file0 [2022-03-03@21:21:11 0 0 16384 sp1 2022-03-03@21:32:41 16 root] RULE 'migration \
to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WEIGHT(INF)

where the line:


/fs1/data1 [2022-03-03@21:20:23 0 0 0 sp1 2022-03-04@02:09:31 0 root] NO RULE APPLIES

contains information about the data1 file, which is not a candidate file.

The mmcheckquota command


The mmcheckquota command counts inode and space usage for a file system and writes the collected
data into quota files.
Run the mmcheckquota command if:
• You have MMFS_QUOTA error log entries. This error log entry is created when the quota manager has a
problem reading or writing the quota file.
• Quota information is lost due to node failure. Node failure could leave users unable to open files or
could deny them disk space that their quotas allow.
• The in-doubt value is approaching the quota limit. The sum of the in-doubt value and the current usage
cannot exceed the hard limit. Therefore, the actual block space and number of files available to you
might be constrained by the in-doubt value. If the in-doubt value approaches a significant percentage of
the maximum quota amount, use the mmcheckquota command to account for the lost space and files.
• Any user, group, or fileset quota files are corrupted.
During the normal operation of file systems with quotas enabled (not running mmcheckquota online), the
usage data reflects the actual usage of blocks and inodes; for example, if you delete files, the usage
amount decreases. The in-doubt value does not reflect this usage amount. Instead, it is the amount of
quota that the quota server has assigned to its clients. The quota server does not know whether the
assigned amount has actually been used.
The only situation in which the in-doubt value is important is when the sum of the usage and the in-doubt
value is greater than the quota hard limit. In this case, you cannot allocate more blocks or inodes unless
you reduce the usage amount.
Note: The mmcheckquota command is I/O intensive, and running it might increase your system workload
substantially. Run it only when the system workload is light.
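For example, to recount and repair the quota usage data for a single file system, you might run a
command similar to the following, where the device name is illustrative:

mmcheckquota fs1

To check every file system in the cluster, the -a option can be used instead.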
The mmcheckquota command is fully described in the Command reference section in the IBM Storage
Scale: Administration Guide.

The mmlsnsd command
The mmlsnsd command displays information about the currently defined disks in the cluster.
For example, if you issue mmlsnsd, your output is similar to this:

File system Disk name NSD servers


---------------------------------------------------------------------------
fs2 hd3n97 c5n97g.ppd.pok.ibm.com,c5n98g.ppd.pok.ibm.com,c5n99g.ppd.pok.ibm.com
fs2 hd4n97 c5n97g.ppd.pok.ibm.com,c5n98g.ppd.pok.ibm.com,c5n99g.ppd.pok.ibm.com
fs2 hd5n98 c5n98g.ppd.pok.ibm.com,c5n97g.ppd.pok.ibm.com,c5n99g.ppd.pok.ibm.com
fs2 hd6n98 c5n98g.ppd.pok.ibm.com,c5n97g.ppd.pok.ibm.com,c5n99g.ppd.pok.ibm.com
fs2 sdbnsd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
fs2 sdcnsd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
fs2 sddnsd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
fs2 sdensd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
fs2 sdgnsd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
fs2 sdfnsd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
fs2 sdhnsd c5n94g.ppd.pok.ibm.com,c5n96g.ppd.pok.ibm.com
(free disk) hd2n97 c5n97g.ppd.pok.ibm.com,c5n98g.ppd.pok.ibm.com

To find out the local device names for these disks, use the mmlsnsd command with the -m option. For
example, issuing mmlsnsd -m produces output similar to this:

Disk name NSD volume ID Device Node name Remarks


------------------------------------------------------------------------------------
hd2n97 0972846145C8E924 /dev/hdisk2 c5n97g.ppd.pok.ibm.com server node
hd2n97 0972846145C8E924 /dev/hdisk2 c5n98g.ppd.pok.ibm.com server node
hd3n97 0972846145C8E927 /dev/hdisk3 c5n97g.ppd.pok.ibm.com server node
hd3n97 0972846145C8E927 /dev/hdisk3 c5n98g.ppd.pok.ibm.com server node
hd4n97 0972846145C8E92A /dev/hdisk4 c5n97g.ppd.pok.ibm.com server node
hd4n97 0972846145C8E92A /dev/hdisk4 c5n98g.ppd.pok.ibm.com server node
hd5n98 0972846245EB501C /dev/hdisk5 c5n97g.ppd.pok.ibm.com server node
hd5n98 0972846245EB501C /dev/hdisk5 c5n98g.ppd.pok.ibm.com server node
hd6n98 0972846245DB3AD8 /dev/hdisk6 c5n97g.ppd.pok.ibm.com server node
hd6n98 0972846245DB3AD8 /dev/hdisk6 c5n98g.ppd.pok.ibm.com server node
hd7n97 0972846145C8E934 /dev/hd7n97 c5n97g.ppd.pok.ibm.com server node

To obtain extended information for NSDs, use the mmlsnsd command with the -X option. For example,
issuing mmlsnsd -X produces output similar to this:

Disk name NSD volume ID Device Devtype Node name Remarks


------------------------------------------------------------------------------------------------
---
hd3n97 0972846145C8E927 /dev/hdisk3 hdisk c5n97g.ppd.pok.ibm.com server
node,pr=no
hd3n97 0972846145C8E927 /dev/hdisk3 hdisk c5n98g.ppd.pok.ibm.com server
node,pr=no
hd5n98 0972846245EB501C /dev/hdisk5 hdisk c5n97g.ppd.pok.ibm.com server
node,pr=no
hd5n98 0972846245EB501C /dev/hdisk5 hdisk c5n98g.ppd.pok.ibm.com server
node,pr=no
sdfnsd 0972845E45F02E81 /dev/sdf generic c5n94g.ppd.pok.ibm.com server node
sdfnsd 0972845E45F02E81 /dev/sdm generic c5n96g.ppd.pok.ibm.com server node

The mmlsnsd command is fully described in the Command reference section in the IBM Storage Scale:
Administration Guide.

The mmwindisk command


On Windows nodes, use the mmwindisk command to view all disks known to the operating system along
with partitioning information relevant to GPFS.
For example, if you issue mmwindisk list, your output is similar to this:

Disk Avail Type Status Size GPFS Partition ID


---- ----- ------- --------- -------- ------------------------------------
0 BASIC ONLINE 137 GiB
1 GPFS ONLINE 55 GiB 362DD84E-3D2E-4A59-B96B-BDE64E31ACCF
2 GPFS ONLINE 200 GiB BD5E64E4-32C8-44CE-8687-B14982848AD2
3 GPFS ONLINE 55 GiB B3EC846C-9C41-4EFD-940D-1AFA6E2D08FB
4 GPFS ONLINE 55 GiB 6023455C-353D-40D1-BCEB-FF8E73BF6C0F
5 GPFS ONLINE 55 GiB 2886391A-BB2D-4BDF-BE59-F33860441262
6 GPFS ONLINE 55 GiB 00845DCC-058B-4DEB-BD0A-17BAD5A54530
7 GPFS ONLINE 55 GiB 260BCAEB-6E8A-4504-874D-7E07E02E1817
8 GPFS ONLINE 55 GiB 863B6D80-2E15-457E-B2D5-FEA0BC41A5AC
9 YES UNALLOC OFFLINE 55 GiB
10 YES UNALLOC OFFLINE 200 GiB

Where:
Disk
is the Windows disk number as shown in the Disk Management console and the DISKPART command-
line utility.
Avail
shows the value YES when the disk is available and in a state suitable for creating an NSD.
GPFS Partition ID
is the unique ID for the GPFS partition on the disk.
The mmwindisk command does not provide the NSD volume ID. You can use mmlsnsd -m to find the
relationship between NSDs and devices, which are disk numbers on Windows.

The mmfileid command


The mmfileid command identifies files that are on areas of a disk that are damaged or suspect.
Attention: Use this command only when the IBM Support Center directs you to do so.

Before you run mmfileid, you must run a disk analysis utility and obtain the disk sector numbers that are
damaged or suspect. These sectors are input to the mmfileid command.
The command syntax is as follows:

mmfileid Device
{-d DiskDesc | -F DescFile}
[-o OutputFile] [-f NumThreads] [-t Directory]
[-N {Node[,Node...] | NodeFile | NodeClass}] [--qos QOSClass]

The input parameters are as follows:


Device
The device name for the file system.
-d DiskDesc
A descriptor that identifies the disk to be scanned. DiskDesc has the following format:

NodeName:DiskName[:PhysAddr1[-PhysAddr2]]

It has the following alternative format:

:{NsdName|DiskNum|BROKEN}[:PhysAddr1[-PhysAddr2]]

NodeName
Specifies a node in the GPFS cluster that has access to the disk to scan. You must specify this
value if the disk is identified with its physical volume name. Do not specify this value if the disk is
identified with its NSD name or its GPFS disk ID number, or if the keyword BROKEN is used.
DiskName
Specifies the physical volume name of the disk to scan as known on node NodeName.
NsdName
Specifies the GPFS NSD name of the disk to scan.
DiskNum
Specifies the GPFS disk ID number of the disk to scan as displayed by the mmlsdisk -L
command.
BROKEN
Tells the command to scan all the disks in the file system for files with broken addresses that
result in lost data.
PhysAddr1[-PhysAddr2]
Specifies the range of physical disk addresses to scan. The default value for PhysAddr1 is zero.
The default value for PhysAddr2 is the value for PhysAddr1.
If both PhysAddr1 and PhysAddr2 are zero, the command searches the entire disk.
The following lines are examples of valid disk descriptors:

k148n07:hdisk9:2206310-2206810
:gpfs1008nsd:
:10:27645856
:BROKEN

-F DescFile
Specifies a file that contains a list of disk descriptors, one per line.
-f NumThreads
Specifies the number of worker threads to create. The default value is 16. The minimum value
is 1. The maximum value is the maximum number allowed by the operating system function
pthread_create for a single process. A suggested value is twice the number of disks in the file
system.
-N {Node[,Node...] | NodeFile | NodeClass}
Specifies the list of nodes that participate in determining the disk addresses. This command supports
all defined node classes. The default is all or the current value of the defaultHelperNodes
configuration parameter of the mmchconfig command.
For general information on how to specify node names, see Specifying nodes as input to GPFS
commands in the IBM Storage Scale: Administration Guide.
-o OutputFile
The path name of a file to which the result from the mmfileid command is to be written. If not
specified, the result is sent to standard output.
-t Directory
Specifies the directory to use for temporary storage during mmfileid command processing. The
default directory is /tmp.
--qos QOSClass
Specifies the Quality of Service for I/O operations (QoS) class to which the instance of the command
is assigned. If you do not specify this parameter, the instance of the command is assigned by default
to the maintenance QoS class. This parameter has no effect unless the QoS service is enabled. For
more information, see the help topic on the mmchqos command in the IBM Storage Scale: Command
and Programming Reference Guide. Specify one of the following QoS classes:
maintenance
This QoS class is typically configured to have a smaller share of file system IOPS. Use this class for
I/O-intensive, potentially long-running GPFS commands, so that they contribute less to reducing
overall file system performance.
other
This QoS class is typically configured to have a larger share of file system IOPS. Use this class for
administration commands that are not I/O-intensive.
For more information, see the help topic on Setting the Quality of Service for I/O operations (QoS) in
the IBM Storage Scale: Administration Guide.
You can redirect the output to a file with the -o flag and sort the output on the inode number with the
sort command.
The mmfileid command output contains one line for each inode found to be on a corrupted disk sector.
Each line of the command output has this format:

InodeNumber LogicalDiskAddress SnapshotId Filename

InodeNumber
Indicates the inode number of the file identified by mmfileid.
LogicalDiskAddress
Indicates the disk block (disk sector) number of the file identified by mmfileid.
SnapshotId
Indicates the snapshot identifier for the file. A SnapshotId of 0 means that the file is not a snapshot
file.
Filename
Indicates the name of the file identified by mmfileid. File names are relative to the root of the file
system in which they reside.
Assume that a disk analysis tool reports that disks hdisk6, hdisk7, hdisk8, and hdisk9 contain
bad sectors, and that the file addr.in has the following contents:

k148n07:hdisk9:2206310-2206810
k148n07:hdisk8:2211038-2211042
k148n07:hdisk8:2201800-2202800
k148n01:hdisk6:2921879-2926880
k148n09:hdisk7:1076208-1076610

You run the following command:

mmfileid /dev/gpfsB -F addr.in

The command output might be similar to the following example:

Address 2201958 is contained in the Block allocation map (inode 1)


Address 2206688 is contained in the ACL Data file (inode 4, snapId 0)
Address 2211038 is contained in the Log File (inode 7, snapId 0)
14336 1076256 0 /gpfsB/tesDir/testFile.out
14344 2922528 1 /gpfsB/x.img

The lines that begin with the word Address represent GPFS system metadata files or reserved disk areas.
If your output contains any lines like these, do not attempt to replace or repair the indicated files. If you
suspect that any of the special files are damaged, call the IBM Support Center for assistance.
The following line of output indicates that inode number 14336, disk address 1076256 contains file /
gpfsB/tesDir/testFile.out. The 0 to the left of the name indicates that the file does not belong to a
snapshot. This file is on a potentially bad disk sector area:

14336 1076256 0 /gpfsB/tesDir/testFile.out

The following line of output indicates that inode number 14344, disk address 2922528 contains file /
gpfsB/x.img. The 1 to the left of the name indicates that the file belongs to snapshot number 1. This file
is on a potentially bad disk sector area:

14344 2922528 1 /gpfsB/x.img

The SHA digest


The Secure Hash Algorithm (SHA) digest is relevant only when using GPFS in a multi-cluster environment.
The SHA digest is a short and convenient way to identify a key registered with either the mmauth show
or mmremotecluster command. In theory, two keys may have the same SHA digest. In practice, this is
extremely unlikely. The SHA digest can be used by the administrators of two GPFS clusters to determine if
they each have received (and registered) the right key file from the other administrator.
An example is the situation of two administrators, Admin1 and Admin2, who have registered each other's
key file, but find that mount attempts by Admin1 for file systems owned by Admin2 fail with the error
message: Authorization failed. To determine which administrator has registered the wrong key, each
runs mmauth show and sends the local cluster's SHA digest to the other administrator. Admin1 then runs
the mmremotecluster command and verifies that the SHA digest for Admin2's cluster matches the SHA
digest for the key that Admin1 has registered. Admin2 then runs the mmauth show command and verifies
that the SHA digest for Admin1's cluster matches the key that Admin2 has authorized.

If Admin1 finds that the SHA digests do not match, Admin1 runs the mmremotecluster update
command, passing the correct key file as input.
If Admin2 finds that the SHA digests do not match, Admin2 runs the mmauth update command, passing
the correct key file as input.
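For example, if Admin1 determines that the wrong key is registered for Admin2's cluster, the correction
might look similar to the following, where the remote cluster name and key file path are illustrative:

mmremotecluster update clusterB.kgn.ibm.com -k /tmp/clusterB_id_rsa.pub

Similarly, Admin2 would run mmauth update with the correct key file received from Admin1.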
This is an example of the output produced by the mmauth show all command:

Cluster name: fksdcm.pok.ibm.com


Cipher list: EXP1024-RC2-CBC-MD5
SHA digest: d5eb5241eda7d3ec345ece906bfcef0b6cd343bd
File system access: fs1 (rw, root allowed)

Cluster name: kremote.cluster


Cipher list: EXP1024-RC4-SHA
SHA digest: eb71a3aaa89c3979841b363fd6d0a36a2a460a8b
File system access: fs1 (rw, root allowed)

Cluster name: dkq.cluster (this cluster)


Cipher list: AUTHONLY
SHA digest: 090cd57a2e3b18ac163e5e9bd5f26ffabaa6aa25
File system access: (all rw)

Collecting details of the issues from performance monitoring tools


This topic describes how to collect details of issues that you might encounter in IBM Storage Scale by
using performance monitoring tools.
With IBM Storage Scale, system administrators can monitor the performance of GPFS and the
communications protocols that it uses. Issue the mmperfmon query command to query performance
data.
Note: If you issue the mmperfmon query command without any additional parameters, you can see a list
of options for querying performance-related information, as shown in the following sample output:

Usage:
mmperfmon query Metric[,Metric...] | Key[,Key...] | NamedQuery [StartTime EndTime | Duration]
[Options]
OR
mmperfmon query compareNodes ComparisonMetric [StartTime EndTime | Duration] [Options]
where
Metric metric name
Key a key consisting of node name, sensor group, optional additional
filters,
metric name, separated by pipe symbol
e.g.: "cluster1.ibm.com|CTDBStats|locking|db_hop_count_bucket_00"
NamedQuery name of a pre-defined query
ComparisonMetric name of a metric to be compared if using CompareNodes
StartTime Start timestamp for query
Format: YYYY-MM-DD-hh:mm:ss
EndTime End timestamp for query. Omitted means: execution time
Format: YYYY-MM-DD-hh:mm:ss
Duration Number of seconds into the past from today or <EndTime>

Options:
-h, --help show this help message and exit
-N NodeName, --Node=NodeName
Defines the node that metrics should be retrieved from
-b BucketSize, --bucket-size=BucketSize
Defines a bucket size (number of seconds), default is
1
-n NumberBuckets, --number-buckets=NumberBuckets
Number of buckets ( records ) to show, default is 10
--filter=Filter Filter criteria for the query to run
--format=Format Common format for all columns
--csv Provides output in csv format.
--raw Provides output in raw format rather than a pretty
table format.
--nice Use colors and other text attributes for output.
--resolve Resolve computed metrics, show metrics used
--short Shorten column names if there are too many to fit into
one row.
--list=List Show list of specified values (overrides other
options). Values are all, metrics, computed, queries,
keys.
Possible named queries are:
compareNodes - Compares a single metric across all nodes running sensors
cpu - Show CPU utilization in system and user space, and context switches
ctdbCallLatency - Show CTDB call latency.
ctdbHopCountDetails - Show CTDB hop count buckets 0 to 5 for one database.
ctdbHopCounts - Show CTDB hop counts (bucket 00 = 1-3 hops) for all databases.
gpfsCRUDopsLatency - Show GPFS CRUD operations latency
gpfsFSWaits - Display max waits for read and write operations for all file systems
gpfsNSDWaits - Display max waits for read and write operations for all disks
gpfsNumberOperations - Get the number of operations to the GPFS file system.
gpfsVFSOpCounts - Display VFS operation counts
netDetails - Get details about the network.
netErrors - Show network problems for all available networks: collisions, drops,
errors
nfsErrors - Get the NFS error count for read and write operations
nfsIOLatency - Get the NFS IO Latency in nanoseconds per second
nfsIORate - Get the NFS IOps per second
nfsQueue - Get the NFS read and write queue size in bytes
nfsThroughput - Get the NFS Throughput in bytes per second
nfsThroughputPerOp - Get the NFS read and write throughput per op in bytes
objAcc - Object account overall performance.
objAccIO - Object account IO details.
objAccLatency - Object proxy Latency.
objAccThroughput - Object account overall Throughput.
objCon - Object container overall performance.
objConIO - Object container IO details.
objConLatency - Object container Latency.
objConThroughput - Object container overall Throughput.
objObj - Object overall performance.
objObjIO - Object overall IO details.
objObjLatency - Object Latency.
objObjThroughput - Object overall Throughput.
objPro - Object proxy overall performance.
objProIO - Object proxy IO details.
objProThroughput - Object proxy overall Throughput.
protocolIOLatency - Compare latency per protocol (smb, nfs, object).
protocolIORate - Get the percentage of total I/O rate per protocol (smb, nfs, object).
protocolThroughput - Get the percentage of total throughput per protocol (smb, nfs,
object).
smb2IOLatency - Get the SMB2 I/O latencies per bucket size ( default 1 sec )
smb2IORate - Get the SMB2 I/O rate in number of operations per bucket size
( default 1 sec )
smb2Throughput - Get the SMB2 Throughput in bytes per bucket size ( default 1 sec )
smb2Writes - Count, # of idle calls, bytes in and out and operation time for smb2
writes
smbConnections - Number of smb connections
usage - Show CPU, memory, storage and network usage
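
For example, to run the predefined cpu query over the last 10 minutes in one-minute buckets, you might
issue a command similar to the following, where the duration and bucket size are illustrative:

mmperfmon query cpu 600 -b 60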

For more information on monitoring performance and analyzing performance related issues, see “Using
the performance monitoring tool” on page 105 and mmperfmon command in the IBM Storage Scale:
Command and Programming Reference Guide

Other problem determination tools


Other problem determination tools include the kernel debugging facilities and the mmpmon command.
If your problem occurs on the AIX operating system, see AIX in IBM Documentation and search for the
appropriate kernel debugging documentation for information about the AIX kdb command.
If your problem occurs on the Linux operating system, see the documentation for your distribution vendor.
If your problem occurs on the Windows operating system, the following tools that are available from the
Windows Sysinternals, might be useful in troubleshooting:
• Debugging Tools for Windows
• Process Monitor
• Process Explorer
• Microsoft Windows Driver Kit
• Microsoft Windows Software Development Kit

The mmpmon command is intended for system administrators to analyze their I/O on the node on which
it is run. It is not primarily a diagnostic tool, but may be used as one for certain problems. For example,
running mmpmon on several nodes may be used to detect nodes that are experiencing poor performance
or connectivity problems.
The syntax of the mmpmon command is fully described in the Command reference section in the IBM
Storage Scale: Command and Programming Reference Guide. For details on the mmpmon command, see
“Monitoring I/O performance with the mmpmon command” on page 59.

Chapter 19. Managing deadlocks
IBM Storage Scale provides functions for automatically detecting potential deadlocks, collecting deadlock
debug data, and breaking up deadlocks.
The distributed nature of GPFS, the complexity of the locking infrastructure, the dependency on
the proper operation of disks and networks, and the overall complexity of operating in a clustered
environment all contribute to increasing the probability of a deadlock.
Deadlocks can be disruptive in certain situations, more so than other types of failure. A deadlock
effectively represents a single point of failure that can render the entire cluster inoperable. When a
deadlock is encountered on a production system, it can take a long time to debug. The typical approach to
recovering from a deadlock involves rebooting all of the nodes in the cluster. Thus, deadlocks can lead to
prolonged and complete outages of clusters.
To troubleshoot deadlocks, you must have specific types of debug data that must be collected while the
deadlock is in progress. Data collection commands must be run manually before the deadlock is broken.
Otherwise, determining the root cause of the deadlock afterward is difficult. Also, without automation,
deadlock detection requires some form of external action, for example, a complaint from a user. Waiting
for a user complaint means that detecting a deadlock in progress might take many hours.
The automated deadlock detection, automated deadlock data collection, and deadlock breakup options
are provided in IBM Storage Scale to make it easier to handle a deadlock situation.
• “Debug data for deadlocks” on page 333
• “Automated deadlock detection” on page 334
• “Automated deadlock data collection” on page 335
• “Automated deadlock breakup” on page 336
• “Deadlock breakup on demand” on page 337

Debug data for deadlocks


Debug data for potential deadlocks is automatically collected. System administrators must monitor and
manage the file systems where debug data is stored.
Automated deadlock detection and automated deadlock data collection are enabled by default.
Automated deadlock breakup is disabled by default.
At the start of the GPFS daemon, the mmfs.log file shows entries like the following:

Thu Jul 16 18:50:14.097 2015: [I] Enabled automated deadlock detection.


Thu Jul 16 18:50:14.098 2015: [I] Enabled automated deadlock debug data
collection.
Thu Jul 16 18:50:14.099 2015: [I] Enabled automated expel debug data collection.
Thu Jul 16 18:50:14.100 2015: [I] Please see https://ibm.biz/Bd4bNK for more
information on deadlock amelioration.

The short URL points to this help topic to make it easier to find the information later.
By default, debug data is put into the /tmp/mmfs directory, or the directory specified for the
dataStructureDump configuration parameter, on each node. Plenty of disk space, typically many GBs,
needs to be available. Debug data is not collected when the directory runs out of disk space.
Important: Before you change the value of dataStructureDump, stop the GPFS trace. Otherwise, you
lose the GPFS trace data. Restart the GPFS trace afterwards.
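A minimal sketch of such a change, assuming that a directory with sufficient free space already exists on
every node and that the path shown is illustrative, might look like the following:

mmtracectl --stop
mmchconfig dataStructureDump=/bigspace/mmfs
mmtracectl --start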
After a potential deadlock is detected and the relevant debug data is collected, IBM Service needs to be
contacted to report the problem and to upload the debug data. Outdated debug data needs to be removed
to make room for new debug data in case a new potential deadlock is detected.

It is the responsibility of system administrators to manage the disk space under the /tmp/mmfs directory
or dataStructureDump. They know which set of debug data is still useful.
The "expel debug data" is similar to the "deadlock debug data", but it is collected when a node is expelled
from a cluster for no apparent reason.

Automated deadlock detection


Automated deadlock detection flags unexpected long waiters as potential deadlocks. Effective deadlock
detection thresholds are self-tuned to reduce false positive detection. You can register a user program for
the deadlockDetected event to receive automatic notification.
GPFS code uses waiters to track what a thread is waiting for and how long it is waiting. Many deadlocks
involve long waiters. In a real deadlock, long waiters do not disappear naturally as the deadlock prevents
the threads from getting what they are waiting for. With some exceptions, long waiters typically indicate
that something in the system is not healthy. A deadlock might be in progress, some disk might be failing,
or the entire system might be overloaded.
Automated deadlock detection monitors waiters to detect potential deadlocks. Some waiters can become
long legitimately under normal operating conditions and such waiters are ignored by automated deadlock
detection. Such waiters appear in the mmdiag --waiters output but never in the mmdiag --deadlock
output. From now on in this topic, the word waiters refers only to those waiters that are monitored by
automated deadlock detection.
Automated deadlock detection flags a waiter as a potential deadlock when the waiter length exceeds a
certain threshold for deadlock detection. For example, the following mmfs.log entry indicates that a
waiter started on thread 8397 at 2015-07-18 09:36:58 passed 905 seconds at Jul 18 09:52:04.626
2015 and is suspected to be a deadlock waiter.

Sat Jul 18 09:52:04.626 2015: [A] Unexpected long waiter detected: Waiting 905.9380 sec since
2015-07-18 09:36:58, on node c33f2in01,
SharedHashTabFetchHandlerThread 8397: on MsgRecordCondvar,
reason 'RPC wait' for tmMsgTellAcquire1

The /var/log/messages file on Linux and the error log on AIX also log an entry for the deadlock
detection, but the mmfs.log file has most details.
The deadlockDetected event is triggered on "Unexpected long waiter detected" and any user program
that is registered for the event is invoked. The user program can be made for recording and notification
purposes. See /usr/lpp/mmfs/samples/deadlockdetected.sample for an example and more
information.
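One way to register such a program is with the mmaddcallback command. The following sketch uses an
illustrative callback identifier and script path:

mmaddcallback deadlockNotify --command /usr/local/bin/notify-deadlock.sh --event deadlockDetected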
When the flagged waiter disappears, an entry like the following one might appear in the mmfs.log file:
Sat Jul 18 10:00:05.705 2015: [N] The unexpected long waiter on thread 8397 has disappeared in 1386 seconds.

The mmdiag --deadlock command shows the flagged waiter and possibly other waiters close behind it
that also passed the threshold for deadlock detection.
If the flagged waiter disappears on its own, without any deadlock breakup actions, then the flagged
waiter is not a real deadlock, and the detection is a false positive. A reasonable threshold needs to be
established to reduce false positive deadlock detection. Consider the trade-off: waiting too long delays
a timely detection, while not waiting long enough causes false-positive detections.
A false positive deadlock detection and debug data collection are not necessarily a waste of resources. A
long waiter, even if it eventually disappears on its own, likely indicates that something is not working well,
and is worth looking into.
The configuration parameter deadlockDetectionThreshold is used to specify the initial threshold for
deadlock detection. GPFS code adjusts the threshold on each node based on conditions on the node and
in the cluster. The adjusted threshold is the effective threshold used in automated deadlock detection.

An internal algorithm is used to evaluate whether a cluster is overloaded or not. Overload is a factor that
influences the adjustment of the effective deadlock detection threshold. The effective deadlock detection
threshold and the cluster overload index are shown in the output of the mmdiag --deadlock command.

Effective deadlock detection threshold on c37f2n04 is 1000 seconds


Effective deadlock detection threshold on c37f2n04 is 430 seconds for short waiters
Cluster my.cluster is overloaded. The overload index on c40bbc2xn2 is 1.14547

If deadlockDetectionThresholdForShortWaiters is positive, and it is by default, certain waiters,
including most of the mutex waiters, are considered short waiters that should not be long. These short
waiters are monitored separately, and their effective deadlock detection threshold is also self-tuned
separately.
The overload index is the weighted average duration of all I/Os completed over a long time. Recent
I/O durations count more than the ones in the past. The cluster overload detection affects deadlock
amelioration functions only. The determination by GPFS that a cluster is overloaded is not necessarily
the same as the determination by a customer. However, customers might use the determination by GPFS as a
reference and check the workload, hardware, and network of the cluster to see whether anything needs
correction or adjustment. An overloaded cluster with a workload that far exceeds its resource capability is
neither healthy nor productive.
If the existing effective deadlock detection threshold value is no longer appropriate for the workload, run
the mmfsadm resetstats command to restart the local adjustment.
To view the current value of deadlockDetectionThreshold and
deadlockDetectionThresholdForShortWaiters, which are the initial thresholds for deadlock
detection, enter the following command:

mmlsconfig deadlockDetectionThreshold
mmlsconfig deadlockDetectionThresholdForShortWaiters

The system displays output similar to the following:

deadlockDetectionThreshold 300
deadlockDetectionThresholdForShortWaiters 60

To disable automated deadlock detection, specify a value of 0 for deadlockDetectionThreshold.
All deadlock amelioration functions, not just deadlock detection, are disabled by
specifying 0 for deadlockDetectionThreshold. A positive value must be specified for
deadlockDetectionThreshold to enable any part of the deadlock amelioration functions.
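The initial threshold can be changed with the mmchconfig command. For example, to raise it to 500 seconds
(the value is an example only):

mmchconfig deadlockDetectionThreshold=500

To disable all deadlock amelioration functions:

mmchconfig deadlockDetectionThreshold=0
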

Automated deadlock data collection


Automated deadlock data collection gathers crucial debug data when a potential deadlock is detected.
Messages similar to the following ones are written to the mmfs.log file:

Sat Jul 18 09:52:04.626 2015: [A] Unexpected long waiter detected:
2015-07-18 09:36:58: waiting 905.938 seconds on node c33f2in01:
SharedHashTabFetchHandlerThread 8397: on MsgRecordCondvar,
reason 'RPC wait' for tmMsgTellAcquire1
Sat Jul 18 09:52:04.627 2015: [I] Initiate debug data collection from
this node.
Sat Jul 18 09:52:04.628 2015: [I] Calling User Exit Script
gpfsDebugDataCollection: event deadlockDebugData,
Async command /usr/lpp/mmfs/bin/mmcommon.

What debug data is collected depends on the value of the configuration parameter debugDataControl.
The default value is light, which collects a minimum amount of debug data, the data that is most
frequently needed to debug a GPFS issue. The value medium collects more debug data. The value
heavy is meant to be used routinely by internal test teams only. The value verbose is needed only
for troubleshooting special cases and can result in very large dumps. No debug data is collected when
the value none is specified. You can set different values for the debugDataControl parameter across
nodes in the cluster. For more information, see the topic mmchconfig command in the IBM Storage Scale:
Command and Programming Reference Guide.
Automated deadlock data collection is enabled by default and controlled by the configuration parameter
deadlockDataCollectionDailyLimit. This parameter specifies the maximum number of times
debug data can be collected in a 24-hour period by automated deadlock data collection.
To view the current value of deadlockDataCollectionDailyLimit, enter the following command:

mmlsconfig deadlockDataCollectionDailyLimit

The system displays output similar to the following:

deadlockDataCollectionDailyLimit 3

To disable automated deadlock data collection, specify a value of 0 for
deadlockDataCollectionDailyLimit.
Another configuration parameter, deadlockDataCollectionMinInterval, is used to control the
minimum amount of time between consecutive debug data collections. The default is 3600 seconds
or 1 hour.
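These data collection settings can be changed with the mmchconfig command. The following lines are a sketch
only; the values are examples, not recommendations:

mmchconfig debugDataControl=medium
mmchconfig deadlockDataCollectionDailyLimit=10
mmchconfig deadlockDataCollectionMinInterval=7200
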

Automated deadlock breakup


Automated deadlock breakup helps resolve a deadlock situation without human intervention. To break up
a deadlock, less disruptive actions are tried first; for example, causing a file system panic. If necessary,
more disruptive actions are then taken; for example, shutting down a GPFS mmfsd daemon.
If a system administrator prefers to control the deadlock breakup process, the deadlockDetected
callback can be used to notify system administrators that a potential deadlock was detected. The
information from the mmdiag --deadlock section can then be used to help determine what steps
to take to resolve the deadlock.
Automated deadlock breakup is disabled by default and controlled with the mmchconfig attribute
deadlockBreakupDelay. The deadlockBreakupDelay attribute specifies how long to wait after a
deadlock is detected before attempting to break up the deadlock. Enough time must be provided to allow
the debug data collection to complete. To view the current breakup delay, enter the following command:

mmlsconfig deadlockBreakupDelay

The system displays output similar to the following:

deadlockBreakupDelay 0

The value of 0 shows that automated deadlock breakup is disabled. To enable automated deadlock
breakup, specify a positive value for deadlockBreakupDelay. If automated deadlock breakup is to be
enabled, a delay of 300 seconds or longer is recommended.
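For example, to enable automated deadlock breakup with a conservative delay, a command similar to the
following can be used; the value shown is an example only:

mmchconfig deadlockBreakupDelay=3600
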
Automated deadlock breakup is done on a node-by-node basis. If automated deadlock breakup is
enabled, the breakup process is started when the suspected deadlock waiter is detected on a node.
The process first waits for the deadlockBreakupDelay, and then goes through various phases until the
deadlock waiters disappear. There is no central coordination on the deadlock breakup, so the time to take
deadlock breakup actions may be different on each node. Breaking up a deadlock waiter on one node can
cause some deadlock waiters on other nodes to disappear, so no breakup actions need to be taken on
those other nodes.
If a suspected deadlock waiter disappears while waiting for the deadlockBreakupDelay, the
automated deadlock breakup process stops immediately without taking any further action. To lessen
the number of breakup actions that are taken in response to detecting a false-positive deadlock, increase

the deadlockBreakupDelay. If you decide to increase the deadlockBreakupDelay, a deadlock can
potentially exist for a longer period.
If your goal is to break up a deadlock as soon as possible, and your workload can afford an interruption
at any time, then enable automated deadlock breakup from the beginning. Otherwise, keep automated
deadlock breakup disabled to avoid unexpected interruptions to your workload. In this case, you can
choose to break the deadlock manually, or use the function that is described in the “Deadlock breakup on
demand” on page 337 topic.
Due to the complexity of the GPFS code, asserts or segmentation faults might happen during a deadlock
breakup action. That might cause unwanted disruptions to a customer workload still running normally on
the cluster. A good reason to use deadlock breakup on demand is to not disturb a partially working
cluster until it is safe to do so. Try not to break up a suspected deadlock prematurely to avoid
unnecessary disruptions. If automated deadlock breakup is enabled all of the time, it is good to set
deadlockBreakupDelay to a large value such as 3600 seconds. If using mmcommon breakDeadlock,
it is better to wait until the longest deadlock waiter is an hour or longer. Much shorter times can be used if
a customer prefers fast action in breaking a deadlock over assurance that a deadlock is real.
The following messages, related to deadlock breakup, might be found in the mmfs.log files:

[I] Enabled automated deadlock breakup.

[N] Deadlock breakup: starting in 300 seconds

[N] Deadlock breakup: aborting RPC on 1 pending nodes.

[N] Deadlock breakup: panicking fs fs1

[N] Deadlock breakup: shutting down this node.

[N] Deadlock breakup: the process has ended.

Deadlock breakup on demand


Deadlocks can be broken up on demand, which allows a system administrator to choose the appropriate
time to start the breakup actions.
A deadlock can be localized; for example, it might involve only one of many file systems in a cluster. The
other file systems in the cluster can still be used, and a mission critical workload might need to continue
uninterrupted. In these cases, the best time to break up the deadlock is after the mission critical workload
ends.
The mmcommon command can be used to break up an existing deadlock in a cluster when the deadlock
was previously detected by deadlock amelioration. To start the breakup on demand, use the following
syntax:

mmcommon breakDeadlock [-N {Node[,Node...] | NodeFile | NodeClass}]

If the mmcommon breakDeadlock command is issued without the -N parameter, then every node in the
cluster receives a request to take action on any long waiter that is a suspected deadlock.
If the mmcommon breakDeadlock command is issued with the -N parameter, then only the nodes
that are specified receive a request to take action on any long waiter that is a suspected deadlock. For
example, assume that there are two nodes, called node3 and node6, that require a deadlock breakup. To
send the breakup request to just these nodes, issue the following command:

mmcommon breakDeadlock -N node3,node6

Shortly after running the mmcommon breakDeadlock command, issue the following command:

mmdsh -N all /usr/lpp/mmfs/bin/mmdiag --deadlock

The output of the mmdsh command can be used to determine if any deadlock waiters still exist and if any
additional actions are needed.
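Because a single request might not clear every suspected deadlock waiter, the check can be repeated. The
following shell loop is a minimal sketch only; it assumes that suspected deadlock waiters are reported as
lines that contain the string Waiting, which should be verified against the mmdiag --deadlock output of
your release, and the sleep interval is arbitrary:

while mmdsh -N all /usr/lpp/mmfs/bin/mmdiag --deadlock | grep -q Waiting
do
    mmcommon breakDeadlock
    sleep 600
done
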

The effect of the mmcommon breakDeadlock command only persists on a node until the longest
deadlock waiter that was detected disappears. All actions that are taken by mmcommon breakDeadlock
are recorded in the mmfs.log file. When mmcommon breakDeadlock is issued for a node that did not
have a deadlock, no action is taken except for recording the following message in the mmfs.log file:

[N] Received deadlock breakup request from 192.168.40.72: No deadlock to break up.

The mmcommon breakDeadlock command provides more control over breaking up deadlocks, but
multiple breakup requests might be required to achieve satisfactory results. Not all waiters that
exceeded the deadlockDetectionThreshold disappear when mmcommon breakDeadlock completes
on a node. In complicated deadlock scenarios, some long waiters can persist after the longest
waiters disappear. Waiter length can grow to exceed the deadlockDetectionThreshold at any
point, and waiters can disappear at any point as well. Examine the waiter situation after mmcommon
breakDeadlock completes to determine whether the command must be repeated to break up the
deadlock.
Another way to break up a deadlock on demand is to enable automated deadlock breakup by changing
deadlockBreakupDelay to a positive value. By enabling automated deadlock breakup, breakup actions
are initiated on existing deadlock waiters. The breakup actions repeat automatically if deadlock waiters
are detected. Change deadlockBreakupDelay back to 0 when the results are satisfactory, or when you
want to control the timing of deadlock breakup actions again. If automated deadlock breakup remains
enabled, breakup actions start on any newly detected deadlocks without any intervention.

Chapter 20. Installation and configuration issues
You might encounter errors with GPFS installation, configuration, and operation. Use the information in
this topic to help you identify and correct errors.
An IBM Storage Scale installation problem should be suspected in the following scenarios:
• The GPFS modules are not loaded successfully.
• Commands do not work on the node that you are working on or on the other nodes.
• New command operands that are added with a new release of IBM Storage Scale are not recognized.
• Issues are found with the Kernel extension.
A GPFS configuration problem should be suspected in the following scenarios:
• The GPFS daemon does not activate, it does not remain active, or it fails on some nodes but not on
others.
• Quorum is lost, certain nodes appear to hang or do not communicate properly with GPFS, and nodes
cannot be added to the cluster or are expelled.
• GPFS performance is noticeably degraded after a new release of GPFS is installed or configuration
parameters are changed.

Resolving most frequent problems related to installation, deployment, and upgrade
Use the following information to resolve the most frequent problems related to installation, deployment,
and upgrade.

Finding deployment related error messages more easily and using them for
failure analysis
Use this information to find and analyze error messages related to installation, deployment, and upgrade
from the respective logs when using the installation toolkit.
In case of any installation, deployment, and upgrade related error:
1. Go to the end of the corresponding log file and search upwards for the text FATAL.
2. Find the topmost occurrence of FATAL (or first FATAL error that occurred) and look above and below
this error for further indications of the failure.

Problems due to missing prerequisites


Use this information to ensure that prerequisites are met before using the installation toolkit for
installation, deployment, and upgrade.
• “Passwordless SSH setup” on page 339
• “Repository setup” on page 340
• “Firewall configuration” on page 340
• “CES IP address allocation” on page 341
• “Addition of CES IPs to /etc/hosts” on page 342

Passwordless SSH setup


The installation toolkit performs verification during the precheck phase to ensure that passwordless SSH
is set up correctly. You can manually verify and set up passwordless SSH as follows.
1. Verify that passwordless SSH is set up by using the following commands.

a. Verify that the user can log into the node by using the host name of the node successfully without
being prompted for any input and that there are no warnings.

ssh HostNameofFirstNode
ssh HostNameofSecondNode

b. Verify that the user can log into the node by using the FQDN of the node successfully without being
prompted for any input and that there are no warnings.

ssh FQDNofFirstNode
ssh FQDNofSecondNode

Repeat this on all nodes.


c. Verify that the user can log into the node successfully by using the IP address of the node without
being prompted for any input and that there are no warnings.

ssh IPAddressofFirstNode
ssh IPAddressofSecondNode

Repeat this on all nodes.


2. If needed, set up passwordless SSH using the following commands.
Note: This is one of several possible ways of setting up passwordless SSH.
a. Generate the SSH key.

ssh-keygen

Repeat this command on all nodes.


b. Run the following commands.

ssh-copy-id FQDNofFirstNode
ssh-copy-id FQDNofSecondNode

Repeat this step on all nodes.


c. Run the following commands.

ssh-copy-id HostNameofFirstNode
ssh-copy-id HostNameofSecondNode

Repeat this step on all nodes.


d. Run the following commands.

ssh-copy-id IPAddressofFirstNode
ssh-copy-id IPAddressofSecondNode

Repeat this step on all nodes.
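If the cluster has many nodes, the key distribution in the preceding steps can be wrapped in a small loop.
The following is a minimal sketch with hypothetical host names; replace them with the host names, FQDNs,
and IP addresses of your nodes and run it from each node:

for node in node1.example.com node2.example.com node3.example.com
do
    ssh-copy-id "$node"
done
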

Repository setup
• Verify that the repository is set up depending on your operating system. For example, verify that the
yum repository is set up by using the following command on all cluster nodes.

yum repolist

This command should run cleanly with no errors if the yum repository is set up correctly.

Firewall configuration
It is recommended that firewalls are in place to secure all nodes. For more information, see Securing the
IBM Storage Scale system using firewall in IBM Storage Scale: Administration Guide.
• If you need to open specific ports, use the following steps on Red Hat Enterprise Linux nodes.

1. Check the firewall status.

systemctl status firewalld

2. Open ports required by the installation toolkit.

firewall-cmd --permanent --add-port 8889/tcp
firewall-cmd --add-port 8889/tcp
firewall-cmd --permanent --add-port 10080/tcp
firewall-cmd --add-port 10080/tcp
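
After the ports are opened, the active firewall configuration can be verified, for example, as follows:

firewall-cmd --list-ports
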

CES IP address allocation


As part of the deployment process, IBM Storage Scale checks routing on the cluster and applies CES
IPs as aliases on each protocol node. Furthermore, during service actions or failovers, nodes dynamically lose
the alias IPs as they go down and other nodes gain additional aliases to hold all of the IPs passed to them
from the down nodes.
Example - Before deployment
The only address here is 192.168.251.161, which is the ssh address for the node. It is held by the eth0
adapter.

# ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>mtu 1500
inet 192.168.251.161 netmask 255.255.254.0 broadcast 192.168.251.255
inet6 2002:90b:e006:84:250:56ff:fea5:1d86 prefixlen 64 scopeid 0x0<global>
inet6 fe80::250:56ff:fea5:1d86 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)
RX packets 1978638 bytes 157199595 (149.9 MiB)
RX errors 0 dropped 2291 overruns 0 frame 0
TX packets 30884 bytes 3918216 (3.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

# ip addr
2: eth0:<BROADCAST,MULTICAST,UP,LOWER_UP>mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a5:1d:86 brd ff:ff:ff:ff:ff:ff
inet 192.168.251.161/23 brd 192.168.251.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 2002:90b:e006:84:250:56ff:fea5:1d86/64 scope global dynamic
valid_lft 2591875sec preferred_lft 604675sec
inet6 fe80::250:56ff:fea5:1d86/64 scope link
valid_lft forever preferred_lft forever

Example - After deployment


Now that the CES IP addresses exist, you can see that aliases called eth0:0 and eth0:1 have been
created and the CES IP addresses specific to this node have been tagged to it. This allows the ssh IP
of the node to exist at the same time as the CES IP address on the same adapter, if necessary. In this
example, 192.168.251.161 is the initial ssh IP. The CES IP 192.168.251.165 is aliased onto eth0:0 and
the CES IP 192.168.251.166 is aliased onto eth0:1. This occurs on all protocol nodes that are assigned
a CES IP address. NSD server nodes or any client nodes that do not have protocols installed on them do
not get a CES IP.
Furthermore, during service actions or failovers, nodes dynamically lose the alias IPs as they go down and
other nodes gain additional aliases such as eth0:1 and eth0:2 to hold all of the IPs passed to them
from the down nodes.

# ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.251.161 netmask 255.255.254.0 broadcast 192.168.251.255
inet6 2002:90b:e006:84:250:56ff:fea5:1d86 prefixlen 64 scopeid 0x0<global>
inet6 fe80::250:56ff:fea5:1d86 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)
RX packets 2909840 bytes 1022774886 (975.3 MiB)
RX errors 0 dropped 2349 overruns 0 frame 0
TX packets 712595 bytes 12619844288 (11.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>mtu 1500
inet 192.168.251.165 netmask 255.255.254.0 broadcast 192.168.251.255
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)

eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>mtu 1500
inet 192.168.251.166 netmask 255.255.254.0 broadcast 192.168.251.255
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)

# ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP>mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a5:1d:86 brd ff:ff:ff:ff:ff:ff
inet 192.168.251.161/23 brd 9.11.85.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.251.165/23 brd 9.11.85.255 scope global secondary eth0:0
valid_lft forever preferred_lft forever
inet 192.168.251.166/23 brd 9.11.85.255 scope global secondary eth0:1
valid_lft forever preferred_lft forever
inet6 2002:90b:e006:84:250:56ff:fea5:1d86/64 scope global dynamic
valid_lft 2591838sec preferred_lft 604638sec
inet6 fe80::250:56ff:fea5:1d86/64 scope link
valid_lft forever preferred_lft forever

Addition of CES IPs to /etc/hosts


Although it is highly recommended that all CES IPs are maintained in a central DNS and that they are
accessible using both forward and reverse DNS lookup, there are times when this might not be possible.
IBM Storage Scale always verifies that forward or reverse DNS lookup is possible. To satisfy this check
without a central DNS server containing the CES IPs, you must add the CES IPs to /etc/hosts and
create a host name for them within /etc/hosts. The following example shows how a cluster might have
multiple networks, nodes, and IPs defined.
For example:

# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

# These are external addresses for GPFS


# Use these for ssh in. You can also use these to form your GPFS cluster if you choose
198.51.100.2 ss-deploy-cluster3-1.example.com ss-deploy-cluster3-1
198.51.100.4 ss-deploy-cluster3-2.example.com ss-deploy-cluster3-2
198.51.100.6 ss-deploy-cluster3-3.example.com ss-deploy-cluster3-3
198.51.100.9 ss-deploy-cluster3-4.example.com ss-deploy-cluster3-4

# These are addresses for the base adapter used to alias CES-IPs to.
# Do not use these as CES-IPs.
# You could use these for a gpfs cluster if you choose
# Or you could leave these unused as placeholders
203.0.113.7 ss-deploy-cluster3-1_ces.example.com ss-deploy-cluster3-1_ces
203.0.113.10 ss-deploy-cluster3-2_ces.example.com ss-deploy-cluster3-2_ces
203.0.113.12 ss-deploy-cluster3-3_ces.example.com ss-deploy-cluster3-3_ces
203.0.113.14 ss-deploy-cluster3-4_ces.example.com ss-deploy-cluster3-4_ces

# These are addresses to use for CES-IPs


203.0.113.17 ss-deploy-cluster3-ces.example.com ss-deploy-cluster3-ces
203.0.113.20 ss-deploy-cluster3-ces.example.com ss-deploy-cluster3-ces
203.0.113.21 ss-deploy-cluster3-ces.example.com ss-deploy-cluster3-ces
203.0.113.23 ss-deploy-cluster3-ces.example.com ss-deploy-cluster3-ces

In this example, the first two sets of addresses have unique host names and the third set of addresses
that are associated with CES IPs are not unique. Alternatively, you could give each CES IP a unique host
name but this is an arbitrary decision because only the node itself can see its own /etc/hosts file.
Therefore, these host names are not visible to external clients/nodes unless they too contain a mirror
copy of the /etc/hosts file. The reason for containing the CES IPs within the /etc/hosts file is solely
to satisfy the IBM Storage Scale CES network verification checks. Without this, in cases with no DNS
server, CES IPs cannot be added to a cluster.
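After updating /etc/hosts, you can verify that the CES IP addresses and host names resolve locally, for
example with the getent command. The address and name shown are the example values from the preceding
listing:

getent hosts 203.0.113.17
getent hosts ss-deploy-cluster3-ces
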

Problems due to mixed operating system levels in the cluster


Use the following guidelines to avoid problems due to mixed operating system levels in an IBM Storage
Scale cluster.
For latest information about supported operating systems, see IBM Storage Scale FAQ in IBM
Documentation.

Verify that the installation toolkit is configured to operate only on supported nodes by using the following
command:

./spectrumscale node list

If any of the listed nodes are of an unsupported OS type, then they need to be removed by using the
following command:

./spectrumscale node delete node

If the node to be removed is an NSD node, then you might have to manually create NSDs and file systems
before using the installation toolkit.
The installation toolkit does not need to be made aware of preexisting file systems and NSDs that
are present on unsupported node types. Ensure that the file systems are mounted before running the
installation toolkit and that they point at the mount points or directory structures.
For information about how the installation toolkit can be used in a cluster that has nodes with mixed
operating systems, see Mixed operating system support with the installation toolkit in IBM Storage
Scale: Concepts, Planning, and Installation Guide.

Upgrades in a mixed OS cluster


Upgrades in a mixed OS cluster need to be performed carefully due to a mix of manual and automated
steps. In this case, the installation toolkit can be made aware of a list of nodes that are running on
supported OS that are to be upgraded. It can then upgrade these nodes. However, the remaining nodes
need to be upgraded manually.

Problems due to using the installation toolkit for functions or configurations not supported
Use this information to determine node types, setups, and functions that are supported by the installation
toolkit, and to understand how to use the toolkit if a setup is not fully supported.
• “Support for mixed mode of install, deploy, or upgrade” on page 343
• “Support for DMAPI enabled nodes” on page 345
• “Support for ESS cluster” on page 346

Support for mixed mode of install, deploy, or upgrade


You want to use the installation toolkit but you already have an existing cluster. Can the installation
toolkit auto-detect your cluster or do you need to manually configure the toolkit?
The installation toolkit is stateless and it does not import an existing cluster configuration into its
cluster definition file. As a workaround to this scenario, use the steps that are mentioned in the
following sections of the IBM Storage Scale: Concepts, Planning, and Installation Guide.
• Deploying protocols on an existing cluster
• Adding nodes, NSDs, or file systems to an existing installation
• Enabling another protocol on an existing cluster that has protocols enabled
If NSDs and file systems exist, you do not need to provide that information to the installation toolkit.
What are valid starting scenarios for which the installation toolkit can be used for an installation, a
deployment, or an upgrade?
• Scenario: No cluster exists and no GPFS RPMs exist on any nodes.
  Installation toolkit support: The installation toolkit can be used to install GPFS and create a cluster.
• Scenario: No cluster exists and GPFS RPMs are already installed on nodes.
  Installation toolkit support: The installation toolkit can be used to install GPFS and create a cluster.
• Scenario: No cluster exists.
  Installation toolkit support: The installation toolkit can be used to configure NTP during GPFS installation and cluster configuration.
• Scenario: A cluster exists.
  Installation toolkit support: The installation toolkit can be used to add NSDs.
• Scenario: A cluster exists.
  Installation toolkit support: The installation toolkit can be used to add nodes (manager, quorum, admin, nsd, protocol, gui).
• Scenario: A cluster exists and NSDs exist.
  Installation toolkit support: The installation toolkit can be used to add file systems.
• Scenario: A cluster exists and some NSDs exist.
  Installation toolkit support: The installation toolkit can be used to add more NSDs.
• Scenario: A cluster exists and some protocols are enabled.
  Installation toolkit support: The installation toolkit can be used to enable more protocols.
• Scenario: A cluster exists and performance monitoring is enabled.
  Installation toolkit support: The installation toolkit can be used to reconfigure performance monitoring.
• Scenario: An ESS cluster exists and protocol nodes are added.
  Installation toolkit support: The installation toolkit can be used to add protocols to protocol nodes.
• Scenario: SLES 12, Windows, Ubuntu 16.04, RHEL 7.1, and AIX nodes exist along with RHEL 8.x, Ubuntu 20.04, and SLES 15 nodes.
  Installation toolkit support: The installation toolkit can be used only on RHEL 8.x and 7.x (7.7 and later), Ubuntu 20.04, and SLES 15 nodes.
• Scenario: A cluster is at mixed levels.
  Installation toolkit support: The installation toolkit can be used to upgrade all nodes or a subset of nodes to a common code level.

What are invalid starting scenarios for the installation toolkit?


• NSDs were not cleaned up or deleted before a cluster deletion.
• Unsupported node types were added to the installation toolkit.
• File systems or NSDs are served by unsupported node types.
The installation toolkit cannot add or change the configuration. It can use only the file system paths
for protocol configuration.
• An ESS cluster exists and protocol nodes have not yet been added to the cluster.
Protocol nodes must first be added to the ESS cluster before the installation toolkit can install the
protocols.
Does the installation toolkit need to have my entire cluster information?
No, but this depends on the use case. Here are some examples in which the installation toolkit does
not need to be made aware of the configuration information of an existing cluster:
• Deploying protocols on protocol nodes: The installation toolkit needs only the protocol nodes
information and that they are configured to point to cesSharedRoot.
• Upgrading protocol nodes: The installation toolkit can upgrade a portion of the cluster such as all
protocol nodes. In this case, it does not need to be made aware of other NSD or client/server nodes
within the cluster.
• Adding protocols to an ESS cluster: The installation toolkit does not need to be made aware of the
EMS or I/O nodes. The installation toolkit needs only the protocol nodes information and that they
are configured to point to cesSharedRoot.
• Adding protocols to a cluster with AIX, SLES 12, Debian, RHEL6, and Windows nodes: The
installation toolkit does not need to be made aware of any nodes except for the RHEL 8.x, RHEL

7.x (7.7 or later), Ubuntu 20.04, and SLES 15 protocol nodes. The installation toolkit needs only the
protocol nodes information and that they are configured to point to cesSharedRoot.
Can the installation toolkit act on some protocol nodes but not all?
Protocol nodes must always be treated as a group of nodes. Therefore, do not use the installation
toolkit to run install, deploy, or upgrade commands on a subset of protocol nodes.

Support for DMAPI enabled nodes


On nodes with DMAPI enabled, the installation toolkit provides limited help to users in case of an error,
including whether a DMAPI-related function is supported or unsupported.
Use the following steps to verify whether DMAPI is enabled on your nodes and to use the installation
toolkit on DMAPI enabled nodes.
1. Verify that DMAPI is enabled on a file system by using the following command:

# mmlsfs all -z
File system attributes for /dev/fs1:
====================================
flag value description
------------------- ------------------------ -----------------------------------
-z yes Is DMAPI enabled?

2. Shut down all functions that are using DMAPI and unmount DMAPI by using the following steps:
a. Shut down all functions that are using DMAPI. This includes HSM policies and IBM Spectrum
Archive.
b. Unmount the DMAPI file system from all nodes by using the following command:

# mmumount fs1 -a

Note: If the DMAPI file system is also the CES shared root file system, then you must first shut
down GPFS on all protocol nodes before unmounting the file system.
i) Check whether the DMAPI file system is also the CES shared root file system, use the following
command:

# mmlsconfig | grep cesSharedRoot

ii) Compare the output of this command with that of Step 1 to determine whether the CES shared
root file system has DMAPI enabled.
iii) Shut down GPFS on all protocol nodes by using the following command:

# mmshutdown -N cesNodes

c. Disable DMAPI by using the following command:

# mmchfs fs1 -z no

3. If GPFS was shut down on the protocol nodes in one of the preceding steps, start GPFS on the protocol
nodes by using the following command:

# mmstartup -N cesNodes

4. Remount the file system on all nodes by using the following command:

# mmmount fs1 -a

5. Proceed with using the installation toolkit; it can now be used on all file systems.
6. After the task that is done by using the installation toolkit is completed, enable DMAPI by using the
following steps:
a. Unmount the DMAPI file system from all nodes.

Note: If the DMAPI file system is also the CES shared root file system, shut down GPFS on all
protocol nodes before unmounting the file system.
b. Enable DMAPI by using the following command:

# mmchfs fs1 -z yes

c. Start GPFS on all protocol nodes.


d. Remount the file system on all nodes.
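
Put together, the re-enablement sequence is similar to the following sketch. It assumes the same example
file system fs1 as in the preceding steps; the mmshutdown step is needed only if the DMAPI file system is
also the CES shared root file system:

# mmshutdown -N cesNodes
# mmumount fs1 -a
# mmchfs fs1 -z yes
# mmstartup -N cesNodes
# mmmount fs1 -a
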

Support for ESS cluster


For information on using the installation toolkit with a cluster containing ESS, see the following topics in
IBM Storage Scale: Concepts, Planning, and Installation Guide:
• ESS awareness with the installation toolkit
• Preparing a cluster that contains ESS for adding protocols
• Deploying protocols on an existing cluster

Understanding supported upgrade functions with installation toolkit


Use this information to understand the setups in which upgrade can be done using the installation toolkit.
• “Scope of the upgrade process” on page 346
• “Understanding implications of a failed upgrade” on page 346

Scope of the upgrade process


The upgrade process using the installation toolkit can be summarized as follows:
• The upgrade process acts upon all nodes specified in the cluster definition file (typically using the ./
spectrumscale node add commands).
• All installed/deployed components are upgraded.
• Upgrades are sequential with multiple passes.
The upgrade process using the installation toolkit comprises following passes:
1. Pass 1 of all nodes upgrades GPFS sequentially.
2. Pass 2 of all nodes upgrades Object sequentially.
3. Pass 3 of all nodes upgrades NFS sequentially.
4. Pass 4 of all nodes upgrades SMB sequentially.
5. A post check is done to verify a healthy cluster state after the upgrade.
As an upgrade moves sequentially across nodes, functions such as SMB, NFS, Object, Performance
Monitoring, AFM, etc. undergo failovers. This might cause outages on the nodes being upgraded.
Upgrading a subset of nodes is possible because the installation toolkit acts only on the nodes specified
in the cluster definition file. If you want to upgrade a subset of cluster nodes, be aware of the node types
and the functions being performed on these nodes. For example, all protocol nodes within a cluster must
be upgraded by the installation toolkit in one batch.
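As a sketch, upgrading the nodes that are currently in the cluster definition file typically follows a flow
similar to the following commands; the exact subcommands and options can vary by release, so verify them
against the installation toolkit documentation for your level:

./spectrumscale node list
./spectrumscale upgrade precheck
./spectrumscale upgrade run
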

Understanding implications of a failed upgrade


A failed upgrade might leave a cluster in a state of containing multiple code levels. It is important to
analyze console output to determine which nodes or components were upgraded prior to the failure and
which node or component was in the process of being upgraded when the failure occurred.
Once the problem has been isolated, a healthy cluster state must be achieved prior to continuing the
upgrade. Use the mmhealth command in addition to the mmces state show -a command to verify
that all services are up. It might be necessary to manually start services that were down when the

upgrade failed. Starting the services manually helps achieve a state in which all components are healthy
prior to continuing the upgrade.
For more information about verifying service status, see mmhealth command and mmces state show
command in IBM Storage Scale: Command and Programming Reference Guide.
If an upgrade using the installation toolkit fails, you can rerun the upgrade. For more information, see
Upgrade rerun after an upgrade failure in IBM Storage Scale: Concepts, Planning, and Installation Guide.

Installation toolkit setup command fails after upgrade to Ubuntu 22.04
The installation toolkit setup command might fail after the upgrade to Ubuntu 22.04.

File "/usr/local/bin/ansible", line 34, in <module>


from ansible import context
ModuleNotFoundError: No module named 'ansible'
[ FATAL ] Ansible is not installed on test-41.openstacklocal, Please install the required
Ansible version 2.9.15 then continue.

The error occurs because the Ansible package might get removed after the upgrade to Ubuntu 22.04.
Resolve this issue as follows.
1. Manually install Ansible 2.9.15.

pip3 install ansible==2.9.15

2. Retry the installation by using the installation toolkit.

Installation toolkit fails with Python not found error


The installation toolkit might fail due to a Python not found error.

"module_stdout": "/bin/sh: 1: /usr/libexec/platform-python: not found

The error might occur because the path of the required Python version is not set correctly.
Resolve this issue as follows.
1. Set the path of the required Python version correctly.

ln --symbolic /usr/bin/python3 /usr/bin/python

2. Retry the installation by using the installation toolkit.

Installation toolkit fails on Ubuntu 20.04.4 nodes with Ansible related error
The installation toolkit might fail on Ubuntu 20.04.4 nodes due to an Ansible related error.
The installation failure occurs when the Ubuntu 20.04.4 nodes have the default Ansible version 2.9.15
installed.
Resolve this issue as follows.
1. Install the latest Ansible 2.9.x version on the Ubuntu 20.04.4 node.

pip3 install --upgrade ansible==2.9.*

2. Retry the installation by using the installation toolkit.

Installation toolkit Ansible package troubleshooting when Ansible is already installed on Red Hat
Enterprise Linux® 9.0 and Red Hat Enterprise Linux 8.6 or later
Red Hat Enterprise Linux 9.0 and Red Hat Enterprise Linux 8.6 or later nodes have the ansible-core package
in the repository by default. If you install Ansible by using yum or dnf, the ansible-core package is taken
from the AppStream repository. However, this package does not work with the installation toolkit because it
provides only basic functionality and lacks the additional collections and modules that are used in the
toolkit's Ansible code. If you want to use the repository-based ansible-core package and do not want to
remove it in favor of the Ansible package that is shipped with the toolkit, install the required Ansible
collections through the ansible-galaxy command:

ansible-galaxy collection install community.general
ansible-galaxy collection install ansible.posix

Installation toolkit fails if running yum commands results in warnings or errors
The installation toolkit might fail to install any packages if running yum commands results in warnings or
errors due to a yum subscription manager issue.
Resolve this issue as follows:
1. Confirm that the installation toolkit failure is due to the yum subscription manager issue as follows.
a. Run a yum command such as yum repolist.
b. Check if the output of the yum command contains warnings such as the following snippet.

Plugin "product-id" can't be imported


Plugin "search-disabled-repos" can't be imported
Plugin "subscription-manager" can't be imported

2. Fix the yum environment issue such that there are no warnings or errors in the output of yum
commands.
3. Retry the installation by using the installation toolkit.

Installation toolkit operation fails with PKey parsing or OpenSSH keys related errors
The installation toolkit operation might fail with a PKey parsing-related error on SLES or an OpenSSH keys
related error on Red Hat Enterprise Linux or Ubuntu.
The error message on SLES is similar to the following message:

ArgumentError: Could not parse PKey: no start line\n'

The error message on Red Hat Enterprise Linux or Ubuntu is similar to the following message:

ERROR: NotImplementedError: OpenSSH keys only supported if ED25519 is available
net-ssh requires the following gems for ed25519 support

Workaround:
1. Do one of the following steps:
• Create a fresh SSH key by using the -m PEM option. For example, ssh-keygen -m PEM
• Convert the existing SSH key by using the following command.

ssh-keygen -p -N "" -m pem -f /root/.ssh/id_rsa

2. Retry the installation procedure by using the installation toolkit.

Installation toolkit setup fails with an ssh-agent related error


The installation toolkit setup might fail with an ssh-agent related error.
The error message is similar to the following:

ERROR: Net::SSH::Authentication::AgentError: could not get identity count

Workaround:
1. Issue the following command on each node added in the installation toolkit cluster definition.

eval "$(ssh-agent)";

2. Retry the installation procedure by using the installation toolkit.

systemctl commands time out during installation, deployment, or upgrade with the installation toolkit
In some environments, systemctl commands such as systemctl daemon-reexec and systemctl
list-unit-files might time out during installation, deployment, or upgrade using the installation
toolkit. This causes the installation, deployment, or upgrade operation to fail.
When this issue occurs, a message similar to the following might be present in the installation toolkit log:

no implicit conversion of false into Array

Workaround:
1. List all the scope files without a directory.

for j in $(ls /run/systemd/system/session*.scope);
do if [[ ! -d /run/systemd/system/$j.d ]];
then echo $j;
fi;
done

2. Remove all the scope files without a directory.

for j in $(ls /run/systemd/system/session*.scope);
do if [[ ! -d /run/systemd/system/$j.d ]];
then rm -f $j;
fi;
done

3. Rerun installation, deployment, or upgrade using the installation toolkit.

Installation toolkit setup on Ubuntu fails due to dpkg database lock issue
The installation toolkit setup on Ubuntu nodes might fail due to a dpkg database lock issue.
Symptom:
The error message might be similar to one of the following:

Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
Unable to lock directory /var/lib/apt/lists/
Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

Workaround:
1. Identify the apt-get process by issuing the following command:

ps -ef | grep apt-get

This process might be running on multiple nodes. Therefore, you might need to issue this command
on each of these nodes. If the installation process failed after the creation of cluster, you can use the
mmdsh command to identify the apt-get process on each node it is running on.

mmdsh ps -ef | grep apt-get

2. Kill the apt-get process on each node it is running on:

sudo kill Process_ID

3. Retry the installation toolkit setup. If the error persists, issue the following commands and then try again:

rm /var/lib/apt/lists/lock
dpkg --configure -a

Installation toolkit config populate operation fails to detect object endpoint
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
The installation toolkit deployment precheck might fail in some cases because the config populate
operation is unable to detect the object endpoint.
However, the deployment precheck identifies this issue and suggests the corrective action.
Workaround
1. Run the following command to add the object endpoint:

./spectrumscale config object -e EndPoint

2. Proceed with the installation, deployment, or upgrade with the installation toolkit.

Post installation and configuration problems


This topic describes the issues that you might encounter after installing or configuring IBM Storage Scale.
The IBM Storage Scale: Concepts, Planning, and Installation Guide provides the step-by-step procedure
for installing and migrating IBM Storage Scale, however, some problems might occur after installation and
configuration if the procedures were not properly followed.

Some of those problems might include:
• Not being able to start GPFS after installation of the latest version. Did you reboot your IBM Storage
Scale nodes before and after the installation/upgrade of IBM Storage Scale? If you did, see “GPFS
daemon does not come up” on page 358. If not, reboot. For more information, see the Initialization of
the GPFS daemon topic in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
• Not being able to access a file system. See “File system fails to mount” on page 377.
• New GPFS functions do not operate. See “GPFS commands are unsuccessful” on page 362.

Cluster crashes after re-installation


This topic describes the steps that you need to perform when a cluster crashes after IBM Storage Scale
re-installation.
After reinstalling IBM Storage Scale code, check whether the /var/mmfs/gen/mmsdrfs file was lost.
If it was lost, and an up-to-date version of the file is present on the primary GPFS cluster configuration
server, restore the file by issuing this command from the node on which it is missing:

mmsdrrestore -p primaryServer

where primaryServer is the name of the primary GPFS cluster configuration server.
If the /var/mmfs/gen/mmsdrfs file is not present on the primary GPFS cluster configuration server, but
it is present on some other node in the cluster, restore the file by issuing these commands:

mmsdrrestore -p remoteNode -F remoteFile


mmchcluster -p LATEST

where remoteNode is the node that has an up-to-date version of the /var/mmfs/gen/mmsdrfs file, and
remoteFile is the full path name of that file on that node.
One way to ensure that the latest version of the /var/mmfs/gen/mmsdrfs file is always available is to
use the mmsdrbackup user exit.
If you have made modifications to any of the user exits in /var/mmfs/etc, you need to restore them
before starting GPFS.
For additional information, see “Recovery from loss of GPFS cluster configuration data file” on page 355.

Node cannot be added to the GPFS cluster


There is an indication that leads you to the conclusion that a node cannot be added to a cluster, and there
are steps to follow to correct the problem.
That indication is:
• You issue the mmcrcluster or mmaddnode command and receive the message:
6027-1598
Node nodeName was not added to the cluster. The node appears to already belong to a GPFS
cluster.
Steps to follow if a node cannot be added to a cluster:
1. Run the mmlscluster command to verify that the node is not in the cluster.
2. If the node is not in the cluster, issue this command on the node that could not be added:

mmdelnode -f

3. Reissue the mmaddnode command.

Problems with the /etc/hosts file
This topic describes the issues relating to the /etc/hosts file that you might come across while
installing or configuring IBM Storage Scale.
The /etc/hosts file must have a unique node name for each node interface to be used by GPFS.
Violation of this requirement results in the message:
6027-1941
Cannot handle multiple interfaces for host hostName.
If you receive this message, correct the /etc/hosts file so that each node interface to be used by GPFS
appears only once in the file.

Linux configuration considerations


This topic describes the Linux configuration that you need to consider while installing or configuring IBM
Storage Scale on your cluster.
Note: This information applies only to Linux nodes.
Depending on your system configuration, you may need to consider:
1. Why can only one host successfully attach to the Fibre Channel loop and see
the Fibre Channel disks?
Your host bus adapter may be configured with an enabled Hard Loop ID that conflicts with other host
bus adapters on the same Fibre Channel loop.
To see if that is the case, reboot your machine and enter the adapter bios with <Alt-Q> when the
Fibre Channel adapter bios prompt appears. Under the Configuration Settings menu, select Host
Adapter Settings and either ensure that the Adapter Hard Loop ID option is disabled or assign a
unique Hard Loop ID per machine on the Fibre Channel loop.
2. Could the GPFS daemon be terminated due to a memory shortage?
The Linux virtual memory manager (VMM) exhibits undesirable behavior for low memory situations on
nodes, where the processes with the largest memory usage are killed by the kernel (using OOM killer),
yet no mechanism is available for prioritizing important processes that should not be initial candidates
for the OOM killer. The GPFS mmfsd daemon uses a large amount of pinned memory in the page
pool for caching data and metadata, and so the mmfsd process is a likely candidate for termination if
memory must be freed up.
3. What are the performance tuning suggestions?
For an up-to-date list of tuning suggestions, see the IBM Storage Scale FAQ in IBM Documentation.
For Linux on Z, see the Device Drivers, Features, and Commands topic in the Linux on Z library
overview.

Python conflicts while deploying object packages using installation toolkit
While deploying object packages using the installation toolkit, you might encounter a dependency conflict
between python-dnspython and python-dns.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.

- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
Symptom:
The error messages can be similar to the following:

[ INFO ] [shepard7lp1.tuc.stglabs.example.com 12-04-2017 16:39:29] IBM SPECTRUM SCALE:
Installing Object packages (SS50)
[ FATAL ] shepard3lp1.tuc.stglabs.example.com failure whilst: Installing Object packages (SS50)
[ WARN ] SUGGESTED ACTION(S):
[ WARN ] Check Object dependencies are available via your package manager or are already met
prior to installation.

Workaround
1. Run the following command to manually remove the conflicting RPM:

yum remove python-dns

2. Retry deploying the object packages.

Problems with running commands on other nodes


This topic describes the problems that you might encounter relating to running remote commands during
installing and configuring IBM Storage Scale.
Many of the GPFS administration commands perform operations on nodes other than the node on which
the command was issued. This is achieved by utilizing a remote invocation shell and a remote file copy
command. By default, these items are /usr/bin/ssh and /usr/bin/scp. You also have the option of
specifying your own remote shell and remote file copy commands to be used instead of the default ssh
and scp. The remote shell and copy commands must adhere to the same syntax forms as ssh and scp
but may implement an alternate authentication mechanism. For more information on the mmcrcluster
and mmchcluster commands, see the mmcrcluster command and the mmchcluster command pages
in the IBM Storage Scale: Command and Programming Reference Guide. These are problems you may
encounter with the use of remote commands.

Authorization problems
This topic describes issues with running remote commands due to authorization problems in IBM Storage
Scale.
The ssh and scp commands are used by GPFS administration commands to perform operations on other
nodes. The ssh daemon (sshd) on the remote node must recognize the command being run and must
obtain authorization to invoke it.
Note: Use the ssh and scp commands that are shipped with the OpenSSH package supported by GPFS.
Refer to the IBM Storage Scale FAQ in IBM Documentation for the latest OpenSSH information.
For more information, see “Problems due to missing prerequisites” on page 339.
For the ssh and scp commands issued by GPFS administration commands to succeed, each node in the
cluster must have an .rhosts file in the home directory for the root user, with file permission set to 600.
This .rhosts file must list each of the nodes and the root user. If such an .rhosts file does not exist
on each node in the cluster, the ssh and scp commands issued by GPFS commands fail with permission
errors, causing the GPFS commands to fail in turn.

If you elected to use installation specific remote invocation shell and remote file copy commands, you
must ensure:
1. Proper authorization is granted to all nodes in the GPFS cluster.
2. The nodes in the GPFS cluster can communicate without the use of a password, and without any
extraneous messages.

Connectivity problems
This topic describes the issues with running GPFS commands on remote nodes due to connectivity
problems.
Another reason why ssh may fail is that connectivity to a needed node has been lost. Error messages
from mmdsh may indicate that connectivity to such a node has been lost. Here is an example:

mmdelnode -N k145n04
Verifying GPFS is stopped on all affected nodes ...
mmdsh: 6027-1617 There are no available nodes on which to run the command.
mmdelnode: 6027-1271 Unexpected error from verifyDaemonInactive: mmcommon onall.
Return code: 1

If error messages indicate that connectivity to a node has been lost, use the ping command to verify
whether the node can still be reached:

ping k145n04
PING k145n04: (119.114.68.69): 56 data bytes
<Ctrl- C>
----k145n04 PING Statistics----
3 packets transmitted, 0 packets received, 100% packet loss

If connectivity has been lost, restore it, then reissue the GPFS command.

GPFS error messages for rsh problems


This topic describes the error messages that are displayed for rsh issues in IBM Storage Scale.
When rsh problems arise, the system may display information similar to these error messages:
6027-1615
nodeName remote shell process had return code value.
6027-1617
There are no available nodes on which to run the command.

Cluster configuration data file issues


This topic describes the issues that you might encounter with respect to the cluster configuration data
files while installing or configuring IBM Storage Scale.

GPFS cluster configuration data file issues


This topic describes the issues relating to IBM Storage Scale cluster configuration data.
GPFS uses a file to serialize access of administration commands to the GPFS cluster configuration data
files. This lock file is kept on the primary GPFS cluster configuration server in the /var/mmfs/gen/
mmLockDir directory. If a system failure occurs before the cleanup of this lock file, the file remains
and subsequent administration commands may report that the GPFS cluster configuration data files are
locked. Besides a serialization lock, certain GPFS commands may obtain an additional lock. This lock is
designed to prevent GPFS from coming up, or file systems from being mounted, during critical sections
of the command processing. If this happens, then you can see a message that shows the name of the
blocking command, similar to message:
6027-1242
GPFS is waiting for requiredCondition.

To release the lock:
1. Determine the PID and the system that owns the lock by issuing:

mmcommon showLocks

The mmcommon showLocks command displays information about the lock server, lock name, lock
holder, PID, and extended information. If the lock is held by a GPFS administration command that is
not responding, stopping that command releases the lock. If the PID now belongs to an unrelated
process, the original GPFS command failed without freeing the lock and the operating system reused
the PID; in that case, do not kill the process.
2. If any locks are held and you want to release them manually, from any node in the GPFS cluster issue
the command:

mmcommon freeLocks <lockName>
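Putting both steps together, the sequence might look like the following sketch; the lock name mmLock
and the PID 12345 are hypothetical examples taken from the mmcommon showLocks output:

mmcommon showLocks           # note the lock name, the holder node, and the PID
ps -p 12345                  # on the holder node, confirm whether the PID still belongs to a GPFS command
mmcommon freeLocks mmLock    # release the lock manually only if it is safe to do so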

GPFS error messages for cluster configuration data file problems


This topic describes the error messages relating to the cluster configuration data file issues in IBM
Storage Scale.
When GPFS commands are unable to retrieve or update the GPFS cluster configuration data files, the
system may display information similar to these error messages:
6027-1628
Cannot determine basic environment information. Not enough nodes are available.
6027-1630
The GPFS cluster data on nodeName is back level.
6027-1631
The commit process failed.
6027-1632
The GPFS cluster configuration data on nodeName is different than the data on nodeName.
6027-1633
Failed to create a backup copy of the GPFS cluster data on nodeName.

Recovery from loss of GPFS cluster configuration data file


This topic describes the procedure for recovering the cluster configuration data file in IBM Storage Scale.
A copy of the IBM Storage Scale cluster configuration data files is stored in the /var/mmfs/gen/
mmsdrfs file on each node. For proper operation, this file must exist on each node in the IBM Storage
Scale cluster. The latest level of this file is guaranteed to be on the primary, and secondary if specified,
GPFS cluster configuration server nodes that were defined when the IBM Storage Scale cluster was first
created with the mmcrcluster command.
If the /var/mmfs/gen/mmsdrfs file is removed by accident from any of the nodes, and an up-to-date
version of the file is present on the primary IBM Storage Scale cluster configuration server, restore the file
by issuing this command from the node on which it is missing:

mmsdrrestore -p primaryServer

where primaryServer is the name of the primary GPFS cluster configuration server.
If the /var/mmfs/gen/mmsdrfs file is not present on the primary GPFS cluster configuration server, but
is present on some other node in the cluster, restore the file by issuing these commands:

mmsdrrestore -p remoteNode -F remoteFile


mmchcluster -p LATEST

where remoteNode is the node that has an up-to-date version of the /var/mmfs/gen/mmsdrfs file and
remoteFile is the full path name of that file on that node.



One way to ensure that the latest version of the /var/mmfs/gen/mmsdrfs file is always available is to
use the mmsdrbackup user exit.

Automatic backup of the GPFS cluster data


This topic describes the procedure for automatically backing up the cluster data in IBM Storage Scale.
IBM Storage Scale provides a user exit, mmsdrbackup, that can be used to automatically back up the
IBM Storage Scale configuration data every time it changes. To activate this facility, follow these steps:
1. Modify the sample version of mmsdrbackup, /usr/lpp/mmfs/samples/mmsdrbackup.sample,
as described in its prologue, so that it backs up the mmsdrfs file in the way that you require.
2. Copy this modified mmsdrbackup.sample file to /var/mmfs/etc/mmsdrbackup on all of the nodes
in the cluster. Make sure that the permission bits for /var/mmfs/etc/mmsdrbackup are set to
permit execution by root.
The IBM Storage Scale system invokes the user-modified version of mmsdrbackup in /var/mmfs/etc
every time a change is made to the mmsdrfs file. This action performs the backup of the mmsdrfs file
according to the user's specifications. For more information on GPFS user exits, see the GPFS user exits
topic in the IBM Storage Scale: Command and Programming Reference Guide.
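As a sketch, after you edit the sample script as described in its prologue, the activation on each node
might look like this:

cp /usr/lpp/mmfs/samples/mmsdrbackup.sample /var/mmfs/etc/mmsdrbackup
chmod u+x /var/mmfs/etc/mmsdrbackup    # ensure that root can execute the user exit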

Installation of gpfs.gplbin reports an error


The depmod: ERROR: fstatat(4, mmfs26.ko): No such file or directory error message is
related to the installation of the gpfs.gplbin rpm.
During a kernel update, the Red Hat Driver Update program checks if kernel modules present for older
kernels are compatible with the new kernel. If the kernel modules are compatible with the new kernel,
the kernel creates symlinks in the weak-updates directory. Symlinks are created for mmfs26.ko,
mmfslinux.ko, and tracedev.ko object files of IBM Storage Scale.
If the system administrator removes a gpfs.gplbin rpm for older kernels after the kernel is updated
then the object files, mmfs26.ko, mmfslinux.ko, and tracedev.ko, get removed for older kernels.
The dangling symlinks that are created for the object files are left in the weak-updates directory for the
new kernel.
While installing the gpfs.gplbin rpm, the depmod -a command can result in the depmod: ERROR:
fstatat(4, mmfs26.ko): No such file or directory error as the object files, mmfs26.ko,
mmfslinux.ko, and tracedev.ko, are absent in the weak-updates folder. Because the dangling
symlinks are not used for the installation, upgrade, or execution of IBM Storage Scale, the error message
can be ignored.
To avoid the error message, run the following steps before you install the gpfs.gplbin rpm to clean up
the dangling symlinks:
1. After removal of a gpfs.gplbin rpm, determine the kernel version by running the uname -r
command, and then check the kernel weak-updates directory /lib/modules/<kernel version>
for broken symlinks using the ls -l command.
2. Remove dangling symlinks, if any.
3. Run the depmod -a command to confirm that there are no dangling symlinks.
4. Install the new gpfs.gplbin rpm.
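The preceding steps might look like the following sketch on a system with GNU find; the -xtype l test is
one way to locate broken symlinks:

uname -r                                                     # determine the running kernel version
find /lib/modules/$(uname -r)/weak-updates -xtype l -print   # list dangling symlinks, if any
find /lib/modules/$(uname -r)/weak-updates -xtype l -delete  # remove them
depmod -a                                                    # should now complete without errors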

GPFS application calls



Error numbers specific to GPFS application calls
This topic describes the error numbers specific to GPFS application calls.
When experiencing installation and configuration problems, GPFS may report these error numbers in the
operating system error log facility, or return them to an application:
ECONFIG = 215, Configuration invalid or inconsistent between different nodes.
This error is returned when the levels of software on different nodes cannot coexist. For information
about which levels may coexist, see the IBM Storage Scale FAQ in IBM Documentation.
ENO_QUOTA_INST = 237, No Quota management enabled.
To enable quotas for the file system issue the mmchfs -Q yes command. To disable quotas for the
file system issue the mmchfs -Q no command.
EOFFLINE = 208, Operation failed because a disk is offline
This error is most commonly returned when an open of a disk fails. Since GPFS attempts to continue
operation with failed disks, this error is returned when the disk is first needed to complete a
command or application request. If this return code occurs, check your disk subsystem for stopped
states and verify that the network path to the disk exists. In rare situations, this error is reported
if disk definitions are incorrect.
EALL_UNAVAIL = 218, A replicated read or write failed because none of the replicas were available.
Multiple disks in multiple failure groups are unavailable. Follow the procedures in Chapter 25, “Disk
issues,” on page 407 for unavailable disks.
6027-341 [D]
Node nodeName is incompatible because its maximum compatible version (number) is less than the
version of this node (number).
6027-342 [E]
Node nodeName is incompatible because its minimum compatible version is greater than the version
of this node (number).
6027-343 [E]
Node nodeName is incompatible because its version (number) is less than the minimum compatible
version of this node (number).
6027-344 [E]
Node nodeName is incompatible because its version is greater than the maximum compatible version
of this node (number).

GPFS modules cannot be loaded on Linux


You must build the GPFS portability layer binaries based on the kernel configuration of your system. For
more information, see The GPFS open source portability layer topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide. During mmstartup processing, GPFS loads the mmfslinux kernel
module.
Some of the more common problems that you may encounter are:
1. If the portability layer is not built, you may see messages similar to:

Mon Mar 26 20:56:30 EDT 2012: runmmfs starting


Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.32.12-0.6-ppc64/extra
runmmfs: The /lib/modules/2.6.32.12-0.6-ppc64/extra/mmfslinux.ko kernel extension does not
exist.
runmmfs: Unable to verify kernel/module configuration.
Loading modules from /lib/modules/2.6.32.12-0.6-ppc64/extra
runmmfs: The /lib/modules/2.6.32.12-0.6-ppc64/extra/mmfslinux.ko kernel extension does not
exist.
runmmfs: Unable to verify kernel/module configuration.
Mon Mar 26 20:56:30 EDT 2012 runmmfs: error in loading or unloading the mmfs kernel extension
Mon Mar 26 20:56:30 EDT 2012 runmmfs: stopping GPFS

2. The GPFS kernel modules, mmfslinux and tracedev, are built with a kernel version that differs from
that of the currently running Linux kernel. This situation can occur if the modules are built on another



node with a different kernel version and copied to this node, or if the node is rebooted using a kernel
with a different version.
3. If the mmfslinux module is incompatible with your system, you may experience a kernel panic on
GPFS startup. Ensure that the site.mcr has been configured properly from the site.mcr.proto,
and GPFS has been built and installed properly.
For more information about the mmfslinux module, see the Building the GPFS portability layer topic in
the IBM Storage Scale: Concepts, Planning, and Installation Guide.
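If your level of IBM Storage Scale includes the mmbuildgpl command, rebuilding the portability layer for
the currently running kernel might look like the following sketch; see the referenced installation guide
for the complete build procedure and prerequisites:

/usr/lpp/mmfs/bin/mmbuildgpl    # build the portability layer for the running kernel
mmstartup                       # then try starting GPFS again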

GPFS daemon issues


This topic describes the GPFS daemon issues that you might encounter while installing or configuring IBM
Storage Scale.

GPFS daemon does not come up


There are several indications that could lead you to the conclusion that the GPFS daemon (mmfsd) does
not come up and there are some steps to follow to correct the problem.
Those indications include:
• The file system has been enabled to mount automatically, but the mount has not completed.
• You issue a GPFS command and receive the message:
6027-665
Failed to connect to file system daemon: Connection refused.
• The GPFS log does not contain the message:
6027-300 [N]
mmfsd ready
• The GPFS log file contains this error message: 'Error: daemon and kernel extension do not match.' This
error indicates that the kernel extension currently loaded in memory and the daemon currently starting
have mismatching versions. This situation may arise if a GPFS code update has been applied, and the
node has not been rebooted prior to starting GPFS.
While GPFS scripts attempt to unload the old kernel extension during update and install operations,
such attempts may fail if the operating system is still referencing GPFS code and data structures. To
recover from this error, ensure that all GPFS file systems are successfully unmounted, and reboot the
node. The mmlsmount command can be used to ensure that all file systems are unmounted.

Steps to follow if the GPFS daemon does not come up


This topic describes the steps that you need to follow if the GPFS daemon does not come up after
installation of IBM Storage Scale.
1. See “GPFS modules cannot be loaded on Linux” on page 357 if your node is running Linux, to verify
that you have built the portability layer.
2. Verify that the GPFS daemon is active by issuing:

ps -e | grep mmfsd

The output of this command should list mmfsd as operational. For example:

12230 pts/8 00:00:00 mmfsd

If the output does not show this, the GPFS daemon needs to be started with the mmstartup
command.
3. If you did not specify the autoload option on the mmcrcluster or the mmchconfig command, you
need to manually start the daemon by issuing the mmstartup command.



If you specified the autoload option, someone may have issued the mmshutdown command. In this
case, issue the mmstartup command. When using autoload for the first time, mmstartup must be
run manually. The autoload takes effect on the next reboot.
4. Verify that the network upon which your GPFS cluster depends is up by issuing:

ping nodename

to each node in the cluster. A properly working network and node reply to the ping with no lost
packets.
Query the network interface that GPFS is using with:

netstat -i

A properly working network reports no transmission errors.


5. Verify that the GPFS cluster configuration data is available by looking in the GPFS log. If you see the
message:
6027-1592
Unable to retrieve GPFS cluster files from node nodeName.
Determine the problem with accessing node nodeName and correct it.
6. Verify that the GPFS environment is properly initialized by issuing these commands and ensuring that
the output is as expected.
• Issue the mmlscluster command to list the cluster configuration. This command also updates the
GPFS configuration data on the node. Correct any reported errors before continuing.
• List all file systems that were created in this cluster. For an AIX node, issue:

lsfs -v mmfs

For a Linux node, issue:

cat /etc/fstab | grep gpfs

If any of these commands produce unexpected results, this may be an indication of corrupted GPFS
cluster configuration data file information. Follow the procedures in “Information to be collected
before contacting the IBM Support Center” on page 555, and then contact the IBM Support Center.
7. GPFS requires a quorum of nodes to be active before any file system operations can be honored.
This requirement guarantees that a valid single token management domain exists for each GPFS file
system. Prior to the existence of a quorum, most requests are rejected with a message indicating that
quorum does not exist.
To identify which nodes in the cluster have daemons up or down, issue:

mmgetstate -L -a

If insufficient nodes are active to achieve quorum, go to any nodes not listed as active and perform
problem determination steps on these nodes. A quorum node indicates that it is part of a quorum by
writing an mmfsd ready message to the GPFS log. Remember that your system may have quorum
nodes and non-quorum nodes, and only quorum nodes are counted to achieve the quorum.
8. This step applies only to AIX nodes. Verify that GPFS kernel extension is not having problems with its
shared segment by invoking:

cat /var/adm/ras/mmfs.log.latest

Messages such as:


6027-319
Could not create shared segment.
must be corrected by the following procedure:



a. Issue the mmshutdown command.
b. Remove the shared segment in an AIX environment:
i) Issue the mmshutdown command.
ii) Issue the mmfsadm cleanup command.
c. If you are still unable to resolve the problem, reboot the node.
9. If the previous GPFS daemon was brought down and you are trying to start a new daemon but are
unable to, this is an indication that the original daemon did not completely go away. Go to that node
and check the state of GPFS. Stopping and restarting GPFS or rebooting this node often returns GPFS
to normal operation. If this fails, follow the procedures in “Additional information to collect for GPFS
daemon crashes” on page 557, and then contact the IBM Support Center.

Unable to start GPFS after the installation of a new release of GPFS


This topic describes the steps that you need to perform if you are unable to start GPFS after installing a
new version of IBM Storage Scale.
If one or more nodes in the cluster does not start GPFS, then these are the possible causes:
• If message:
6027-2700 [E]
A node join was rejected. This could be due to incompatible daemon versions, failure to find the
node in the configuration database, or no configuration manager found.
is written to the GPFS log, incompatible versions of GPFS code exist on nodes within the same cluster.
• If messages stating that functions are not supported are written to the GPFS log, then you may not have
the correct kernel extensions loaded.
1. Ensure that the latest GPFS install packages are loaded on your system.
2. If running on Linux, then ensure that the latest kernel extensions have been installed and built.
See the Building the GPFS portability layer topic in the IBM Storage Scale: Concepts, Planning, and
Installation Guide.
3. Reboot the GPFS node after an installation to ensure that the latest kernel extension is loaded.
• The daemon does not start because the configuration data was not migrated. See “Post installation and
configuration problems” on page 350.

GPFS error messages for shared segment and network problems


This topic describes the error messages relating to issues in shared segment and network in IBM Storage
Scale.
For shared segment problems, follow the problem determination and repair actions specified with the
following messages:
6027-319
Could not create shared segment.
6027-320
Could not map shared segment.
6027-321
Shared segment mapped at wrong address (is value, should be value).
6027-322
Could not map shared segment in kernel extension.
For network problems, follow the problem determination and repair actions specified with the following
message:
6027-306 [E]
Could not initialize inter-node communication



Error numbers specific to GPFS application calls when the daemon is unable
to come up
This topic describes the application call error numbers when the daemon is unable to come up.
When the daemon is unable to come up, GPFS may report these error numbers in the operating system
error log, or return them to an application:
ECONFIG = 215, Configuration invalid or inconsistent between different nodes.
This error is returned when the levels of software on different nodes cannot coexist. For information
about which levels may coexist, see the IBM Storage Scale FAQ in IBM Documentation.
6027-341 [D]
Node nodeName is incompatible because its maximum compatible version (number) is less than the
version of this node (number).
6027-342 [E]
Node nodeName is incompatible because its minimum compatible version is greater than the version
of this node (number).
6027-343 [E]
Node nodeName is incompatible because its version (number) is less than the minimum compatible
version of this node (number).
6027-344 [E]
Node nodeName is incompatible because its version is greater than the maximum compatible version
of this node (number).

GPFS daemon went down


There are a number of conditions that can cause the GPFS daemon to exit.
These are all conditions where the GPFS internal checking has determined that continued operation
would be dangerous to the consistency of your data. Some of these conditions are errors within GPFS
processing but most represent a failure of the surrounding environment.
In most cases, the daemon exits and restarts after recovery. If it is not safe to simply force the
unmounted file systems to recover, the GPFS daemon exits.
Indications leading you to the conclusion that the daemon went down:
• Applications running at the time of the failure see either ENODEV or ESTALE errors. The ENODEV errors
are generated by the operating system until the daemon has restarted. The ESTALE error is generated
by GPFS as soon as it restarts.
When quorum is lost, applications with open files receive an ESTALE error return code until the files
are closed and reopened. New file open operations fail until quorum is restored and the file system is
remounted. Applications that access these files before GPFS returns may receive an ENODEV return code
from the operating system.
• The GPFS log contains the message:
6027-650 [X]
The mmfs daemon is shutting down abnormally.
Most GPFS daemon down error messages are in the mmfs.log.previous log for the instance that
failed. If the daemon restarted, it generates a new mmfs.log.latest. Begin problem determination
for these errors by examining the operating system error log.
If an existing quorum is lost, GPFS stops all processing within the cluster to protect the integrity of your
data. GPFS attempts to rebuild a quorum of nodes and remounts the file system if automatic mounts are
specified.
• Open requests are rejected with no such file or no such directory errors.



When quorum has been lost, requests are rejected until the node has rejoined a valid quorum and
mounted its file systems. If messages indicate lack of quorum, follow the procedures in “GPFS daemon
does not come up” on page 358.
• Removing the setuid bit from the permission bits of any of the following IBM Storage Scale commands
can produce errors for non-root users:
mmdf
mmgetacl
mmlsdisk
mmlsfs
mmlsmgr
mmlspolicy
mmlsquota
mmlssnapshot
mmputacl
mmsnapdir
mmsnaplatest
The GPFS system-level versions of these commands (prefixed by ts) may need to be checked for how
permissions are set if non-root users see the following message:
6027-1209
GPFS is down on this node.
If the setuid bit is removed from the permission bits of any of the following system-level commands,
non-root users cannot execute the command and to non-root users the node appears to be down:
tsdf
tslsdisk
tslsfs
tslsmgr
tslspolicy
tslsquota
tslssnapshot
tssnapdir
tssnaplatest
These are found in the /usr/lpp/mmfs/bin directory.
Note: The mode bits for all listed commands are 4555 or -r-sr-xr-x. To restore the default (shipped)
permission, enter:

chmod 4555 tscommand

Attention: Only administration-level versions of GPFS commands (prefixed by mm) should be
executed. Executing system-level commands (prefixed by ts) directly produces unexpected results.
• For all other errors, follow the procedures in “Additional information to collect for GPFS daemon
crashes” on page 557, and then contact the IBM Support Center.

GPFS commands are unsuccessful


GPFS commands can be unsuccessful for various reasons.
Unsuccessful command results are indicated by:
• Return codes indicating the GPFS daemon is no longer running.
• Command specific problems indicating you are unable to access the disks.



• A nonzero return code from the GPFS command.
Some reasons that GPFS commands can be unsuccessful include:
1. If all commands are generically unsuccessful, this may be due to a daemon failure. Verify that the
GPFS daemon is active. Issue:

mmgetstate

If the daemon is not active, check /var/adm/ras/mmfs.log.latest and
/var/adm/ras/mmfs.log.previous on the local node and on the file system manager node. These files
enumerate the failing sequence of the GPFS daemon.
If there is a communication failure with the file system manager node, you receive an error and the
errno global variable may be set to EIO (I/O error).
2. Verify the GPFS cluster configuration data files are not locked and are accessible. To determine if the
GPFS cluster configuration data files are locked, see “GPFS cluster configuration data file issues” on
page 354.
3. The ssh command is not functioning correctly. See “Authorization problems” on page 353.
If ssh is not functioning properly on a node in the GPFS cluster, a GPFS administration command
that needs to run on that node fails with a 'permission is denied' error. The system displays
information similar to:

mmlscluster
sshd: 0826-813 Permission is denied.
mmdsh: 6027-1615 k145n02 remote shell process had return code 1.
mmlscluster: 6027-1591 Attention: Unable to retrieve GPFS cluster files from node k145n02
sshd: 0826-813 Permission is denied.
mmdsh: 6027-1615 k145n01 remote shell process had return code 1.
mmlscluster: 6027-1592 Unable to retrieve GPFS cluster files from node k145n01

These messages indicate that ssh is not working properly on nodes k145n01 and k145n02.
If you encounter this type of failure, determine why ssh is not working on the identified node. Then fix
the problem.
4. Most problems encountered during file system creation fall into three classes:
• You did not create network shared disks which are required to build the file system.
• The creation operation cannot access the disk.
Follow the procedures for checking access to the disk. This can result from a number of factors
including those described in “NSD and underlying disk subsystem failures” on page 407.
• Unsuccessful attempt to communicate with the file system manager.
The file system creation runs on the file system manager node. If that node goes down, the mmcrfs
command may not succeed.
5. If the mmdelnode command was unsuccessful and you plan to permanently de-install GPFS from a
node, you should first remove the node from the cluster. If this is not done and you run the mmdelnode
command after the mmfs code is removed, the command fails and displays a message similar to this
example:

Verifying GPFS is stopped on all affected nodes ...


k145n05: ksh: /usr/lpp/mmfs/bin/mmremote: not found.

If this happens, power off the node and run the mmdelnode command again.
6. If you have successfully installed and are operating with the latest level of GPFS, but cannot run the
new functions available, it is probable that you have not issued the mmchfs -V full or mmchfs -V
compat command to change the version of the file system. This command must be issued for each of
your file systems.
In addition to mmchfs -V, you may need to run the mmmigratefs command. See the File system
format changes between versions of GPFS topic in the IBM Storage Scale: Administration Guide.



Note: Before issuing the -V option (with full or compat), see Upgrading in IBM Storage Scale:
Concepts, Planning, and Installation Guide. You must ensure that all nodes in the cluster have been
migrated to the latest level of GPFS code and that you have successfully run the mmchconfig
release=LATEST command.
Make sure you have operated with the new level of code for some time and are certain you want
to migrate to the latest level of GPFS. Issue the mmchfs -V full command only after you have
definitely decided to accept the latest level, as this causes disk changes that are incompatible with
previous levels of GPFS.
For more information about the mmchfs command, see the IBM Storage Scale: Command and
Programming Reference Guide.

GPFS error messages for unsuccessful GPFS commands


This topic describes the error messages for unsuccessful GPFS commands.
If message 6027-538 is returned from the mmcrfs command, verify that the disk descriptors are
specified correctly and that all named disks exist and are online. Issue the mmlsnsd command to check
the disks.
6027-538
Error accessing disks.
If the daemon failed while running the command, you can see message 6027-663. Follow the procedures
in “GPFS daemon went down” on page 361.
6027-663
Lost connection to file system daemon.
If the daemon was not running when you issued the command, you can see message 6027-665. Follow
the procedures in “GPFS daemon does not come up” on page 358.
6027-665
Failed to connect to file system daemon: errorString.
When GPFS commands are unsuccessful, the system may display information similar to these error
messages:
6027-1627
The following nodes are not aware of the configuration server change: nodeList. Do not start GPFS on
the preceding nodes until the problem is resolved.

Quorum loss
Each GPFS cluster has a set of quorum nodes explicitly set by the cluster administrator.
These quorum nodes and the selected quorum algorithm determine the availability of file systems
owned by the cluster. For more information, see Quorum in IBM Storage Scale: Concepts, Planning, and
Installation Guide.
When quorum loss or loss of connectivity occurs, any nodes still running GPFS suspend the use of file
systems owned by the cluster experiencing the problem. This may result in GPFS access within the
suspended file system receiving ESTALE errors. Nodes that continue to function after suspending file
system access start contacting other nodes in the cluster in an attempt to rejoin or re-form the quorum.
If they
succeed in forming a quorum, access to the file system is restarted.
Normally, quorum loss or loss of connectivity occurs if a node goes down or becomes isolated from its
peers by a network failure. The expected response is to address the failing condition.



CES configuration issues
The following are the issues that you might encounter while configuring cluster export services in IBM
Storage Scale.
• Issue: The mmces command shows a socket-connection-error.
Error: Cannot connect to server(localhost), port(/var/mmfs/mmsysmon/mmsysmonitor.socket):
Connection refused
Solution: The mmsysmon-daemon is not running or is malfunctioning. Submit the mmsysmoncontrol
restart command to restore the functionality.
• Issue: The mmlscluster --ces command does not show any CES IPs, bound to the CES-nodes.
Solution: Either all CES nodes are unhealthy or no IPs are defined as CES IPs. Try out the following steps
to resolve this issue:
1. Use the mmces state show -a command to find out the nodes on which the CES service is in the
FAILED state. Using the ssh <nodeName> mmhealth node show command displays the component
that is creating the issue. In some cases, events are created if there are issues with the node health.
2. Use the mmces address list command to list the IPs that are defined as CES IPs. You can extend
this list by issuing the command mmces address add --ces-node <nodeName> --ces-ip
<ipAddress>, as shown in the sketch that follows this list.
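A sketch of such a command; the node name and the IP address are placeholders that you must replace
with a CES node of your cluster and an address from your CES address pool:

mmces address add --ces-node protocolNode1 --ces-ip 192.0.2.10

You can then confirm the assignment with the mmces address list command.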

Application program errors


When receiving application program errors, there are various courses of action to take.
Follow these steps to help resolve application program errors:
1. Loss of file system access usually appears first as an error received by an application. Such errors are
normally encountered when the application tries to access an unmounted file system.
The most common reason for losing access to a single file system is a failure somewhere in the path
to a large enough number of disks to jeopardize your data if operation continues. These errors may be
reported in the operating system error log on any node because they are logged on the first node to
detect the error. Check all error logs for errors.
The mmlsmount all -L command can be used to determine the nodes that have successfully
mounted a file system.
2. There are several cases where the state of a given disk subsystem prevents access by GPFS. This
instance is seen by the application as I/O errors of various types and is reported in the error logs
as MMFS_SYSTEM_UNMOUNT or MMFS_DISKFAIL records. This state can be found by issuing the
mmlsdisk command.
3. If allocation of data blocks or files (which quota limits should allow) fails, issue the mmlsquota
command for the user, group or fileset.
If filesets are involved, use these steps to determine which fileset was being accessed at the time of
the failure:
a. From the error messages generated, obtain the path name of the file being accessed.
b. Go to the directory just obtained, and use the mmlsattr -L command to obtain the fileset name:

mmlsattr -L . | grep "fileset name:"

The system produces output similar to:

fileset name: myFileset

c. Use the mmlsquota -j command to check the quota limit of the fileset. For example, using the
fileset name found in the previous step, issue this command:

mmlsquota -j myFileset -e



The system produces output similar to:

                           Block Limits                 |              File Limits
Filesystem type     KB  quota  limit  in_doubt  grace   | files  quota  limit  in_doubt  grace  Remarks
fs1        FILESET  2152    0      0         0   none   |   250      0    250         0   none

The mmlsquota output is similar when checking the user and group quota. If usage is equal to or
approaching the hard limit, or if the grace period has expired, make sure that no quotas are lost by
checking in doubt values.
If quotas are exceeded in the in doubt category, run the mmcheckquota command. For more
information, see “The mmcheckquota command” on page 325.
Note: There is no way to force GPFS nodes to relinquish all their local shares in order to check for
lost quotas. This can only be determined by running the mmcheckquota command immediately after
mounting the file system, and before any allocations are made. In this case, the value in doubt is the
amount lost.
To display the latest quota usage information, use the -e option on either the mmlsquota or the
mmrepquota commands. Remember that the mmquotaon and mmquotaoff commands do not enable
and disable quota management. These commands merely control enforcement of quota limits. Usage
continues to be counted and recorded in the quota files regardless of enforcement.
Reduce quota usage by deleting or compressing files or moving them out of the file system. Consider
increasing quota limit.
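For example, to display the latest quota usage before you decide on an action (the user name and the
file system name are placeholders):

mmlsquota -u someUser -e fs1      # latest usage for one user on file system fs1
mmrepquota -u -g -j -e fs1        # latest usage for users, groups, and filesets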

GPFS error messages for application program errors


This topic describes the error messages that IBM Storage Scale displays for application program errors.
Application program errors can be associated with these GPFS message numbers:
6027-506
program: loadFile is already loaded at address.
6027-695 [E]
File system is read-only.

Windows issues
The topics that follow apply to Windows Server 2008.

Home and .ssh directory ownership and permissions


This topic describes the issues related to .ssh directory ownership and permissions.
Make sure users own their home directories, which is not normally the case on Windows. They should also
own ~/.ssh and the files it contains. Here is an example of file attributes that work:

bash-3.00$ ls -l -d ~
drwx------ 1 demyn Domain Users 0 Dec 5 11:53 /dev/fs/D/Users/demyn
bash-3.00$ ls -l -d ~/.ssh
drwx------ 1 demyn Domain Users 0 Oct 26 13:37 /dev/fs/D/Users/demyn/.ssh
bash-3.00$ ls -l ~/.ssh
total 11
drwx------ 1 demyn Domain Users 0 Oct 26 13:37 .
drwx------ 1 demyn Domain Users 0 Dec 5 11:53 ..
-rw-r--r-- 1 demyn Domain Users 603 Oct 26 13:37 authorized_keys2
-rw------- 1 demyn Domain Users 672 Oct 26 13:33 id_dsa
-rw-r--r-- 1 demyn Domain Users 603 Oct 26 13:33 id_dsa.pub
-rw-r--r-- 1 demyn Domain Users 2230 Nov 11 07:57 known_hosts
bash-3.00$



Problems running as Administrator
You might have problems using SSH when running as the domain Administrator user. These issues do not
apply to other accounts, even if they are members of the Administrators group.

GPFS Windows and SMB2 protocol (CIFS serving)


SMB2 is a version of the Server Message Block (SMB) protocol that was introduced with Windows Vista
and Windows Server 2008.
Various enhancements include the following (among others):
• reduced "chattiness" of the protocol
• larger buffer sizes
• faster file transfers
• caching of metadata such as directory content and file properties
• better scalability by increasing the support for number of users, shares, and open files per server
The SMB2 protocol is negotiated between a client and the server during the establishment of the SMB
connection, and it becomes active only if both the client and the server are SMB2 capable. If either side is
not SMB2 capable, the default SMB (version 1) protocol gets used.
The SMB2 protocol does active metadata caching on the client redirector side, and it relies on Directory
Change Notification on the server to invalidate and refresh the client cache. However, GPFS on Windows
currently does not support Directory Change Notification. As a result, if SMB2 is used for serving out an
IBM Storage Scale file system, the SMB2 redirector cache on the client does not see any cache-invalidate
operations if the actual metadata is changed, either directly on the server or via another CIFS client.
In such a case, the SMB2 client continues to see its cached version of the directory contents until the
redirector cache expires. Therefore, the use of SMB2 protocol for CIFS sharing of GPFS file systems can
result in the CIFS clients seeing an inconsistent view of the actual GPFS namespace.
A workaround is to disable the SMB2 protocol on the CIFS server (that is, the GPFS compute node).
This action ensures that the SMB2 never gets negotiated for file transfer even if any CIFS client is SMB2
capable.
To disable SMB2 on the GPFS compute node, follow the instructions under the "MORE INFORMATION"
section in How to detect, enable and disable SMBv1, SMBv2, and SMBv3 in Windows.

Chapter 21. Upgrade issues
This topic describes the issues that you might encounter while upgrading IBM Storage Scale from one
version to another.

Installation toolkit setup command fails on RHEL 7.x nodes with setuptools package-related error
The installation toolkit setup command might fail on Red Hat Enterprise Linux 7.x nodes with a
setuptools package-related error.

Found existing Ansible installation on system.
ERROR! Unexpected Exception, this is probably a bug: setuptools>=11.3

This issue typically occurs because of a conflict with the python2-cryptography package installed
on the system. The conflicting python2-cryptography package name contains the string ibm. For
example, python2-cryptography-x.x.x.x.ibm.el7.
• Check whether the python2-cryptography package that is installed on the system has ibm in the
package name.

rpm -q python2-cryptography

• Check whether the object packages are installed.

rpm -qa | grep spectrum-scale-object

If the object protocol is enabled on your system and you want to continue using it, see Considerations
for upgrading from an operating system not supported in IBM Storage Scale 5.1.x.x in IBM Storage Scale:
Concepts, Planning, and Installation Guide.
Use the following workaround if:
• The object protocol is not installed or if you can uninstall the object protocol. You can check that the
object protocol is not enabled by using the mmces service list -a command.
• The correct python2-cryptography package is not installed.
Workaround:
1. Remove the spectrum-scale-object package if it exists.

yum remove spectrum-scale-object

2. Downgrade the python2-cryptography package to the system default version.

yum downgrade python2-cryptography

After doing the workaround steps, verify that the correct python2-cryptography package is installed
on the system. The correct package has the default version without ibm in the name. For example,
python2-cryptography-x.x.x.x.el7.

rpm -q python2-cryptography



Upgrade precheck by using the installation toolkit might fail on
protocol nodes in an ESS environment
The installation toolkit upgrade precheck might fail on protocol nodes in an Elastic Storage Server (ESS)
environment due to versionlocks being set on certain packages.
If a protocol node is provisioned through xCAT in an ESS environment, versionlocks are set on certain key
packages. If the OS is then upgraded on the node and then the installation toolkit is used for an upgrade
precheck, the precheck might fail.
Workaround:
1. View the list of versionlocks.

yum versionlock status

2. Clear the versionlocks.

yum versionlock clear

3. Rerun the upgrade precheck.

./spectrumscale upgrade precheck

Home cluster unable to unload mmfs modules for upgrades


The home cluster cannot unload the mmfs modules because of the NFS servers that are running on the
home cluster.
When the command mmshutdown is issued to shut down the IBM Storage Scale system for the upgrades,
the mmfs modules are unloaded.
Workaround:
Stop the NFS server and related NFS services and issue the mmshutdown command.
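A sketch of the sequence on a systemd-based node; the NFS service name is distribution-dependent, so
check your distribution for the exact unit names:

systemctl stop nfs-server     # stop the kernel NFS server
mmshutdown                    # GPFS can now unload the mmfs kernel modules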

File conflict issue while upgrading SLES on IBM Storage Scale nodes
While upgrading SLES on IBM Storage Scale nodes using the zypper up command, you might encounter
file conflicts. This occurs because of the installation of unnecessary, conflicting packages.
Workaround:
Do the SLES upgrade on IBM Storage Scale nodes using the zypper up --no-recommends command
to avoid the installation of conflicting packages.

NSD nodes cannot connect to storage after upgrading from SLES 15 SP1 to SP2
After upgrading from SLES 15 SP1 to SP2, NSD nodes might be unable to connect to the storage.
This occurs because of a change in the way regular expressions are evaluated in SLES 15. After this
change, glibc-provided regular expressions are used in SLES 15. Therefore, to match an arbitrary string,
you must now use ".*" instead of "*".
Workaround:
1. In the blocked section of the /etc/multipath.conf file, replace "*" with ".*".
2. Restart multipathd.service by issuing the systemctl restart multipathd.service
command.



3. Verify that LUNs from storage can be detected by issuing the multipath -ll command.
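As an illustration only, assuming your configuration blacklists all device nodes by default and lists the
storage devices as exceptions, the edited stanza and the follow-up commands might look like this:

blacklist {
        devnode ".*"        # was: devnode "*"; glibc regular expressions need ".*" to match any string
}

systemctl restart multipathd.service
multipath -ll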

After upgrading IBM Storage Scale code, trying to mark events as read returns a server error message
After upgrading IBM Storage Scale code, you might encounter an error message that says, "The server
was unable to process the request. Error invoking the RPC method..." when trying to mark events as read.
This topic describes a workaround.
If you encounter this problem, you can see the following error message:

The server was unable to process the request. Error invoking the RPC method:
com.ibm.gss.gui.logic.MonitorRPC.getEventsCount,com.ibm.evo.rpc.
java.lang.NoSuchMethodException: There is no such method
named 'getEventsCount' with the specified arguments

Workaround:
The workaround to this error is to refresh your browser cache.

RDMA adapters not supporting ATOMIC operations


Remote Direct Memory Access (RDMA) adapters must support atomic operations. For more information
about available configuration requirements for utilizing RDMA, refer to the IBM Storage Scale FAQ in IBM
Documentation.
This issue is indicated when the IBM Storage Scale log file contains the following warning message:

[W] VERBS RDMA open error verbsPort <port> due to missing support for atomic operations for
device <device>

Workaround:
Check the description of the verbsRdmaWriteFlush configuration variable in the mmchconfig command
topic in IBM Storage Scale: Command and Programming Reference Guide for possible options.

Chapter 22. CCR issues
When CCR loses its quorum, most of the IBM Storage Scale administrative commands do not work any
longer. This is because the mm-commands use CCR to ensure that they are working with the most recent
version of the configuration data.
The following tips help you to identify and avoid potential issues that might affect CCR:
• CCR communicates through port 1191. Before you create a new cluster or add a new node to an existing
cluster, make sure that this port is not blocked by a firewall.
• CCR does IP address lookup to communicate between CCR servers or between a CCR client and server.
Make sure that /etc/hosts entries (name resolution in general) for the quorum nodes are consistent
on all nodes in the cluster.
• CCR must work even if GPFS is shut down. When GPFS is shut down, a separate daemon, the
mmsdrserv daemon, is started to provide CCR services instead of the mmfsd daemon. The mmsdrserv
daemon has its own log file and it is available at /var/adm/ras/mmsdrserv.log. Examine this log
file on the quorum nodes if CCR runs into issues when GPFS is down.
• The mmccr check command can be used on any node in the cluster to verify whether the CCR is
accessible on a particular node and works as expected. The checks are initiated by the CCR client and
the CCR server responds where necessary. The CCR check generates two types of output depending on
whether the CCR command is issued on a quorum node or a non-quorum node. A sample output is as
follows:

# mmccr check -Ye


mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security,/var/mmfs/ccr/ccr.disks:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::node-21.localnet.com:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::node-21.localnet.com,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0:::1:OK:

The CCR check on the non-quorum node displays an output similar to this:

# mmccr check -Ye


mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::-1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security:OK:
mmccr::0:1:::-1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::-1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::-1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22:OK:

The following list provides descriptions for each CCR check item:
CCR_CLIENT_INIT
Verifies whether the CCR directory structure and files are complete and intact. It also verifies
whether the security layer that the CCR is using (GSKit) can be initialized successfully.
FC_CCR_AUTH_KEYS
Verifies that the CCR key file needed for authentication by the GSKit layer is available.
FC_CCR_PAXOS_CACHED and FC_CCR_PAXOS_12
Verify whether the CCR Paxos state files are available. On quorum nodes, these files are used during
CCR's consensus protocol. That is, a cached copy on every node in the cluster is used to speed up
the process in certain cases.



PC_LOCAL_SERVER
Pings the CCR server that is running on the local quorum node by sending a simple authenticated
RPC through the configured IP address for the quorum node. This check item is applicable only for
quorum nodes.
PC_IP_ADDR_LOOKUP
Measures the time the IP address lookup needs during the former simple ping RPC to the local
CCR server. If it exceeds a certain threshold, currently 5 seconds, this check returns a warning. This
check item is applicable only for quorum nodes.
PC_QUORUM_NODES
Pings all specified quorum nodes by sending a simple RPC through their configured IP addresses.
CCR uses the /var/mmfs/ccr/ccr.nodes file to look up the quorum nodes.
FC_COMMITTED_DIR
Verifies the integrity of the files in the /var/mmfs/ccr/committed directory. This check item is
applicable only for quorum nodes.
TC_TIEBREAKER_DISKS
Verifies whether the CCR server has access to the configured tiebreaker disks. This check item is
applicable only for quorum nodes.
• The mmccr echo command can be used to send a simple test string to the specified CCR server, as
shown in the following example:

# mmccr echo -n node-21,node-22 testString


echo testString
echo testString

With this command, the CCR client/server connection can be tested. If a server in the specified node
list does not echo the test string, the connection between this client and that server is not working.
In such scenarios, check whether port 1191 is blocked by a firewall or whether the IP address lookup
is wrong because of inconsistent /etc/hosts entries.
• The CCR_DEBUG environment variable can be used in a CCR client command to print detailed console
output for debug purposes, as shown in the following example:

# CCR_DEBUG=9 mmccr echo


Using /var/mmfs/ccr
Size of file '/var/mmfs/ccr/ccr.nodes': 76 bytes
readNodeList('/var/mmfs/ccr/ccr.nodes') size 2 COMMIT_PENDING NO ccrFileCommitVersion -1 nErr
0
updateNodeMaps: nodes: [(1, 'node-21.localnet.com', ('10.0.100.21', 1191)), (2,
'node-22.localnet.com', ('10.0.100.22', 1191))]
getNodeDataFromFile: nodeId: 5 daemonIpAddr: 10.0.100.25 adminNodeName: node-25.localnet.com
daemonNodeName: node-25.localnet.com quorum node: No
Reading '/var/mmfs/gen/mmfsNodeData' returned 0
'/var/mmfs/ccr/ccr.disks' does not exist
ccrio init: ip= port=1191 node=-1 (0)
setCcrSecurity: ccrSecEnabled=1
Auth: secReady 1 cipherList 'AUTHONLY' keyGenNumber 2
connecting to node 127.0.0.1:1191 (timeout -1, handshaketimeout -1)
No cached out-connection found for address: 127.0.0.1:1191 (0)
connected to node 127.0.0.1:1191 (sock 4-0x55fbcad5bdf0)
sending msg 2 'debug' len 6 (sock 4) msgCnt 44065 TO -1
sent msg 2 'debug' (sock 4) ok
receiving from 127.0.0.1:1191 (sock 4) TO -1 0x55FBCAD5BDF0
waiting for hdr (len 8)
In receive after recvRpcHdr sock 4 hdr '0x04 0x00 0x00 0x00 0x00 0x02 0x00 0x01' (8 bytes)
waiting for data (len 4)
received msg 0 'ok' len 4 (sock 4) msgCnt 1
closing connection to 127.0.0.1:1191 (sock 4-0x55fbcad5bdf0)
closeSocket: shutdown socket 4 returned rc 0
closeSocket: close socket 4 (linger: Yes) returned rc 0
debug response: err 0 type 0 (ok) len 4
echo
Command 'echo' returned err: 0 exit code: 0



Chapter 23. Network issues
This topic describes network issues that you might encounter while using IBM Storage Scale.
For firewall settings, see the topic Securing the IBM Storage Scale system using firewall in the IBM Storage
Scale: Administration Guide.

IBM Storage Scale failures due to a network failure


For proper functioning, GPFS depends both directly and indirectly on correct network operation.
This dependency is direct because various IBM Storage Scale internal messages flow on the network, and
it may also be indirect if the underlying disk technology depends on the network. Symptoms of an indirect
failure include the inability to complete I/O or GPFS moving disks to the down state.
The problem can also be first detected by the GPFS network communication layer. If network connectivity
is lost between nodes or the GPFS heartbeat services cannot sustain communication to a node, GPFS
declares the node dead and performs recovery procedures. This problem manifests itself by messages
appearing in the GPFS log such as:

Mon Jun 25 22:23:36.298 2018: Close connection to 192.168.10.109 c5n109. Attempting reconnect.
Mon Jun 25 22:23:37.300 2018: Connecting to 192.168.10.109 c5n109
Mon Jun 25 22:23:37.398 2018: Close connection to 192.168.10.109 c5n109
Mon Jun 25 22:23:38.338 2018: Recovering nodes: 9.114.132.109
Mon Jun 25 22:23:38.722 2018: Recovered 1 nodes.

Nodes mounting file systems owned and served by other clusters may receive error messages similar to
this:

Mon Jun 25 16:11:16 2018: Close connection to 89.116.94.81 k155n01


Mon Jun 25 16:11:21 2018: Lost membership in cluster remote.cluster. Unmounting file systems.

If a sufficient number of nodes fail, GPFS loses the quorum of nodes, which exhibits itself by messages
appearing in the GPFS log, similar to this:

Mon Jun 25 11:08:10 2018: Close connection to 179.32.65.4 gpfs2


Mon Jun 25 11:08:10 2018: Lost membership in cluster gpfsxx.kgn.ibm.com. Unmounting file system.

When either of these cases occurs, perform problem determination on your network connectivity. Failing
components could be network hardware such as switches or host bus adapters.

OpenSSH connection delays


OpenSSH can be sensitive to network configuration issues that often do not affect other system
components. One common symptom is a substantial delay (20 seconds or more) to establish a
connection. When the environment is configured correctly, a command such as ssh gandalf date
should only take one or two seconds to complete.
If you are using OpenSSH and experiencing an SSH connection delay (and if IPv6 is not supported in your
environment), try disabling IPv6 on your Windows nodes and remove or comment out any IPv6 addresses
from the /etc/resolv.conf file.
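For example, an IPv6 name server entry in /etc/resolv.conf can be commented out as shown; the
addresses are placeholders from the documentation ranges:

nameserver 192.0.2.53
# nameserver 2001:db8::53    # IPv6 resolver commented out to avoid SSH connection delays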

Analyze network problems with the mmnetverify command


You can use the mmnetverify command to detect network problems and to identify nodes where a
network problem exists.
The mmnetverify command is useful for detecting network problems and for identifying the type and
node location of a network problem. The command can run 16 types of network checks in the areas of
connectivity, ports, data, bandwidth, and flooding.



The following examples illustrate some of the uses of this command:
• Before you create a cluster, to verify that all your nodes are ready to be included in a cluster together,
you can run the following command:

mmnetverify --configuration-file File connectivity -N all

This command runs several types of connectivity checks between each node and all the other nodes in
the group and reports the results on the console. Because a cluster does not exist yet, you must include
a configuration file File in which you list all the nodes that you want to test.
• To check for network outages in a cluster, you can run the following command:

mmnetverify ping -N all

This command runs several types of ping checks between each node and all the other nodes in the
cluster and reports the results on the console.
• Before you make a node a quorum node, you can run the following check to verify that other nodes can
communicate with the daemon:

mmnetverify connectivity port

• To investigate a possible lag in large-data transfers between two nodes, you can run the following
command:

mmnetverify data-large -N node2 --target-nodes node3 --verbose --min-bandwidth Bandwidth

This command establishes a TCP connection from node2 to node3 and causes the two nodes to
exchange a series of large-sized data messages. If the bandwidth falls below the level that is specified,
the command generates an error. The output of the command to the console indicates the results of the
test.
• To analyze a problem with connectivity between nodes, you can run the following command:

mmnetverify connectivity -N all --target-nodes all --verbose --log-file File

This command runs connectivity checks between each node and all the other nodes in the cluster, one
pair at a time, and writes the results of each test to the console and to the specified log file.



Chapter 24. File system issues
Suspect a GPFS file system problem when a file system does not mount or unmount.
You can also suspect a file system problem if a file system unmounts unexpectedly, or you receive an
error message indicating that file system activity can no longer continue due to an error, and the file
system is being unmounted to preserve its integrity. Record all error messages and log entries that you
receive relative to the problem, making sure that you look on all affected nodes for this data.
These are some of the errors encountered with GPFS file systems:
• “File system fails to mount” on page 377
• “File system fails to unmount” on page 381
• “File system forced unmount” on page 382
• “Unable to determine whether a file system is mounted” on page 391
• “Multiple file system manager failures” on page 391
• “Discrepancy between GPFS configuration data and the on-disk data for a file system” on page 392
• “Errors associated with storage pools, filesets and policies” on page 393
• “Failures using the mmbackup command” on page 403
• “Snapshot problems” on page 400
• “Failures using the mmpmon command” on page 498
• “NFS issues” on page 443
• “File access failure from an SMB client with sharing conflict” on page 454
• “Data integrity” on page 404
• “Messages requeuing in AFM” on page 404

File system fails to mount


There are indications leading you to the conclusion that your file system does not mount and courses of
action you can take to correct the problem.
Some of those indications include:
• On performing a manual mount of the file system, you get errors from either the operating system or
GPFS.
• If the file system was created with the option of an automatic mount, you have failure return codes in
the GPFS log.
• Your application cannot access the data it needs. Check the GPFS log for messages.
• Return codes or error messages from the mmmount command.
• The mmlsmount command indicates that the file system is not mounted on certain nodes.
If your file system does not mount, then follow these steps:
1. On a quorum node in the cluster that owns the file system, verify that quorum has been achieved.
Check the GPFS log to see if an mmfsd ready message has been logged, and that no errors were
reported on this or other nodes.
2. Verify that a conflicting command is not running. This applies only to the cluster that owns the file
system. However, other clusters would be prevented from mounting the file system if a conflicting
command is running in the cluster that owns the file system.
For example, a mount command may not be issued while the mmfsck command is running. The
mount command may not be issued until the conflicting command completes. Note that interrupting



the mmfsck command is not a solution because the file system is not mountable until the command
completes. Try again after the conflicting command has completed.
3. Verify that sufficient disks are available to access the file system by issuing the mmlsdisk command.
GPFS requires a minimum number of disks to find a current copy of the core metadata. If sufficient
disks cannot be accessed, the mount fails. The corrective action is to fix the path to the disk. See
“NSD and underlying disk subsystem failures” on page 407.
Missing disks can also cause GPFS to be unable to find critical metadata structures. The output of
the mmlsdisk command shows any unavailable disks. If you have not specified metadata replication,
the failure of one disk may result in your file system being unable to mount. If you have specified
metadata replication, it requires two disks in different failure groups to disable the entire file system.
If there are down disks, issue the mmchdisk start command to restart them and retry the mount.
For a remote file system, mmlsdisk provides information about the disks of the file system. However
mmchdisk must be run from the cluster that owns the file system.
If there are no disks down, you can also look locally for error log reports, and follow the problem
determination and repair actions specified in your storage system vendor problem determination
guide. If the disk has failed, follow the procedures in “NSD and underlying disk subsystem failures”
on page 407.
4. Verify that communication paths to the other nodes are available. The lack of communication paths
between all nodes in the cluster may impede contact with the file system manager.
5. Verify that the file system is not already mounted. Issue the mount command.
6. Verify that the GPFS daemon on the file system manager is available. Run the mmlsmgr command
to determine which node is currently assigned as the file system manager. Run a trivial data access
command such as an ls on the mount point directory. If the command fails, see “GPFS daemon went
down” on page 361.
7. Check to see if the mount point directory exists and that there is an entry for the file system in
the /etc/fstab file (for Linux) or /etc/filesystems file (for AIX). The device name for a file
system mount point is listed in column one of the /etc/fstab entry or as a dev= attribute in
the /etc/filesystems stanza entry. A corresponding device name must also appear in the /dev
file system.
If any of these elements are missing, an update to the configuration information may not have been
propagated to this node. Issue the mmrefresh command to rebuild the configuration information on
the node and reissue the mmmount command.
Do not add GPFS file system information to /etc/filesystems (for AIX) or /etc/fstab (for
Linux) directly. If after running mmrefresh -f the file system information is still missing from /etc/
filesystems (for AIX) or /etc/fstab (for Linux), follow the procedures in “Information to be
collected before contacting the IBM Support Center” on page 555, and then contact the IBM Support
Center.
8. Check the number of file systems that are already mounted. There is a maximum number of 256
mounted file systems for a GPFS cluster. Remote file systems are included in this number.
9. If you issue mmchfs -V compat, it enables backward-compatible format changes only. Nodes in
remote clusters that were able to mount the file system before are still able to do so.
If you issue mmchfs -V full, it enables all new functions that require different on-disk data
structures. Nodes in remote clusters that run an older GPFS version are no longer able to mount
the file system. If there are any nodes running an older GPFS version that have the file system
mounted at the time this command is issued, the mmchfs command fails. For more information, see
the Completing the upgrade to a new level of IBM Storage Scale section in the IBM Storage Scale:
Concepts, Planning, and Installation Guide.
All nodes that access the file system must be upgraded to the same level of GPFS. Check for the
possibility that one or more of the nodes was accidently left out of an effort to upgrade a multi-node
system to a new GPFS release. If you need to return to the earlier level of GPFS, you must re-create
the file system from the backup medium and restore the content in order to access it.



10. If DMAPI is enabled for the file system, ensure that a data management application is started and has
set a disposition for the mount event. Refer to the IBM Storage Scale: Command and Programming
Reference Guide and the user's guide from your data management vendor. The data management
application must be started in the cluster that owns the file system. If the application is not started,
then other clusters are not able to mount the file system. Remote mounts of DMAPI managed file
systems may take much longer to complete than those not managed by DMAPI.
11. Issue the mmlsfs -A command to check whether the automatic mount option has been specified. If
automatic mount option is expected, check the GPFS log in the cluster that owns and serves the file
system, for progress reports indicating:

starting ...
mounting ...
mounted ....

12. If quotas are enabled, check if there was an error while reading quota files. See “MMFS_QUOTA” on
page 276.
13. Verify the maxblocksize configuration parameter on all clusters involved. If maxblocksize is less
than the block size of the local or remote file system you are attempting to mount, you cannot mount
it.
14. If the file system has encryption rules, see “Mount failure for a file system with encryption rules” on
page 435.
15. To mount a file system on a remote cluster, ensure that the cluster that owns and serves the file
system and the remote cluster have proper authorization in place. The authorization between clusters
is set up with the mmauth command.
Authorization errors on AIX are similar to the following:

c13c1apv6.gpfs.net: Failed to open remotefs.


c13c1apv6.gpfs.net: Permission denied
c13c1apv6.gpfs.net: Cannot mount /dev/remotefs on /gpfs/remotefs: Permission denied

Authorization errors on Linux are similar to the following:

mount: /dev/remotefs is write-protected, mounting read-only


mount: cannot mount /dev/remotefs read-only
mmmount: 6027-1639 Command failed. Examine previous error messages to determine cause.

For more information about mounting a file system that is owned and served by another GPFS cluster,
see the Mounting a remote GPFS file system topic in the IBM Storage Scale: Administration Guide.
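As a quick check of the authorization that is described in step 15, you can compare what each cluster reports about the other. This is a minimal sketch; adjust the cluster names to your environment. On the cluster that owns and serves the file system, issue:

mmauth show all

On the cluster that is attempting the mount, issue:

mmremotecluster show all

The key SHA digest and the access that is granted for the file system must agree on both sides.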

GPFS error messages for file system mount problems


There are error messages specific to file system reading, failure, mounting, and remounting.
6027-419
Failed to read a file system descriptor.
6027-482 [E]
Remount failed for device name: errnoDescription
6027-549
Failed to open name.
6027-580
Unable to access vital system metadata. Too many disks are unavailable.
6027-645
Attention: mmcommon getEFOptions fileSystem failed. Checking fileName.



Error numbers specific to GPFS application calls when a file system mount is
not successful
There are specific error numbers for unsuccessful file system mounting.
When a mount of a file system is not successful, GPFS may report these error numbers in the operating
system error log or return them to an application:
ENO_QUOTA_INST = 237, No Quota management enabled.
To enable quotas for the file system, issue the mmchfs -Q yes command. To disable quotas for the
file system issue the mmchfs -Q no command.

Mount failure due to client nodes joining before NSD servers are online
While a file system is being mounted, especially during automounting, if a client node joins the GPFS
cluster and attempts file system access before the file system's NSD servers are active, the mount fails.
Use the mmchconfig command to specify the amount of time for GPFS mount requests to wait for an NSD
server to join the cluster.
If a client node joins the GPFS cluster and attempts file system access prior to the file system's NSD
servers being active, the mount fails. This is especially true when automount is used. This situation can
occur during cluster startup, or any time that an NSD server is brought online with client nodes already
active and attempting to mount a file system served by the NSD server.
The file system mount failure produces a message similar to this:

Mon Jun 25 11:23:34 EST 2007: mmmount: Mounting file systems ...
No such device
Some file system data are inaccessible at this time.
Check error log for additional information.
After correcting the problem, the file system must be unmounted and then
mounted again to restore normal data access.
Failed to open fs1.
No such device
Some file system data are inaccessible at this time.
Cannot mount /dev/fs1 on /fs1: Missing file or filesystem

The GPFS log contains information similar to this:

Mon Jun 25 11:23:54 2007: Command: mount fs1 32414


Mon Jun 25 11:23:58 2007: Disk failure. Volume fs1. rc = 19. Physical volume sdcnsd.
Mon Jun 25 11:23:58 2007: Disk failure. Volume fs1. rc = 19. Physical volume sddnsd.
Mon Jun 25 11:23:58 2007: Disk failure. Volume fs1. rc = 19. Physical volume sdensd.
Mon Jun 25 11:23:58 2007: Disk failure. Volume fs1. rc = 19. Physical volume sdgnsd.
Mon Jun 25 11:23:58 2007: Disk failure. Volume fs1. rc = 19. Physical volume sdhnsd.
Mon Jun 25 11:23:58 2007: Disk failure. Volume fs1. rc = 19. Physical volume sdinsd.
Mon Jun 25 11:23:58 2007: File System fs1 unmounted by the system with return code 19
reason code 0
Mon Jun 25 11:23:58 2007: No such device
Mon Jun 25 11:23:58 2007: File system manager takeover failed.
Mon Jun 25 11:23:58 2007: No such device
Mon Jun 25 11:23:58 2007: Command: err 52: mount fs1 32414
Mon Jun 25 11:23:58 2007: Missing file or filesystem

Two mmchconfig command options are used to specify the amount of time for GPFS mount requests to
wait for an NSD server to join the cluster:
nsdServerWaitTimeForMount
Specifies the number of seconds to wait for an NSD server to come up at GPFS cluster startup time,
after a quorum loss, or after an NSD server failure.
Valid values are between 0 and 1200 seconds. The default is 300. The interval for checking is
10 seconds. If nsdServerWaitTimeForMount is 0, nsdServerWaitTimeWindowOnMount has no
effect.
nsdServerWaitTimeWindowOnMount
Specifies a time window to determine if quorum is to be considered recently formed.



Valid values are between 1 and 1200 seconds. The default is 600. If nsdServerWaitTimeForMount
is 0, nsdServerWaitTimeWindowOnMount has no effect.
The GPFS daemon need not be restarted in order to change these values. The scope of these two
operands is the GPFS cluster. The -N flag can be used to set different values on different nodes. In this
case, the settings on the file system manager node take precedence over the settings of nodes trying to
access the file system.
When a node rejoins the cluster (after it was expelled, experienced a communications problem, lost
quorum, or other reason for which it dropped connection and rejoined), that node resets all the failure
times that it knows about. Therefore, when a node rejoins it sees the NSD servers as never having failed.
From the node's point of view, it has rejoined the cluster and old failure information is no longer relevant.
GPFS checks the cluster formation criteria first. If that check falls outside the window, GPFS then checks
for NSD server fail times being within the window.
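For example, to lengthen both wait times across the cluster, you might issue a command similar to the following. The values shown are illustrative only, not recommendations:

mmchconfig nsdServerWaitTimeForMount=600,nsdServerWaitTimeWindowOnMount=900

Because these values take effect without restarting the GPFS daemon, the next mount attempt honors the new settings.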

File system fails to unmount


There are several indications that a file system cannot be unmounted, and corresponding courses of
action that you can take to correct the problem.
Those indications include:
• Return codes or error messages indicate the file system does not unmount.
• The mmlsmount command indicates that the file system is still mounted on one or more nodes.
• Return codes or error messages from the mmumount command.
If your file system does not unmount, then follow these steps:
1. If you get an error message similar to:

umount: /gpfs1: device is busy

the file system does not unmount until all processes are finished accessing it. If mmfsd is up, the
processes accessing the file system can be determined. See “The lsof command” on page 318. These
processes can be killed with the command:

lsof filesystem | grep -v COMMAND | awk '{print $2}' | xargs kill -9

If mmfsd is not operational, the lsof command is not able to determine which processes are still
accessing the file system.
For Linux nodes it is possible to use the /proc pseudo file system to determine current file access.
For each process currently running on the system, there is a subdirectory /proc/pid/fd, where pid
is the numeric process ID number. This subdirectory is populated with symbolic links pointing to the
files that this process has open. You can examine the contents of the fd subdirectory for all running
processes, manually or with the help of a simple script (see the sketch after this list), to identify the
processes that have open files in GPFS file systems. Terminating all of these processes may allow the file system to unmount
successfully.
To unmount a CES protocol node, suspend the CES function using the following command:

mmces node suspend

• Stop the NFS service using the following command:

mmces service stop NFS

• Stop the SMB service using the following command:

mmces service stop SMB

• Stop the Object service using the following command:



mmces service stop OBJ

2. Verify that there are no disk media failures.


Look on the NSD server node for error log entries. Identify any NSD server node that has generated an
error log entry. See “Disk media failure” on page 415 for problem determination and repair actions to
follow.
3. If the file system must be unmounted, you can force the unmount by issuing the mmumount -f
command:
Note:
a. See “File system forced unmount” on page 382 for the consequences of doing this.
b. Before forcing the unmount of the file system, issue the lsof command and close any files that are
open.
c. On Linux, you might encounter a situation where a GPFS file system cannot be unmounted, even
if you issue the mmumount -f command. In this case, you must reboot the node to clear the
condition. You can also try the system umount command before you reboot. For example:

umount -f /fileSystem

4. If a file system that is mounted by a remote cluster needs to be unmounted, you can force the
unmount by issuing the command:

mmumount fileSystem -f -C RemoteClusterName
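The /proc check that is described in step 1 can be scripted. The following sketch, which assumes a Linux node and a hypothetical mount point of /gpfs1, prints the IDs of processes that have files open under that mount point; run it as root so that every /proc/pid/fd directory can be read:

for pid in /proc/[0-9]*; do
    ls -l "$pid/fd" 2>/dev/null | grep -q '/gpfs1' && echo "${pid##*/}"
done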

Remote node expelled after remote file system successfully mounted
This problem produces 'node expelled from cluster' messages.
One cause of this condition is when the subnets attribute of the mmchconfig command has been used
to specify subnets to GPFS, and there is an incorrect netmask specification on one or more nodes of the
clusters involved in the remote mount. Check to be sure that all netmasks are correct for the network
interfaces used for GPFS communication.
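To review the subnets that are configured for GPFS and the netmasks that are actually in use, you can run commands similar to the following on each node that participates in the remote mount (use ifconfig -a instead of ip addr on AIX nodes):

mmlsconfig subnets
ip addr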

File system forced unmount


There are several indications that a file system has been forced to unmount, and various courses of
action that you can take to correct the problem.
Those indications are:
• Forced unmount messages in the GPFS log.
• Your application no longer has access to data.
• Your application is getting ESTALE or ENOENT return codes.
• Multiple unsuccessful attempts to appoint a file system manager may cause the cluster manager to
unmount the file system everywhere.
Such situations involve the failure of paths to disk resources from many, if not all, nodes. The underlying
problem may be at the disk subsystem level, or lower. The error logs for each node that unsuccessfully
attempted to appoint a file system manager contain records of a file system unmount with an error that
is either coded 212, or that occurred when attempting to assume management of the file system. Note
that these errors apply to a specific file system although it is possible that shared disk communication
paths cause the unmount of multiple file systems.
• File system unmounts with an error indicating too many disks are unavailable.
The mmlsmount -L command can be used to determine which nodes currently have a given file system
mounted.



If your file system has been forced to unmount, follow these steps:
1. With the failure of a single disk, if you have not specified multiple failure groups and replication
of metadata, GPFS is not able to continue because it cannot write logs or other critical metadata.
If you have specified multiple failure groups and replication of metadata, the failure of multiple
disks in different failure groups put you in the same position. In either of these situations, GPFS
forcibly unmounts the file system. This action is indicated in the error log by records indicating exactly
which access failed, with an MMFS_SYSTEM_UNMOUNT record indicating the forced unmount. The user
response to this is to take the needed actions to restore the disk access and issue the mmchdisk
command to disks that are shown as down in the information displayed by the mmlsdisk command.
2. Internal errors in processing data on a single file system may cause loss of file system access. These
errors may clear with the invocation of the umount command, followed by a remount of the file
system, but they should be reported as problems to the IBM Support Center.
3. If an MMFS_QUOTA error log entry containing Error writing quota file... is generated, the
quota manager continues operation if the next write for the user, group, or fileset is successful. If not,
further allocations to the file system fail. Check the error code in the log and make sure that the disks
containing the quota file are accessible. Run the mmcheckquota command. For more information, see
“The mmcheckquota command” on page 325.
If the file system must be repaired without quotas:
a. Disable quota management by issuing the command:

mmchfs Device -Q no

b. Issue the mmmount command for the file system.


c. Make any necessary repairs and install the backup quota files.
d. Issue the mmumount -a command for the file system.
e. Restore quota management by issuing the mmchfs Device -Q yes command.
f. Run the mmcheckquota command with the -u, -g, and -j options. For more information, see “The
mmcheckquota command” on page 325.
g. Issue the mmmount command for the file system.
4. If errors indicate that too many disks are unavailable, see “Additional failure group considerations” on
page 383.

Additional failure group considerations


GPFS maintains a file system descriptor that is replicated on a subset of the disks and is updated as
changes to the file system occur, such as adding or deleting disks. To reduce the risk of losing multiple
replicas at the same time, GPFS picks disks in different failure groups to hold the replicas.
There is a structure in GPFS called the file system descriptor that is initially written to every disk in the
file system, but is replicated on a subset of the disks as changes to the file system occur, such as adding
or deleting disks. Based on the number of failure groups and disks, GPFS creates between one and five
replicas of the descriptor:
• If there are at least five different failure groups, five replicas are created.
• If there are at least three different disks, three replicas are created.
• If there are only one or two disks, a replica is created on each disk.
Once it is decided how many replicas to create, GPFS picks disks to hold the replicas, so that all replicas
are in different failure groups, if possible, to reduce the risk of multiple failures. In picking replica
locations, the current state of the disks is taken into account. Stopped or suspended disks are avoided.
Similarly, when a failed disk is brought back online, GPFS may modify the subset to rebalance the file
system descriptors across the failure groups. The subset can be found by issuing the mmlsdisk -L
command.



GPFS requires a majority of the replicas on the subset of disks to remain available to sustain file system
operations:
• If there are at least five different failure groups, then GPFS is able to tolerate a loss of two of the five
groups. If disks out of three different failure groups are lost, the file system descriptor may become
inaccessible due to the loss of the majority of the replicas.
• If there are at least three different failure groups, then GPFS is able to tolerate a loss of one of the
three groups. If disks out of two different failure groups are lost, the file system descriptor may become
inaccessible due to the loss of the majority of the replicas.
• If there are fewer than three failure groups, then a loss of one failure group may make the descriptor
inaccessible.
If the subset consists of three disks and there are only two failure groups, one failure group must hold
two of the replica disks and the other failure group holds one. If the failure group that disappears is the
one that holds only the single replica disk, the file system stays up: the file system descriptor is moved
to a new subset by updating the remaining two copies and writing the update to a new disk that is added
to the subset. But if the downed failure group contains a majority of the subset, the file system descriptor
cannot be updated and the file system has to be forced unmounted.
Introducing a third failure group consisting of a single disk that is used solely for the purpose of
maintaining a copy of the file system descriptor can help prevent such a scenario. You can designate
this disk by using the descOnly designation for disk usage on the disk descriptor. For more information
on disk replication, see the NSD creation considerations topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide and the Data mirroring and replication topic in the IBM Storage Scale:
Administration Guide.
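A sketch of adding such a descOnly tiebreaker disk follows. The device name, NSD name, server, failure group number, and file system name are illustrative; adapt them to your configuration. First describe the disk in an NSD stanza file, for example desc.stanza:

%nsd: device=/dev/sdx
  nsd=descnsd1
  servers=nodeA
  usage=descOnly
  failureGroup=3

Then create the NSD and add it to the file system:

mmcrnsd -F desc.stanza
mmadddisk fs1 -F desc.stanza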

GPFS error messages for file system forced unmount problems


There are many error messages for file system forced unmount problems due to unavailable disk space.
Indications there are not enough disks available:
6027-418
Inconsistent file system quorum. readQuorum=value writeQuorum=value quorumSize=value.
6027-419
Failed to read a file system descriptor.
Indications the file system has been forced to unmount:
6027-473 [X]
File System fileSystem unmounted by the system with return code value reason code value
6027-474 [X]
Recovery Log I/O failed, unmounting file system fileSystem

Error numbers specific to GPFS application calls when a file system has
been forced to unmount
There are error numbers to indicate that a file system is forced to unmount for GPFS application calls.
When a file system has been forced to unmount, GPFS may report these error numbers in the operating
system error log or return them to an application:
EPANIC = 666, A file system has been forcibly unmounted because of an error. Most likely due to the
failure of one or more disks containing the last copy of metadata.
See “Operating system error logs” on page 274 for details.
EALL_UNAVAIL = 218, A replicated read or write failed because none of the replicas were available.
Multiple disks in multiple failure groups are unavailable. Follow the procedures in Chapter 25, “Disk
issues,” on page 407 for unavailable disks.



Automount file system does not mount
The automount fails to mount the file system and the courses of action that you can take to correct the
problem.
If an automount fails when you cd into the mount point directory, first check that the file system in
question is of automount type. Use the mmlsfs -A command for local file systems. Use the mmremotefs
show command for remote file systems.

Steps to follow if automount fails to mount on Linux


There are courses of action that you can take if automount fails to mount the file system on a Linux system.
On Linux, perform these steps:
1. Verify that the GPFS file system mount point is actually a symbolic link to a directory in the
automountdir directory. If automountdir=/gpfs/automountdir then the mount point /gpfs/
gpfs66 would be a symbolic link to /gpfs/automountdir/gpfs66.
a. First, verify that GPFS is up and running.
b. Use the mmlsconfig command to verify the automountdir directory. The default
automountdir is named /gpfs/automountdir. If the GPFS file system mount point is not a
symbolic link to the GPFS automountdir directory, then accessing the mount point does not
cause the automounter to mount the file system.
c. If the command /bin/ls -ld of the mount point shows a directory, then run the command
mmrefresh -f. If the directory is empty, the command mmrefresh -f removes the directory
and creates a symbolic link. If the directory is not empty, you need to move or remove the files
contained in that directory, or change the mount point of the file system. For a local file system, use
the mmchfs command. For a remote file system, use the mmremotefs command.
d. Once the mount point directory is empty, run the mmrefresh -f command.
2. Verify that the autofs mount has been established. Issue this command:

mount | grep automount

The output must be similar to this:

automount(pid20331) on /gpfs/automountdir type autofs


(rw,fd=5,pgrp=20331,minproto=2,maxproto=3)

For Red Hat Enterprise Linux 5, verify the following line is in the default master map file (/etc/
auto.master):

/gpfs/automountdir program:/usr/lpp/mmfs/bin/mmdynamicmap

For example, issue:

grep mmdynamicmap /etc/auto.master

Output should be similar to this:

/gpfs/automountdir program:/usr/lpp/mmfs/bin/mmdynamicmap

This is an autofs program map, and there is a single mount entry for all GPFS automounted file
systems. The symbolic link points to this directory, and access through the symbolic link triggers the
mounting of the target GPFS file system. To create this GPFS autofs mount, issue the mmcommon
startAutomounter command, or stop and restart GPFS using the mmshutdown and mmstartup
commands.
3. Verify that the automount daemon is running. Issue this command:

ps -ef | grep automount



The output must be similar to this:

root 5116 1 0 Jun25 pts/0 00:00:00 /usr/sbin/automount /gpfs/automountdir program


/usr/lpp/mmfs/bin/mmdynamicmap

For Red Hat Enterprise Linux 5, verify that the autofs daemon is running. Issue this command:

ps -ef | grep automount

The output must be similar to this:

root 22646 1 0 01:21 ? 00:00:02 automount

To start the automount daemon, issue the mmcommon startAutomounter command, or stop and
restart GPFS using the mmshutdown and mmstartup commands.
Note: If automountdir is mounted (as in step 2) and the mmcommon startAutomounter command
is not able to bring up the automount daemon, manually umount the automountdir before issuing
the mmcommon startAutomounter again.
4. Verify that the mount command was issued to GPFS by examining the GPFS log. You should see
something like this:

Mon Jun 25 11:33:03 2004: Command: mount gpfsx2.kgn.ibm.com:gpfs55 5182

5. Examine /var/log/messages for autofs error messages. The following is an example of what you
might see if the remote file system name does not exist.

Jun 25 11:33:03 linux automount[20331]: attempting to mount entry /gpfs/automountdir/gpfs55


Jun 25 11:33:04 linux automount[28911]: >> Failed to open gpfs55.
Jun 25 11:33:04 linux automount[28911]: >> No such device
Jun 25 11:33:04 linux automount[28911]: >> mount: fs type gpfs not supported by kernel
Jun 25 11:33:04 linux automount[28911]: mount(generic): failed to mount /dev/gpfs55 (type
gpfs)
on /gpfs/automountdir/gpfs55

6. After you have established that GPFS has received a mount request from autofs (Step “4” on page
386) and that mount request failed (Step “5” on page 386), issue a mount command for the GPFS file
system and follow the directions in “File system fails to mount” on page 377.

Steps to follow if automount fails to mount on AIX


There are courses of action that you can take if automount fails to mount the file system on an AIX server.
On AIX, perform these steps:
1. First, verify that GPFS is up and running.
2. Verify that GPFS has established autofs mounts for each automount file system. Issue the following
command:

mount | grep autofs

The output is similar to this:

/var/mmfs/gen/mmDirectMap /gpfs/gpfs55 autofs Jun 25 15:03 ignore


/var/mmfs/gen/mmDirectMap /gpfs/gpfs88 autofs Jun 25 15:03 ignore

These are direct mount autofs mount entries. Each GPFS automount file system has an autofs
mount entry. These autofs direct mounts allow GPFS to mount on the GPFS mount point. To create
any missing GPFS autofs mounts, issue the mmcommon startAutomounter command, or stop and
restart GPFS using the mmshutdown and mmstartup commands.
3. Verify that the autofs daemon is running. Issue this command:

ps -ef | grep automount



Output is similar to this:

root 9820 4240 0 15:02:50 - 0:00 /usr/sbin/automountd

To start the automount daemon, issue the mmcommon startAutomounter command, or stop and
restart GPFS using the mmshutdown and mmstartup commands.
4. Verify that the mount command was issued to GPFS by examining the GPFS log. You should see
something like this:

Mon Jun 25 11:33:03 2007: Command: mount gpfsx2.kgn.ibm.com:gpfs55 5182

5. Since the autofs daemon logs status using syslogd, examine the syslogd log file for status
information from automountd. Here is an example of a failed automount request:

Jun 25 15:55:25 gpfsa1 automountd [9820 ] :mount of /gpfs/gpfs55:status 13

6. After you have established that GPFS has received a mount request from autofs (Step “4” on page
387) and that mount request failed (Step “5” on page 387), issue a mount command for the GPFS file
system and follow the directions in “File system fails to mount” on page 377.
7. If automount fails for a non-GPFS file system and you are using file /etc/auto.master, use
file /etc/auto_master instead. Add the entries from /etc/auto.master to /etc/auto_master
and restart the automount daemon.

Remote file system does not mount


The reasons why a remote file system might fail to mount, and the courses of action that you can take to
resolve the issue.
When a remote file system does not mount, the problem might be with how the file system was defined to
both the local and remote nodes, or the communication paths between them. Review the Mounting a file
system owned and served by another GPFS cluster topic in the IBM Storage Scale: Administration Guide to
ensure that your setup is correct.
These are some of the errors encountered when mounting remote file systems:
• “Remote file system I/O fails with the “Function not implemented” error message when UID mapping is
enabled” on page 387
• “Remote file system does not mount due to differing GPFS cluster security configurations” on page 388
• “Cannot resolve contact node address” on page 388
• “The remote cluster name does not match the cluster name supplied by the mmremotecluster
command” on page 389
• “Contact nodes down or GPFS down on contact nodes” on page 389
• “GPFS is not running on the local node” on page 390
• “The NSD disk does not have an NSD server specified, and the mounting cluster does not have direct
access to the disks” on page 390
• “The cipherList option has not been set properly” on page 390
• “Remote mounts fail with the permission denied error message” on page 391

Remote file system I/O fails with the “Function not implemented” error
message when UID mapping is enabled
There are error messages that are displayed when a remote file system has an I/O failure, and a course of
action that you can take to correct this issue.
When user ID (UID) mapping in a multi-cluster environment is enabled, certain kinds of mapping
infrastructure configuration problems might result in I/O requests on a remote file system failing:



ls -l /fs1/testfile
ls: /fs1/testfile: Function not implemented

To troubleshoot this error, verify the following configuration details:


1. That /var/mmfs/etc/mmuid2name and /var/mmfs/etc/mmname2uid helper scripts are present and
executable on all nodes in the local cluster and on all quorum nodes in the file system home cluster,
along with any data files needed by the helper scripts.
2. That UID mapping is enabled in both local cluster and remote file system home cluster configuration
by issuing the mmlsconfig enableUIDremap command.
3. That UID mapping helper scripts are working correctly.
For more information about configuring UID mapping, see the IBM white paper UID Mapping for GPFS in a
Multi-cluster Environment (https://www.ibm.com/docs/en/storage-scale?topic=STXKQY/uid_gpfs.pdf).
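The first two checks can be performed with commands similar to the following; run them on the nodes in the local cluster and on the quorum nodes in the file system home cluster:

ls -l /var/mmfs/etc/mmuid2name /var/mmfs/etc/mmname2uid
mmlsconfig enableUIDremap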

Remote file system does not mount due to differing GPFS cluster security
configurations
There are indications that the remote file system does not mount, and courses of action that you can
take to correct the problem.
A mount command fails with a message similar to this:

Cannot mount gpfsxx2.ibm.com:gpfs66: Host is down.

The GPFS log on the cluster issuing the mount command should have entries similar to these:

There is more information in the log file /var/adm/ras/mmfs.log.latest


Mon Jun 25 16:39:27 2007: Waiting to join remote cluster gpfsxx2.ibm.com
Mon Jun 25 16:39:27 2007: Command: mount gpfsxx2.ibm.com:gpfs66 30291
Mon Jun 25 16:39:27 2007: The administrator of 199.13.68.12 gpfslx2 requires
secure connections. Contact the administrator to obtain the target clusters
key and register the key using "mmremotecluster update".
Mon Jun 25 16:39:27 2007: A node join was rejected. This could be due to
incompatible daemon versions, failure to find the node
in the configuration database, or no configuration manager found.
Mon Jun 25 16:39:27 2007: Failed to join remote cluster gpfsxx2.ibm.com
Mon Jun 25 16:39:27 2007: Command err 693: mount gpfsxx2.ibm.com:gpfs66 30291

The GPFS log file on the cluster that owns and serves the file system has an entry indicating the problem
as well, similar to this:

Mon Jun 25 16:32:21 2007: Kill accepted connection from 199.13.68.12 because security is
required, err 74

To resolve this problem, contact the administrator of the cluster that owns and serves the file system to
obtain the key and register the key by using the mmremotecluster command.
The SHA digest field of the mmauth show and mmremotecluster commands may be used to determine
if there is a key mismatch, and on which cluster the key should be updated. For more information on the
SHA digest, see “The SHA digest” on page 329.
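A sketch of the key registration follows; the remote cluster name is taken from the preceding example and the key file name is illustrative. After the administrator of the owning cluster provides that cluster's public key file, register it on the cluster that is attempting the mount:

mmremotecluster update gpfsxx2.ibm.com -k gpfsxx2_id_rsa.pub

You can then run mmauth show and mmremotecluster show on the respective clusters and confirm that the SHA digest values match.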

Cannot resolve contact node address


There are error messages that are displayed if the contact node address cannot be resolved, and
courses of action that you can take to correct the problem.
The following error may occur if the contact nodes for gpfsyy2.ibm.com could not be resolved. You
would expect to see this if your DNS server was down, or the contact address has been deleted.

Mon Jun 25 15:24:14 2007: Command: mount gpfsyy2.ibm.com:gpfs14 20124


Mon Jun 25 15:24:14 2007: Host 'gpfs123.ibm.com' in gpfsyy2.ibm.com is not valid.
Mon Jun 25 15:24:14 2007: Command err 2: mount gpfsyy2.ibm.com:gpfs14 20124



To resolve the problem, correct the contact list and try the mount again.

The remote cluster name does not match the cluster name supplied by the
mmremotecluster command
There are error messages that are displayed when the remote cluster name does not match the
cluster name that is provided by the mmremotecluster command, and courses of action that you can
take to correct the problem.
A mount command fails with a message similar to this:

Cannot mount gpfslx2:gpfs66: Network is unreachable

and the GPFS log contains message similar to this:

Mon Jun 25 12:47:18 2007: Waiting to join remote cluster gpfslx2


Mon Jun 25 12:47:18 2007: Command: mount gpfslx2:gpfs66 27226
Mon Jun 25 12:47:18 2007: Failed to join remote cluster gpfslx2
Mon Jun 25 12:47:18 2007: Command err 719: mount gpfslx2:gpfs66 27226

Perform these steps:


1. Verify that the remote cluster name reported by the mmremotefs show command is the same name
as reported by the mmlscluster command from one of the contact nodes.
2. Verify the list of contact nodes against the list of nodes as shown by the mmlscluster command from
the remote cluster.
In this example, the correct cluster name is gpfslx2.ibm.com and not gpfslx2

mmlscluster

Output is similar to this:

GPFS cluster information


========================
GPFS cluster name: gpfslx2.ibm.com
GPFS cluster id: 649437685184692490
GPFS UID domain: gpfslx2.ibm.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

GPFS cluster configuration servers:


-----------------------------------
Primary server: gpfslx2.ibm.com
Secondary server: (none)

Node Daemon node name IP address Admin node name Designation


---------------------------------------------------------------------------
1 gpfslx2 198.117.68.68 gpfslx2.ibm.com quorum

Contact nodes down or GPFS down on contact nodes


There are error messages that are displayed if the contact nodes are down or GPFS is down on the
contact nodes, and courses of action that you can take to correct the problem.
A mount command fails with a message similar to this:

GPFS: 6027-510 Cannot mount /dev/gpfs22 on /gpfs22: A remote host did not respond
within the timeout period.

The GPFS log has entries similar to this:

Mon Jun 25 13:11:14 2007: Command: mount gpfslx22:gpfs22 19004


Mon Jun 25 13:11:14 2007: Waiting to join remote cluster gpfslx22
Mon Jun 25 13:11:15 2007: Connecting to 199.13.68.4 gpfslx22
Mon Jun 25 13:16:36 2007: Failed to join remote cluster gpfslx22
Mon Jun 25 13:16:36 2007: Command err 78: mount gpfslx22:gpfs22 19004



To resolve the problem, use the mmremotecluster show command and verify that the cluster name
matches the remote cluster and the contact nodes are valid nodes in the remote cluster. Verify that GPFS
is active on the contact nodes in the remote cluster. Another way to resolve this problem is to change the
contact nodes using the mmremotecluster update command.
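For example, you can review the registered contact nodes on the mounting cluster and, if necessary, change them with commands similar to the following; the cluster and node names are illustrative:

mmremotecluster show all
mmremotecluster update gpfslx22 -n node1,node2

On the remote cluster, confirm that GPFS is active on the contact nodes:

mmgetstate -a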

GPFS is not running on the local node


There are error messages that are displayed if GPFS is not running on the local node, and courses
of action that you can take to correct the problem.
A mount command fails with a message similar to this:

mount: fs type gpfs not supported by kernel

Follow your procedures for starting GPFS on the local node.

The NSD disk does not have an NSD server specified, and the mounting
cluster does not have direct access to the disks
There are error messages that are displayed if the file system mount fails, and courses of
action that you can take to correct the problem.
A file system mount fails with a message similar to this:

Failed to open gpfs66.


No such device
mount: Stale NFS file handle
Some file system data are inaccessible at this time.
Check error log for additional information.
Cannot mount gpfslx2.ibm.com:gpfs66: Stale NFS file handle

The GPFS log contains information similar to this:

Mon Jun 25 14:10:46 2007: Command: mount gpfslx2.ibm.com:gpfs66 28147


Mon Jun 25 14:10:47 2007: Waiting to join remote cluster gpfslx2.ibm.com
Mon Jun 25 14:10:47 2007: Connecting to 199.13.68.4 gpfslx2
Mon Jun 25 14:10:47 2007: Connected to 199.13.68.4 gpfslx2
Mon Jun 25 14:10:47 2007: Joined remote cluster gpfslx2.ibm.com
Mon Jun 25 14:10:48 2007: Global NSD disk, gpfs1nsd, not found.
Mon Jun 25 14:10:48 2007: Disk failure. Volume gpfs66. rc = 19. Physical volume gpfs1nsd.
Mon Jun 25 14:10:48 2007: File System gpfs66 unmounted by the system with return code 19 reason
code 0
Mon Jun 25 14:10:48 2007: No such device
Mon Jun 25 14:10:48 2007: Command err 666: mount gpfslx2.ibm.com:gpfs66 28147

To resolve the problem, the cluster that owns and serves the file system must define one or more NSD
servers.

The cipherList option has not been set properly


A remote mount can fail because the cipherList value is not valid; the resulting error messages and the
courses of action that you can take to resolve the issue are described here.
Another reason for remote mount to fail is if cipherList is not set to a valid value. A mount command
would fail with messages similar to this:

6027-510 Cannot mount /dev/dqfs1 on /dqfs1: A remote host is not available.

The GPFS log would contain messages similar to this:

Wed Jul 18 16:11:20.496 2007: Command: mount remote.cluster:fs3 655494


Wed Jul 18 16:11:20.497 2007: Waiting to join remote cluster remote.cluster
Wed Jul 18 16:11:20.997 2007: Remote mounts are not enabled within this cluster. \
See the Advanced Administration Guide for instructions. In particular ensure keys have been \
generated and a cipherlist has been set.
Wed Jul 18 16:11:20.998 2007: A node join was rejected. This could be due to
incompatible daemon versions, failure to find the node
in the configuration database, or no configuration manager found.



Wed Jul 18 16:11:20.999 2007: Failed to join remote cluster remote.cluster
Wed Jul 18 16:11:20.998 2007: Command: err 693: mount remote.cluster:fs3 655494
Wed Jul 18 16:11:20.999 2007: Message failed because the destination node refused the
connection.

The mmchconfig cipherlist=AUTHONLY command must be run on both the cluster that owns and
controls the file system, and the cluster that is attempting to mount the file system.
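For example, you can display the current cipher list with the mmauth show command (its output includes a cipher list field for each cluster) and then set it; as noted above, the mmchconfig command must be run on both clusters:

mmauth show .
mmchconfig cipherlist=AUTHONLY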

Remote mounts fail with the "permission denied" error message


There are many reasons why remote mounts can fail with a "permission denied" error message.
Follow these steps to resolve permission denied problems:
1. Check with the remote cluster's administrator to make sure that the proper keys are in place. The
mmauth show command on both clusters help with this.
2. Check that the grant access for the remote mounts has been given on the remote cluster with the
mmauth grant command. Use the mmauth show command from the remote cluster to verify this.
3. Check that the file system access permission is the same on both clusters using the mmauth show
command and the mmremotefs show command. If a remote cluster is only allowed to do a read-only
mount (see the mmauth show command), the remote nodes must specify -o ro on their mount
requests (see the mmremotefs show command). If you try to do remote mounts with read/write (rw)
access for remote mounts that have read-only (ro) access, then you get a "permission denied" error.
For detailed information about the mmauth command and the mmremotefs command, see the mmauth
command and the mmremotefs command pages in the IBM Storage Scale: Command and Programming
Reference Guide.
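For example, on the cluster that owns the file system, the following commands grant read-only access to a remote cluster and display the result; the cluster and device names are illustrative:

mmauth grant remote.cluster -f /dev/fs1 -a ro
mmauth show all

On the remote cluster, the mount request must then match the granted access:

mmmount fs1 -o ro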

Unable to determine whether a file system is mounted


Certain GPFS file system commands cannot be performed when the file system in question is mounted.
In certain failure situations, GPFS cannot determine whether the file system in question is mounted or
not, and so cannot perform the requested command. In such cases, message 6027-1996 (Command was
unable to determine whether file system fileSystem is mounted) is issued.
If you encounter this message, perform problem determination, resolve the problem, and reissue the
command. If you cannot determine or resolve the problem, you may be able to successfully run the
command by first shutting down the GPFS daemon on all nodes of the cluster (using mmshutdown -a),
thus ensuring that the file system is not mounted.

GPFS error messages for file system mount status


The GPFS file system commands display an error message when they are unable to determine whether the
file system in question is mounted.
6027-1996
Command was unable to determine whether file system fileSystem is mounted.

Multiple file system manager failures


The correct operation of GPFS requires that one node per file system function as the file system manager
at all times. This instance of GPFS has additional responsibilities for coordinating usage of the file system.
When the file system manager node fails, another file system manager is appointed in a manner that is not
visible to applications except for the time required to switch over.
There are situations where it may be impossible to appoint a file system manager. Such situations involve
the failure of paths to disk resources from many, if not all, nodes. In this event, the cluster manager
nominates several host names to successively try to become the file system manager. If none succeed,
the cluster manager unmounts the file system everywhere. See “NSD and underlying disk subsystem
failures” on page 407.



The required action here is to address the underlying condition that caused the forced unmounts and
then remount the file system. In most cases, this means correcting the path to the disks required by
GPFS. If NSD disk servers are being used, the most common failure is the loss of access through the
communications network. If SAN access is being used to all disks, the most common failure is the loss of
connectivity through the SAN.

GPFS error messages for multiple file system manager failures


Certain GPFS error messages are displayed for multiple file system manager failures.
The inability to successfully appoint a file system manager after multiple attempts can be associated with
both the error messages listed in “File system forced unmount” on page 382, as well as these additional
messages:
• When a forced unmount occurred on all nodes:
6027-635 [E]
The current file system manager failed and no new manager is appointed.
• If message 6027-636 is displayed, it means that there may be a disk failure. See “NSD and underlying
disk subsystem failures” on page 407 for NSD problem determination and repair procedures.
6027-636 [E]
Disk marked as stopped or offline.
• Message 6027-632 is the last message in this series of messages. See the accompanying messages:
6027-632
Failed to appoint new manager for fileSystem.
• Message 6027-631 occurs on each attempt to appoint a new manager (see the messages on the
referenced node for the specific reason as to why it failed):
6027-631
Failed to appoint node nodeName as manager for fileSystem.
• Message 6027-638 indicates which node had the original error (probably the original file system
manager node):
6027-638 [E]
File system fileSystem unmounted by node nodeName

Error numbers specific to GPFS application calls when file system manager
appointment fails
Certain error numbers and messages are displayed when the file system manager appointment fails.
When the appointment of a file system manager is unsuccessful after multiple attempts, GPFS may report
these error numbers in error logs, or return them to an application:
ENO_MGR = 212, The current file system manager failed and no new manager could be appointed.
This usually occurs when a large number of disks are unavailable or when there has been a major
network failure. Run mmlsdisk to determine whether disks have failed and take corrective action if
they have by issuing the mmchdisk command.
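For example, for a hypothetical file system fs1, the following commands list the disks that are not up and ready, and then attempt to start any disks that are down:

mmlsdisk fs1 -e
mmchdisk fs1 start -a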

Discrepancy between GPFS configuration data and the on-disk data for a file system
There is an indication leading you to the conclusion that there may be a discrepancy between the GPFS
configuration data and the on-disk data for a file system.
You issue a disk command (for example, mmadddisk, mmdeldisk, or mmrpldisk) and receive the
message:



6027-1290
GPFS configuration data for file system fileSystem may not be in agreement with the on-disk data for
the file system. Issue the command:

mmcommon recoverfs fileSystem

Before a disk is added to or removed from a file system, a check is made that the GPFS configuration data
for the file system is in agreement with the on-disk data for the file system. The preceding message is
issued if this check was not successful. This may occur if an earlier GPFS disk command was unable to
complete successfully for some reason. Issue the mmcommon recoverfs command to bring the GPFS
configuration data into agreement with the on-disk data for the file system.
If running mmcommon recoverfs does not resolve the problem, follow the procedures in “Information to
be collected before contacting the IBM Support Center” on page 555, and then contact the IBM Support
Center.

Errors associated with storage pools, filesets and policies


There are certain error messages associated with the storage pools, filesets and policies.
When an error is suspected while working with storage pools, policies and filesets, check the relevant
section in the IBM Storage Scale: Administration Guide to ensure that your setup is correct.
When you are sure that your setup is correct, see if your problem falls into one of these categories:
• “A NO_SPACE error occurs when a file system is known to have adequate free space” on page 393
• “Negative values occur in the 'predicted pool utilizations', when some files are 'ill-placed'” on page 395
• “Policies - usage errors” on page 395
• “Errors encountered with policies” on page 396
• “Filesets - usage errors” on page 397
• “Errors encountered with filesets” on page 398
• “Storage pools - usage errors” on page 398
• “Errors encountered with storage pools” on page 399

A NO_SPACE error occurs when a file system is known to have adequate free
space
GPFS commands can display a NO_SPACE error even if a file system has free space; courses of action
that you can take to correct this issue are described here.
An ENOSPC (NO_SPACE) message can be returned even if a file system has remaining space. The
NO_SPACE error might occur even if the df command shows that the file system is not full.
The user might have a policy that writes data into a specific storage pool. When the user tries to create
a file in that storage pool, it returns the ENOSPC error if the storage pool is full. The user next issues the
df command, which indicates that the file system is not full, because the problem is limited to the one
storage pool in the user's policy. In order to see if a particular storage pool is full, the user must issue the
mmdf command.
The following is a sample scenario:
1. The user has a policy rule that says files whose name contains the word 'tmp' should be put into
storage pool sp1 in the file system fs1. This command displays the rule:

mmlspolicy fs1 -L

The system displays an output similar to this:

/* This is a policy for GPFS file system fs1 */



/* File Placement Rules */
RULE SET POOL 'sp1' WHERE name like '%tmp%'
RULE 'default' SET POOL 'system'
/* End of Policy */

2. The user moves a file from the /tmp directory to fs1 that has the word 'tmp' in the file name, meaning
data of tmpfile should be placed in storage pool sp1:

mv /tmp/tmpfile /fs1/

The system produces output similar to this:

mv: writing `/fs1/tmpfile': No space left on device

This is an out-of-space error.


3. This command shows storage information for the file system:

df |grep fs1

The system produces output similar to this:

/dev/fs1 280190976 140350976 139840000 51% /fs1

This output indicates that the file system is only 51% full.
4. To query the storage usage for an individual storage pool, the user must issue the mmdf command.

mmdf fs1

The system produces output similar to this:

disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system
gpfs1nsd            140095488     4001 yes      yes       139840000 (100%)         19936 ( 0%)
                -------------                         -------------------- -------------------
(pool total)        140095488                             139840000 (100%)         19936 ( 0%)

Disks in storage pool: sp1
gpfs2nsd            140095488     4001 no       yes               0 (  0%)           248 ( 0%)
                -------------                         -------------------- -------------------
(pool total)        140095488                                     0 (  0%)           248 ( 0%)

                =============                         ==================== ===================
(data)              280190976                             139840000 ( 50%)         20184 ( 0%)
(metadata)          140095488                             139840000 (100%)         19936 ( 0%)
                =============                         ==================== ===================
(total)             280190976                             139840000 ( 50%)         20184 ( 0%)

Inode Information
------------------
Number of used inodes:              74
Number of free inodes:          137142
Number of allocated inodes:     137216
Maximum number of inodes:       150016

In this case, the user sees that storage pool sp1 has 0% free space left and that is the reason for the
NO_SPACE error message.



5. To resolve the problem, the user must change the placement policy file to avoid putting data in a full
storage pool, delete some files in storage pool sp1, or add more space to the storage pool.
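For example, to stop new files from being placed in the full pool, the user could install a revised placement policy and then confirm the rules that are now in effect; the policy file name is illustrative:

mmchpolicy fs1 newpolicy.rules
mmlspolicy fs1 -L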

Negative values occur in the 'predicted pool utilizations', when some files
are 'ill-placed'
A scenario where ill-placed files may cause GPFS to produce a 'Predicted Pool Utilization' with a negative
value, and the course of action that you can take to resolve this issue.
This is a hypothetical situation where ill-placed files can cause GPFS to produce a 'Predicted Pool
Utilization' of a negative value.
Suppose that 2 GB of data from a 5 GB file named abc, that is supposed to be in the system storage pool,
are actually located in another pool. This 2 GB of data is said to be 'ill-placed'. Also, suppose that 3 GB of
this file are in the system storage pool, and no other file is assigned to the system storage pool.
If you run the mmapplypolicy command to schedule file abc to be moved from the system storage pool
to a storage pool named YYY, the mmapplypolicy command does the following:
1. Starts with the 'Current pool utilization' for the system storage pool, which is 3 GB.
2. Subtracts 5 GB, the size of file abc.
3. Arrives at a 'Predicted Pool Utilization' of negative 2 GB.
The mmapplypolicy command does not know how much of an 'ill-placed' file is currently in the wrong
storage pool and how much is in the correct storage pool.
When there are ill-placed files in the system storage pool, the 'Predicted Pool Utilization' can be any
positive or negative value. The positive value can be capped by the LIMIT clause of the MIGRATE rule.
The 'Current Pool Utilizations' should always be between 0% and 100%.

Policies - usage errors


Certain misunderstandings may be encountered while using policies; suggestions for avoiding such
mistakes follow.
The following are common mistakes and misunderstandings encountered when dealing with policies:
• You are advised to test your policy rules using the mmapplypolicy command with the -I test option.
Also consider specifying a test-subdirectory within your file system. Do not apply a policy to an entire
file system of vital files until you are confident that the rules correctly express your intentions. Even
then, you are advised to do a sample run with the mmapplypolicy -I test command using
the option -L 3 or higher, to better understand which files are selected as candidates, and which
candidates are chosen (see the example after this list). The -L flag of the mmapplypolicy command can be used to check a policy
before it is applied. For examples and more information on this flag, see “The mmapplypolicy -L
command” on page 319.
• There is a 1 MB limit on the total size of the policy file installed in GPFS.
• Ensure that all clocks on all nodes of the GPFS cluster are synchronized. Depending on the policies in
effect, variations in the clock times can cause unexpected behavior.
The mmapplypolicy command uses the time on the node on which it is run as the current time.
Policy rules may refer to a file's last access time or modification time, which is set by the node which
last accessed or modified the file. If the clocks are not synchronized, files may be treated as older or
younger than their actual age, and this could cause files to be migrated or deleted prematurely, or not at
all. A suggested solution is to use NTP to keep the clocks synchronized on all nodes in the cluster.
• The rules of a policy file are evaluated in order. A new file is assigned to the storage pool of the first rule
that it matches. If the file fails to match any rule, the file creation fails with an EINVAL error code. A
suggested solution is to put a DEFAULT clause as the last entry of the policy file.
• When a policy file is installed, GPFS verifies that the named storage pools exist. However, GPFS allows
an administrator to delete pools that are mentioned in the policy file. This allows more freedom for

recovery from hardware errors. Consequently, the administrator must be careful when deleting storage
pools referenced in the policy.
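The following sketch shows one way to test a policy before applying it. The file system path /gpfs/fs1,
the test directory /gpfs/fs1/testdir, the policy file policy.rules, and the pool name YYY are hypothetical
examples, not values from your cluster. A policy.rules file might contain:

RULE 'toYYY' SET POOL 'YYY' WHERE LOWER(NAME) LIKE '%.log'
RULE 'default' SET POOL 'system'

The final rule has no WHERE clause, so it acts as the DEFAULT placement entry that is recommended
earlier in this list. To evaluate the rules without moving or deleting any data, run:

mmapplypolicy /gpfs/fs1/testdir -P policy.rules -I test -L 3

The -L 3 output lists the candidate files and the rule that each file matched.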

Errors encountered with policies


How to analyze errors that might be encountered while dealing with policies.
These are errors that are encountered with policies and how to analyze them:
• Policy file never finishes, appears to be looping.
The mmapplypolicy command runs by making two passes over the file system - one over the inodes
and one over the directory structure. The policy rules are applied to each file to determine a list of
candidate files. The list is sorted by the weighting that is specified in the rules, then applied to the
file system. No file is ever moved more than once. However, due to the quantity of data involved, this
operation might take a long time and appear to be hung or looping.
The time required to run the mmapplypolicy command is a function of the number of files in the file
system, the current load on the file system, and on the node in which the mmapplypolicy command is
run. If the command appears not to finish, you might need to reduce the load on the file system or run the
mmapplypolicy command on a less loaded node in the cluster.
• Initial file placement is not correct.
The placement rules specify a single pool for initial placement. The first rule that matches the file's
attributes selects the initial pool. If that pool is incorrect, then the placement rules must be updated
to select a different pool. You can view the current placement rules by running the mmlspolicy -L
command. For existing files, the file can be moved to its desired pool by using the mmrestripefile or
the mmchattr command (see the sketch after this list).
For examples and more information on the mmlspolicy -L command, see “The mmapplypolicy -L
command” on page 319.
• Data migration, deletion, or exclusion is not working properly.
The mmapplypolicy command selects a list of candidate files to be migrated or deleted. The list is
sorted by the weighting factor that is specified in the rules, then applied to a sufficient number of files
on the candidate list to achieve the utilization thresholds specified by the pools. The actual migration
and deletion are done in parallel. The following are the possibilities for an incorrect operation:
– The file was not selected as a candidate for the expected rule. Each file is selected as a candidate for
only the first rule that matched its attributes. If the matched rule specifies an invalid storage pool,
the file is not moved. The -L 4 option on the mmapplypolicy command displays the details for
candidate selection and file exclusion.
– The file was a candidate, but was not operated on. Only the candidates necessary to achieve the
desired pool utilization are migrated. Using the -L 3 option displays more information on candidate
selection and files that are chosen for migration.
For more information on the mmlspolicy -L command, see “The mmapplypolicy -L command” on
page 319.
– The file was scheduled for migration but was not moved. In this case, the file is shown as 'ill-placed'
by the mmlsattr -L command, indicating that the migration did not succeed. This occurs if the
new storage pool assigned to the file did not have sufficient free space for the file when the actual
migration was attempted. Since migrations are done in parallel, it is possible that the target pool
had files that were also migrating, but had not yet been moved. If the target pool now has sufficient
free space, then files can be moved by using the mmrestripefs, mmrestripefile, mmchattr
commands.
• Asserts or error messages indicating a problem.
Some errors in the policy rule language can be checked only at run time. For example, a rule that causes a
divide by zero cannot be checked when the policy file is installed. Errors of this type generate an error
message and stop the policy evaluation for that file.

Note: I/O errors while migrating files indicate failing storage devices and must be addressed like any
other I/O error. The same is true for any file system error or panic that is encountered while migrating
files.
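The following sketch, referenced in the list above, shows how a file's pool assignment and placement
might be checked and corrected. The path /gpfs/fs1/datafile and the pool name YYY are hypothetical
examples:

mmlsattr -L /gpfs/fs1/datafile              # shows the assigned storage pool and any 'illplaced' flag
mmchattr -P YYY -I yes /gpfs/fs1/datafile   # reassigns the file to pool YYY and migrates its data
mmrestripefile -p /gpfs/fs1/datafile        # repairs placement if the file is already assigned correctly

Confirm the exact option names against the mmchattr and mmrestripefile command descriptions for your
release before running them.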

Filesets - usage errors


Common misunderstandings when dealing with filesets and the course of action to correct them.
These are common mistakes and misunderstandings that are encountered when dealing with filesets:
1. Fileset junctions look very much like ordinary directories, but they cannot be deleted by the usual
commands, such as the rm -r or rmdir commands. Using these commands on a fileset junction can
result in a Not owner message on an AIX system, or an Operation not permitted message on a
Linux system.
As a consequence, these commands might fail when applied to a directory that is a fileset junction.
Similarly, when the rm -r command is applied to a directory that contains a fileset junction, it fails as
well.

On the other hand, the rm -r command deletes all the files that are contained in the filesets that
are linked under the specified directory. Use the mmunlinkfileset command to remove fileset
junctions.
2. Files and directories cannot be moved from one fileset to another, and a hard link cannot cross fileset
boundaries.
If the user is unaware of the locations of fileset junctions, the mv and ln commands might fail
unexpectedly. In most cases, the mv command automatically compensates for this failure and uses a
combination of cp and rm to accomplish the desired result. Use the mmlsfileset command to view
the locations of fileset junctions. Use the mmlsattr -L command to determine the fileset for any
given file (see the sketch after this list).
3. Because a snapshot saves the contents of a fileset, deleting a fileset that is included in a snapshot
cannot completely remove the fileset.
The fileset is put into a 'deleted' state and continues to appear in the mmlsfileset command
output. After the last snapshot that contains the fileset is deleted, the fileset is automatically
removed. The mmlsfileset --deleted command indicates deleted filesets and shows their names
in parentheses.
4. Deleting a large fileset might take some time and might be interrupted by other failures, such as disk
errors or system crashes.
When this occurs, the recovery action leaves the fileset in a 'being deleted' state. Such a fileset might
not be linked into the namespace. The corrective action is to finish the deletion by reissuing the fileset
delete command:

mmdelfileset fs1 fsname1 -f

The mmlsfileset command identifies filesets in this state by displaying a status of 'Deleting'.
5. If you unlink a fileset that has other filesets that are linked below it, then any filesets that are linked
to it (that is, child filesets) become inaccessible. The child filesets remain linked to the parent and
become accessible again when the parent is relinked.
6. By default, the mmdelfileset command does not delete a fileset that is not empty.
To empty a fileset, first unlink all its immediate child filesets to remove their junctions from the fileset
to be deleted. Then, while the fileset itself is still linked, use the rm -rf or a similar command
to remove the rest of the contents of the fileset. The fileset can then be unlinked and deleted.
Alternatively, the fileset to be deleted can be unlinked first and then the mmdelfileset command can
be used with the -f (force) option. This unlinks its child filesets, then deletes the files and directories
that are contained in the fileset.
7. When a small dependent fileset is deleted, it might be faster to use the rm -rf command instead of
the mmdelfileset command with the -f option.
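As a rough sketch of the commands discussed in this list, assuming a file system named fs1 and a fileset
named fset1 (both hypothetical names):

mmlsfileset fs1 -L                 # list filesets, their junction paths, and their status
mmlsattr -L /gpfs/fs1/somefile     # show which fileset a given file belongs to
mmunlinkfileset fs1 fset1          # remove the junction instead of using rm -r or rmdir
mmdelfileset fs1 fset1 -f          # force-delete the fileset, including its contents

Confirm the option meanings against the command descriptions for your release before running them
against production data.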

Errors encountered with filesets
How to analyze errors that might be encountered while dealing with filesets.
The following errors might be encountered with filesets:
1. Problems can arise when running backup and archive utilities against a file system with unlinked
filesets. See the Filesets and backup topic in the IBM Storage Scale: Administration Guide for details.
2. In a rare case, if the mmfsck command encounters a serious error while checking the file system's
fileset metadata, it might not be possible to reconstruct the fileset name and comment. These cannot
be inferred from information elsewhere in the file system. If this happens, then the mmfsck command
creates a dummy name for the fileset, such as Fileset911 and the comment is set to the empty string.
3. Sometimes the mmfsck command encounters orphaned files or directories (those without a parent
directory), and traditionally these are reattached in a special directory that is called lost+found in the
file system root. However, when a file system contains multiple filesets, orphaned files and directories
are reattached in the lost+found directory in the root of the fileset to which they belong. For the
root fileset, this directory appears in the usual place, but other filesets might each have their own
lost+found directory.

Active file management fileset errors


When the mmafmctl Device getstate command displays a NeedsResync target or fileset state,
inconsistencies exist between the home and cache. To ensure that the cached data is synchronized with
the home and the fileset is returned to Active state, either the file system must be unmounted and
mounted, or the fileset must be unlinked and linked. Once this is done, the next update to fileset data
triggers an automatic synchronization of data from the cache to the home.
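A minimal sketch of this recovery, assuming a file system named fs1, a cache fileset named cacheFset,
and a junction path /gpfs/fs1/cacheFset (all hypothetical names):

mmafmctl fs1 getstate -j cacheFset                   # confirm the fileset reports NeedsResync
mmunlinkfileset fs1 cacheFset -f                     # unlink the fileset ...
mmlinkfileset fs1 cacheFset -J /gpfs/fs1/cacheFset   # ... and link it again
mmafmctl fs1 getstate -j cacheFset                   # verify that the state returns to Active after the next update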

Storage pools - usage errors


Common misunderstandings when dealing with storage pools and the course of action to correct them.
These are common mistakes and misunderstandings that are encountered when dealing with storage
pools:
1. Only the system storage pool is allowed to store metadata. All other pools must have the dataOnly
attribute.
2. Take care to create your storage pools with sufficient numbers of failure groups to enable the desired
level of replication.
When the file system is created, GPFS requires all of the initial pools to have at least as many failure
groups as defined by the default replication (-m and -r flags on the mmcrfs command). However, once
the file system has been created, the user can create a storage pool with fewer failure groups than the
default replication.
The mmadddisk command issues a warning, but it allows the disks to be added and the storage pool
defined. To use the new pool, the user must define a policy rule to create or migrate files into the new
pool. This rule should be defined to set an appropriate replication level for each file that is assigned to
the pool. If the replication level exceeds the number of failure groups in the storage pool, all files that
are assigned to the pool incur added overhead on each write to the file to mark the file as ill-replicated.
To correct the problem, add additional disks to the storage pool, defining a different failure group, or
ensure that all policy rules that assign files to the pool also set the replication appropriately.
3. GPFS does not permit the mmchdisk or mmrpldisk command to change a disk's storage pool
assignment. Changing the pool assignment requires all data residing on the disk to be moved to
another disk before the disk can be reassigned. Moving the data is a costly and time-consuming
operation; therefore, GPFS requires an explicit mmdeldisk command to move it, rather than moving it
as a side effect of another command.
4. Some storage pools allow larger disks to be added than do other storage pools.

When the file system is created, GPFS defines the maximum size disk that can be supported by using
the on-disk data structures to represent it. Likewise, when defining a new storage pool, the newly
created on-disk structures establish a limit on the maximum size disk that can be added to that pool.
To add disks that exceed the maximum size that is allowed by a storage pool, simply create a new pool
by using the larger disks.
The mmdf command can be used to find the maximum disk size allowed for a storage pool.
5. If you try to delete a storage pool when there are files that are still assigned to the pool, consider this:
A storage pool is deleted when all disks that are assigned to the pool are deleted. To delete the last
disk, all data that is residing in the pool must be moved to another pool. Likewise, any files assigned to
the pool, whether or not they contain data, must be reassigned to another pool. The easiest method for
reassigning all files and migrating all data is to use the mmapplypolicy command with a single rule to
move all data from one pool to another. You should also install a new placement policy that does not
assign new files to the old pool. Once all files have been migrated, reissue the mmdeldisk command
to delete the disk and the storage pool.
If all else fails, and you have a disk that has failed and cannot be recovered, follow the procedures
in “Information to be collected before contacting the IBM Support Center” on page 555, and then
contact the IBM Support Center for commands to allow the disk to be deleted without migrating all
data from it. Files with data left on the failed device lose that data. If the entire pool is deleted, any
existing files that are assigned to that pool are reassigned to a "broken" pool, which prevents writes to
the file until the file is reassigned to a valid pool.
6. Ill-placed files - understanding and correcting them.
The mmapplypolicy command migrates a file between pools by first assigning it to a new pool, then
moving the file's data. Until the existing data is moved, the file is marked as 'ill-placed' to indicate
that some of its data resides in its previous pool. In practice, mmapplypolicy assigns all files to be
migrated to their new pools, then it migrates all of the data in parallel. Ill-placed files indicate that the
mmapplypolicy or mmchattr command did not complete its last migration or that -I defer was
used.
To correct the placement of the ill-placed files, the file data needs to be migrated to the assigned
pools. You can use the mmrestripefs, or mmrestripefile commands to move the data.
7. Using the -P PoolName option on the mmrestripefs command (see the sketch after this list):
This option restricts the restripe operation to a single storage pool. For example, after adding a disk to
a pool, only the data in that pool needs to be restriped. In practice, -P PoolName simply restricts the
operation to the files assigned to the specified pool. Files assigned to other pools are not included in
the operation, even if the file is ill-placed and has data in the specified pool.
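The following sketch illustrates the commands mentioned in the preceding items. The file system name
fs1, the rule file drain.rules, and the pool names oldpool and newpool are hypothetical examples:

mmdf fs1 -P oldpool                       # show utilization and the maximum disk size for one pool

To drain a pool before deleting its disks, a single-rule policy file such as drain.rules might be applied:

RULE 'drain' MIGRATE FROM POOL 'oldpool' TO POOL 'newpool'

mmapplypolicy fs1 -P drain.rules -I yes   # migrate all data out of oldpool
mmrestripefs fs1 -b -P newpool            # rebalance only the files assigned to newpool

Verify the rule syntax and command options against the policy and mmrestripefs documentation for your
release before running them.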

Errors encountered with storage pools


How to analyze errors that may be encountered while dealing with storage pools.
These are errors that are encountered with storage pools and how to analyze them:
1. Access time to one pool appears slower than the others.
A consequence of striping data across the disks is that the I/O throughput is limited by the slowest
device. A device that is encountering or recovering from hardware errors may effectively
limit the throughput to all devices. However, when storage pools are used, striping is done only across the disks
assigned to the pool. Thus a slow disk impacts only its own pool; all other pools are not impeded.
To correct the problem, check the connectivity and error logs for all disks in the slow pool.
2. Other storage pool problems might really be disk problems and should be pursued from the standpoint
of making sure that your disks are properly configured and operational. See Chapter 25, “Disk issues,”
on page 407.

Snapshot problems
Use the mmlssnapshot command as a general hint for snapshot-related problems, to find out what
snapshots exist, and what state they are in. Use the mmsnapdir command to find the snapshot directory
name used to permit access.
The mmlssnapshot command displays the list of all snapshots of a file system. This command lists the
snapshot name, some attributes of the snapshot, as well as the snapshot's status. The mmlssnapshot
command does not require the file system to be mounted.
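For example, assuming a file system named fs1 (a hypothetical name), the following commands list the
existing snapshots and the configured snapshot directory name:

mmlssnapshot fs1
mmsnapdir fs1 -q

The -q option of mmsnapdir is assumed here to query the current snapshot directory settings; confirm it
against the command description for your release.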

Problems with locating a snapshot


Use the mmlssnapshot and mmsnapdir commands to find snapshot details and locate the snapshots.
The mmlssnapshot and mmsnapdir commands are provided to assist in locating the snapshots in the
file system directory structure. Only valid snapshots are visible in the file system directory structure.
They appear in a hidden subdirectory of the file system's root directory. By default the subdirectory
is named .snapshots. The valid snapshots appear as entries in the snapshot directory and may be
traversed like any other directory. The mmsnapdir command can be used to display the assigned
snapshot directory name.

Problems not directly related to snapshots


Some errors that are returned from the snapshot commands are not directly related to the snapshots
themselves.
Many errors returned from the snapshot commands are not specifically related to the snapshot. For
example, disk failures or node failures could cause a snapshot command to fail. The response to these
types of errors is to fix the underlying problem and try the snapshot command again.

GPFS error messages for indirect snapshot errors


Some GPFS error messages are associated with snapshot commands but do not show a clear relation to
snapshot issues.
The error messages for this type of problem do not have message numbers, but can be recognized by
their message text:
• 'Unable to sync all nodes, rc=errorCode.'
• 'Unable to get permission to create snapshot, rc=errorCode.'
• 'Unable to quiesce all nodes, rc=errorCode.'
• 'Unable to resume all nodes, rc=errorCode.'
• 'Unable to delete snapshot filesystemName from file system snapshotName, rc=errorCode.'
• 'Error restoring inode number, error errorCode.'
• 'Error deleting snapshot snapshotName in file system filesystemName, error errorCode.'
• 'commandString failed, error errorCode.'
• 'None of the nodes in the cluster is reachable, or GPFS is down on all of the nodes.'
• 'File system filesystemName is not known to the GPFS cluster.'

Snapshot usage errors


Certain GPFS error messages are related to snapshot usage restrictions or incorrect snapshot names.
Many errors returned from the snapshot commands are related to usage restrictions or incorrect snapshot
names.

An example of a snapshot restriction error is exceeding the maximum number of snapshots allowed at
one time. For simple errors of these types, you can determine the source of the error by reading the error
message or by reading the description of the command. You can also run the mmlssnapshot command
to see the complete list of existing snapshots.
Examples of incorrect snapshot name errors are trying to delete a snapshot that does not exist or trying
to create a snapshot using the same name as an existing snapshot. The rules for naming global and
fileset snapshots are designed to minimize conflicts between the file system administrator and the fileset
owners. These rules can result in errors when fileset snapshot names are duplicated across different
filesets or when the snapshot command -j option (specifying a qualifying fileset name) is provided or
omitted incorrectly. To resolve name problems, review the mmlssnapshot output with careful attention
to the Fileset column. You can also specify the -s or -j options of the mmlssnapshot command to
limit the output. For snapshot deletion, the -j option must exactly match the Fileset column.
For more information about snapshot naming conventions, see the mmcrsnapshot command in the IBM
Storage Scale: Command and Programming Reference Guide.
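As a sketch, assuming a file system named fs1, an independent fileset named fset1, and a snapshot
named snap1 (all hypothetical names), the -j option distinguishes fileset snapshots from global ones:

mmlssnapshot fs1 -j fset1          # list only the snapshots of fileset fset1
mmdelsnapshot fs1 snap1 -j fset1   # delete the fileset snapshot rather than a global snapshot of the same name

Omitting -j refers to a global snapshot, which is one way the -j option can be omitted incorrectly.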

GPFS error messages for snapshot usage errors


Certain error messages for snapshot usage errors have no error message numbers but may be recognized
using the message texts.
The error messages for this type of problem do not have message numbers, but can be recognized by
their message text:
• 'File system filesystemName does not contain a snapshot snapshotName, rc=errorCode.'
• 'Cannot create a new snapshot until an existing one is deleted. File system filesystemName has a limit of
number online snapshots.'
• 'Cannot restore snapshot. snapshotName is mounted on number nodes and in use on number nodes.'
• 'Cannot create a snapshot in a DM enabled file system, rc=errorCode.'

Snapshot status errors


Certain snapshot commands, such as mmdelsnapshot and mmrestorefs, can leave a snapshot in an
invalid state if they are interrupted while running.
Some snapshot commands like mmdelsnapshot and mmrestorefs may require a substantial amount
of time to complete. If the command is interrupted, say by the user or due to a failure, the snapshot
may be left in an invalid state. In many cases, the command must be completed before other snapshot
commands are allowed to run. The source of the error may be determined from the error message, the
command description, or the snapshot status available from mmlssnapshot.

GPFS error messages for snapshot status errors


Certain error messages for snapshot status errors have no error message numbers and may be recognized
by their message text only.
The error messages for this type of problem do not have message numbers, but can be recognized by
their message text:
• 'Cannot delete snapshot snapshotName which is snapshotState, error = errorCode.'
• 'Cannot restore snapshot snapshotName which is snapshotState, error = errorCode.'
• 'Previous snapshot snapshotName is invalid and must be deleted before a new snapshot may be
created.'
• 'Previous snapshot snapshotName must be restored before a new snapshot may be created.'
• 'Previous snapshot snapshotName is invalid and must be deleted before another snapshot may be
deleted.'
• 'Previous snapshot snapshotName is invalid and must be deleted before another snapshot may be
restored.'

• 'More than one snapshot is marked for restore.'
• 'Offline snapshot being restored.'

Snapshot directory name conflicts


A snapshot that is generated by the mmcrsnapshot command might not be accessible because of a
directory name conflict, and the course of action to correct the snapshot directory name conflict.
By default, all snapshots appear in a directory that is named .snapshots in the root directory of the file
system. This directory is dynamically generated when the first snapshot is created and continues to exist
even after the last snapshot is deleted. If the user tries to create the first snapshot while a normal file or
directory that is named .snapshots already exists, the mmcrsnapshot command is successful but the
snapshot might not be accessible.
There are two ways to fix this problem:
1. Delete or rename the existing file or directory.
2. Tell GPFS to use a different name for the dynamically generated directory of snapshots by running the
mmsnapdir command.
It is also possible to get a name conflict as a result of issuing the mmrestorefs command. Since the
mmsnapdir command allows changing the name of the dynamically generated snapshot directory, it
is possible that an older snapshot contains a normal file or directory that conflicts with the current
name of the snapshot directory. When this older snapshot is restored, the mmrestorefs command
recreates the old normal file or directory in the file system root directory. The mmrestorefs command
does not fail in this case, but the restored file or directory hides the existing snapshots. After the
mmrestorefs command is issued, it might appear that existing snapshots have disappeared. However,
the mmlssnapshot command still shows all existing snapshots.
The fix is similar to the one mentioned before. Perform one of these two steps:
1. After the mmrestorefs command completes, rename the conflicting file or directory that was
restored in the root directory.
2. Run the mmsnapdir command to select a different name for the dynamically generated snapshot
directory.
Finally, the mmsnapdir -a option enables a dynamically generated snapshot directory in every directory,
not just the file system root. This allows each user quick access to snapshots of their own files by going
into .snapshots in their home directory or any other of their directories.
Unlike .snapshots in the file system root, .snapshots in other directories is invisible, that is, the ls
-a command does not list .snapshots. This is intentional because recursive file system utilities such
as find, du or ls -R would otherwise either fail or produce incorrect or undesirable results. The user
must explicitly specify the name of the snapshot directory to access the snapshots. For example, ls
~/.snapshots. If there is a name conflict (that is, a normal file or directory that is named .snapshots
already exists in the user's home directory), the user must rename the existing file or directory.
The inode numbers that are used for and within these special .snapshots directories are constructed
dynamically and do not follow the standard rules. These inode numbers are visible to applications through
standard commands, such as stat, readdir, or ls. The inode numbers that are reported for these
directories can also be reported differently on different operating systems. Applications should not expect
consistent numbering for such inodes.
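A sketch of the mmsnapdir options described in this topic, assuming a file system named fs1 and a new
directory name .gpfssnaps (both hypothetical):

mmsnapdir fs1 -q                 # display the current snapshot directory settings
mmsnapdir fs1 -s .gpfssnaps      # use a different name for the dynamically generated directory
mmsnapdir fs1 -a                 # also show the snapshot directory in every directory
mmsnapdir fs1 -r                 # revert to showing it only in the file system root

Confirm these options against the mmsnapdir command description for your release.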

Errors encountered when restoring a snapshot


There are errors which may be displayed while restoring a snapshot.
The following errors might be encountered when restoring from a snapshot:
• The mmrestorefs command fails with an ENOSPC message. In this case, there are not enough free
blocks in the file system to restore the selected snapshot. You can add space to the file system by
adding a new disk. As an alternative, you can delete a different snapshot from the file system to free

some existing space. You cannot delete the snapshot that is being restored. After there is additional free
space, issue the mmrestorefs command again.
• The mmrestorefs command fails with quota exceeded errors. Try adjusting the quota configuration or
disabling quota, and then issue the command again.
• The mmrestorefs command is interrupted and some user data is not restored completely. Try
repeating the mmrestorefs command in this instance.
• The mmrestorefs command fails because of an incorrect file system, fileset, or snapshot name. To fix
this error, issue the command again with the correct name.
• The mmrestorefs -j command fails with the following error:
6027-953
Failed to get a handle for fileset filesetName, snapshot snapshotName in file system fileSystem.
errorMessage.
In this case, the file system that contains the snapshot to restore should be mounted, and then the
fileset of the snapshot should be linked.
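For the 6027-953 case, a minimal recovery sketch, assuming a file system named fs1, a fileset named
fset1, a junction path /gpfs/fs1/fset1, and a snapshot named snap1 (all hypothetical names):

mmmount fs1 -a                               # mount the file system on all nodes
mmlinkfileset fs1 fset1 -J /gpfs/fs1/fset1   # link the fileset of the snapshot
mmrestorefs fs1 snap1 -j fset1               # retry the fileset snapshot restore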
If you encounter additional errors that cannot be resolved, contact the IBM Support Center.

Failures using the mmbackup command


Use the mmbackup command to back up the files in a GPFS file system to storage on an IBM Storage
Protect server. A number of factors can cause mmbackup to fail.
The most common of these are:
• The file system is not mounted on the node issuing the mmbackup command.
• The file system is not mounted on the IBM Storage Protect client nodes.
• The mmbackup command was issued to back up a file system owned by a remote cluster.
• The IBM Storage Protect clients are not able to communicate with the IBM Storage Protect server due
to authorization problems.
• The IBM Storage Protect server is down or out of storage space.
• When the target of the backup is tape, the IBM Storage Protect server may be unable to handle all of the
backup client processes because the value of the IBM Storage Protect server's MAXNUMMP parameter
is set lower than the number of client processes. This failure is indicated by message ANS1312E from
IBM Storage Protect.
The errors from mmbackup normally indicate the underlying problem.
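A quick sketch for checking the first two causes, assuming a file system named fs1 that is mounted
at /gpfs/fs1 (hypothetical names):

mmlsmount fs1 -L                    # verify where the file system is mounted, including the IBM Storage Protect client nodes
mmbackup /gpfs/fs1 -t incremental   # rerun the backup from a node that mounts the file system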

GPFS error messages for mmbackup errors


Error messages that are displayed for mmbackup errors.
6027-1995
Device deviceName is not mounted on node nodeName.

IBM Storage Protect error messages


Error message displayed for server media mount.
ANS1312E
Server media mount not possible.

Data integrity
GPFS takes extraordinary care to maintain the integrity of customer data. However, in case of certain
hardware failures or unusual circumstances, the occurrence of a programming error can cause the loss of
data in a file system.
GPFS performs extensive checking to validate metadata and stops using the file system if metadata
becomes inconsistent. This can appear in two ways:
1. The file system is unmounted and applications begin seeing ESTALE return codes to file operations.
2. Error log entries that indicate MMFS_SYSTEM_UNMOUNT and a corruption error is generated.
If actual disk data corruption occurs, this error appears on each node in succession. Before proceeding
with the following steps, follow the procedures in “Information to be collected before contacting the IBM
Support Center” on page 555, and then contact the IBM Support Center.
1. Examine the error logs on the NSD servers for any indication of a disk error that is reported.
2. Take appropriate disk problem determination and repair actions before continuing.
3. After completing any required disk repair actions, run the offline version of the mmfsck command on
the file system (see the sketch after this list).
4. If your error log or disk analysis tool indicates that specific disk blocks are in error, use the mmfileid
command to determine which files are on damaged areas of the disk, and then restore these files. For
more information, see “The mmfileid command” on page 327.
5. If data corruption errors occur in only one node, it is probable that memory structures within the node
are corrupted. In this case, the file system is probably good but a program error exists in GPFS or
another authorized program with access to GPFS data structures.
Follow the directions in “Data integrity” on page 404 and then restart the node. This should clear the
problem. If the problem repeats on one node without affecting other nodes, check the programming
specifications code levels to determine that they are current and compatible, and that no hardware
errors were reported. Refer to the IBM Storage Scale: Concepts, Planning, and Installation Guide for
correct software levels.
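A sketch of the offline check in step 3, assuming a file system named fs1 (a hypothetical name). The file
system must be unmounted on all nodes before the offline mmfsck is run:

mmumount fs1 -a        # unmount the file system everywhere
mmfsck fs1             # report inconsistencies without repairing them
mmfsck fs1 -y          # repair inconsistencies, normally only after consulting the IBM Support Center

Follow this with the mmfileid procedure in step 4 if specific disk blocks are known to be damaged.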

Error numbers specific to GPFS application calls when data integrity may be
corrupted
If there is a possibility of the corruption of data integrity, GPFS displays specific error messages or returns
them to the application.
When there is the possibility of data corruption, GPFS may report these error numbers in the operating
system error log, or return them to an application:
EVALIDATE=214, Invalid checksum or other consistency check failure on disk data structure.
This indicates that internal checking has found an error in a metadata structure. The severity of the
error depends on which data structure is involved. The cause of this is usually GPFS software, disk
hardware or other software between GPFS and the disk. Running mmfsck should repair the error. The
urgency of this depends on whether the error prevents access to some file or whether basic metadata
structures are involved.

Messages requeuing in AFM


The course of action to be followed for resolving requeued messages on the gateway node.
Sometimes requests in the AFM message queue on the gateway node get requeued because of errors at
home. For example, if there is no space at home to perform a new write, a write message that is queued
is not successful and gets requeued. The administrator would see the failed message getting requeued in
the queue on the gateway node. The administrator has to resolve the issue by adding more space at home
and running the mmafmctl resumeRequeued command, so that the requeued messages are executed
at home again. If mmafmctl resumeRequeued is not run by an administrator, AFM would still execute
the message in the regular order of message executions from cache to home.

Running the mmfsadm dump afm all command on the gateway node shows the queued messages.
Requeued messages show in the dumps similar to the following example:

c12c4apv13.gpfs.net: Normal Queue: (listed by execution order) (state: Active)


c12c4apv13.gpfs.net: Write [612457.552962] requeued file3 (43 @ 293) chunks 0 bytes 0 0
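To push the requeued messages again after the problem at home is fixed, assuming a file system named
fs1 and a cache fileset named cacheFset (hypothetical names):

mmafmctl fs1 resumeRequeued -j cacheFset

The queue can then be rechecked with mmfsadm dump afm all on the gateway node to confirm that the
requeued entries have drained.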

NFSv4 ACL problems


The analysis of NFS V4 issues and suggestions to resolve these issues.
Before analyzing an NFS V4 problem, review this documentation to determine if you are using NFS V4
ACLs and GPFS correctly:
1. The NFS Version 4 Protocol paper and other related information that are available in the Network
File System Version 4 (nfsv4) section of the IETF Datatracker website (datatracker.ietf.org/wg/nfsv4/
documents).
2. The Managing GPFS access control lists and NFS export topic in the IBM Storage Scale: Administration
Guide.
3. The GPFS exceptions and limitations to NFS V4 ACLs topic in the IBM Storage Scale: Administration
Guide.
The commands mmdelacl and mmputacl can be used to revert an NFS V4 ACL to a traditional ACL. Use
the mmdelacl command to remove the ACL, leaving access controlled entirely by the permission bits in
the mode. Then use the chmod command to modify the permissions, or the mmputacl and mmeditacl
commands to assign a new ACL.
For files, the mmputacl and mmeditacl commands can be used at any time (without first issuing the
mmdelacl command) to assign any type of ACL. The command mmeditacl -k posix provides a
translation of the current ACL into traditional POSIX form and can be used to more easily create an ACL to
edit, instead of having to create one from scratch.
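A sketch of reverting a file from an NFS V4 ACL to a traditional ACL, assuming a file at /gpfs/fs1/datafile
(a hypothetical path):

mmgetacl /gpfs/fs1/datafile             # display the current ACL and its type
mmdelacl /gpfs/fs1/datafile             # remove the NFS V4 ACL; the mode bits now control access
chmod 750 /gpfs/fs1/datafile            # adjust the permission bits as needed
mmeditacl -k posix /gpfs/fs1/datafile   # optionally build a new traditional ACL from the current one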

Chapter 25. Disk issues
GPFS uses only disk devices prepared as Network Shared Disks (NSDs). However NSDs might exist on top
of a number of underlying disk technologies.
NSDs, for example, might be defined on top of Fibre Channel SAN connected disks. This information
provides detail on the creation, use, and failure of NSDs and their underlying disk technologies.
These are some of the errors encountered with GPFS disks and NSDs:
• “NSD and underlying disk subsystem failures” on page 407
• “GPFS has declared NSDs built on top of AIX logical volumes as down” on page 413
• “Disk accessing commands fail to complete due to problems with some non-IBM disks” on page 415
• “Persistent Reserve errors” on page 425
• “GPFS is not using the underlying multipath device” on page 428

NSD and underlying disk subsystem failures


There are indications that will lead you to the conclusion that your file system has disk failures.
Some of those indications include:
• Your file system has been forced to unmount. For more information about forced file system unmount,
see “File system forced unmount” on page 382.
• The mmlsmount command indicates that the file system is not mounted on certain nodes.
• Your application is getting EIO errors.
• Operating system error logs indicate you have stopped using a disk in a replicated system, but your
replication continues to operate.
• The mmlsdisk command shows that disks are down.
Note: If you are reinstalling the operating system on one node and erasing all partitions from the system,
GPFS descriptors will be removed from any NSD this node can access locally. The results of this action
might require recreating the file system and restoring from backup. If you experience this problem, do not
unmount the file system on any node that is currently mounting the file system. Contact the IBM Support
Center immediately to see if the problem can be corrected.

Errors encountered while creating and using NSD disks


Use the mmcrnsd command to prepare NSD disks. While preparing the NSD disks, several error
conditions might be encountered.
GPFS requires that disk devices be prepared as NSDs. This is done using the mmcrnsd command. The
input to the mmcrnsd command is given in the form of disk stanzas. For a complete explanation of disk
stanzas, see the Stanza files section in the IBM Storage Scale: Administration Guide, and the following
topics from the IBM Storage Scale: Command and Programming Reference Guide:
• mmchdisk command
• mmchnsd command
• mmcrfs command
• mmcrnsd command
For disks that are SAN-attached to all nodes in the cluster, device=DiskName should refer to the disk
device name in /dev on the node where the mmcrnsd command is issued. If a server list is specified,
device=DiskName must refer to the name of the disk on the first server node. The same disk can have
different local names on different nodes.

When you specify an NSD server node, that node performs all disk I/O operations on behalf of nodes
in the cluster that do not have connectivity to the disk. You can also specify up to eight additional NSD
server nodes. These additional NSD servers will become active if the first NSD server node fails or is
unavailable.
When the mmcrnsd command encounters an error condition, one of these messages is displayed:
6027-2108
Error found while processing stanza
or
6027-1636
Error found while checking disk descriptor descriptor
Usually, this message is preceded by one or more messages describing the error more specifically.
Another possible error from mmcrnsd is:
6027-2109
Failed while processing disk stanza on node nodeName.
or
6027-1661
Failed while processing disk descriptor descriptor on node nodeName.
One of these errors can occur if an NSD server node does not have read and write access to the disk.
The NSD server node needs to write an NSD volume ID to the raw disk. If an additional NSD server node
is specified, that NSD server node will scan its disks to find this NSD volume ID string. If the disk is
SAN-attached to all nodes in the cluster, the NSD volume ID is written to the disk by the node on which
the mmcrnsd command is running.

Displaying NSD information


Use the mmlsnsd command to display NSD information and analyze cluster details pertaining to
NSDs.
Use the mmlsnsd command to display information about the currently defined NSDs in the cluster. For
example, if you issue mmlsnsd, your output may be similar to this:

File system Disk name NSD servers


---------------------------------------------------------------------------
fs1 t65nsd4b (directly attached)
fs5 t65nsd12b c26f4gp01.ppd.pok.ibm.com,c26f4gp02.ppd.pok.ibm.com
fs6 t65nsd13b
c26f4gp01.ppd.pok.ibm.com,c26f4gp02.ppd.pok.ibm.com,c26f4gp03.ppd.pok.ibm.com

This output shows that:


• There are three NSDs in this cluster: t65nsd4b, t65nsd12b, and t65nsd13b.
• NSD disk t65nsd4b of file system fs1 is SAN-attached to all nodes in the cluster.
• NSD disk t65nsd12b of file system fs5 has 2 NSD server nodes.
• NSD disk t65nsd13b of file system fs6 has 3 NSD server nodes.
If you need to find out the local device names for these disks, you could use the -m option on the
mmlsnsd command. For example, issuing:

mmlsnsd -m

produces output similar to this example:

Disk name NSD volume ID Device Node name Remarks


-----------------------------------------------------------------------------------------
t65nsd12b 0972364D45EF7B78 /dev/hdisk34 c26f4gp01.ppd.pok.ibm.com server node
t65nsd12b 0972364D45EF7B78 /dev/hdisk34 c26f4gp02.ppd.pok.ibm.com server node
t65nsd12b 0972364D45EF7B78 /dev/hdisk34 c26f4gp04.ppd.pok.ibm.com
t65nsd13b 0972364D00000001 /dev/hdisk35 c26f4gp01.ppd.pok.ibm.com server node

t65nsd13b 0972364D00000001 /dev/hdisk35 c26f4gp02.ppd.pok.ibm.com server node
t65nsd13b 0972364D00000001 - c26f4gp03.ppd.pok.ibm.com (not found) server
node
t65nsd4b 0972364D45EF7614 /dev/hdisk26 c26f4gp04.ppd.pok.ibm.com

From this output we can tell that:


• The local disk name for t65nsd12b on NSD server c26f4gp01 is hdisk34.
• NSD disk t65nsd13b is not attached to the node on which the mmlsnsd command was issued,
node c26f4gp04.
• The mmlsnsd command was not able to determine the local device for NSD disk t65nsd13b on the
c26f4gp03 server.
To find the nodes to which disk t65nsd4b is attached and the corresponding local devices for that disk,
issue:

mmlsnsd -d t65nsd4b -M

Output is similar to this example:

Disk name NSD volume ID Device Node name Remarks


-----------------------------------------------------------------------------------------
t65nsd4b 0972364D45EF7614 /dev/hdisk92 c26f4gp01.ppd.pok.ibm.com
t65nsd4b 0972364D45EF7614 /dev/hdisk92 c26f4gp02.ppd.pok.ibm.com
t65nsd4b 0972364D45EF7614 - c26f4gp03.ppd.pok.ibm.com (not found)
directly attached
t65nsd4b 0972364D45EF7614 /dev/hdisk26 c26f4gp04.ppd.pok.ibm.com

From this output we can tell that NSD t65nsd4b is:


• Known as hdisk92 on node c26f4gp01 and c26f4gp02.
• Known as hdisk26 on node c26f4gp04
• Is not attached to node c26f4gp03
To display extended information about a node's view of its NSDs, the mmlsnsd -X command can be
used:

mmlsnsd -X -d "hd3n97;sdfnsd;hd5n98"

The system displays information similar to:

Disk name NSD volume ID Device Devtype Node name Remarks


------------------------------------------------------------------------------------------------
---
hd3n97 0972846145C8E927 /dev/hdisk3 hdisk c5n97g.ppd.pok.ibm.com server
node,pr=no
hd3n97 0972846145C8E927 /dev/hdisk3 hdisk c5n98g.ppd.pok.ibm.com server
node,pr=no
hd5n98 0972846245EB501C /dev/hdisk5 hdisk c5n97g.ppd.pok.ibm.com server
node,pr=no
hd5n98 0972846245EB501C /dev/hdisk5 hdisk c5n98g.ppd.pok.ibm.com server
node,pr=no
sdfnsd 0972845E45F02E81 /dev/sdf generic c5n94g.ppd.pok.ibm.com server node
sdfnsd 0972845E45F02E81 /dev/sdm generic c5n96g.ppd.pok.ibm.com server node

From this output we can tell that:


• Disk hd3n97 is an hdisk known as /dev/hdisk3 on NSD server node c5n97 and c5n98.
• Disk sdfnsd is a generic disk known as /dev/sdf and /dev/sdm on NSD server node c5n94g and
c5n96g, respectively.
• In addition to the preceding information, the NSD volume ID is displayed for each disk.
Note: The -m, -M and -X options of the mmlsnsd command can be very time consuming, especially on
large clusters. Use these options judiciously.

Disk device name is an existing NSD name
Learn how to respond to an NSD creation error message in which the device name is an existing NSD
name.
When you run the mmcrnsd command to create an NSD, the command might display an error message
saying that a DiskName value that you specified refers to an existing NSD name.
This type of error message indicates one of the following situations:
• The disk is an existing NSD.
• The disk is a previous NSD that was removed from the cluster with the mmdelnsd command but is not
yet marked as available.
In the second situation, you can override the check by running the mmcrnsd command again with the -v no
option. Do not take this step unless you are sure that another cluster is not using this disk. Enter the
following command:

mmcrnsd -F StanzaFile -v no

A possible cause for the NSD creation error message is that a previous mmdelnsd command failed to
zero internal data structures on the disk, even though the disk is functioning correctly. To complete the
deletion, run the mmdelnsd command with the -p NSDId option. Do not take this step unless you are
sure that another cluster is not using this disk. The following command is an example:

mmdelnsd -p NSDId -N Node

GPFS has declared NSDs as down


GPFS reactions to NSD failures and the recovery procedure.
There are several situations in which disks can appear to fail to GPFS. Almost all of these situations
involve a failure of the underlying disk subsystem. The following information describes how GPFS reacts
to these failures and how to find the cause.
GPFS will stop using a disk that is determined to have failed. This event is marked as MMFS_DISKFAIL in
an error log entry (see “Operating system error logs” on page 274). The state of a disk can be checked by
issuing the mmlsdisk command.
The consequences of stopping disk usage depend on what is stored on the disk:
• Certain data blocks may be unavailable because the data residing on a stopped disk is not replicated.
• Certain data blocks may be unavailable because the controlling metadata resides on a stopped disk.
• In conjunction with other disks that have failed, all copies of critical data structures may be unavailable
resulting in the unavailability of the entire file system.
The disk will remain unavailable until its status is explicitly changed through the mmchdisk command.
After that command is issued, any replicas that exist on the failed disk are updated before the disk is
used.
GPFS can declare disks down for a number of reasons:
• If the first NSD server goes down and additional NSD servers were not assigned, or all of the additional
NSD servers are also down and no local device access is available on the node, the disks are marked as
stopped.
• A failure of an underlying disk subsystem may result in a similar marking of disks as stopped.
1. Issue the mmlsdisk command to verify the status of the disks in the file system.
2. Issue the mmchdisk command with the -a option to start all stopped disks.
• Disk failures should be accompanied by error log entries (see The operating system error log facility) for
the failing disk. GPFS error log entries labelled MMFS_DISKFAIL will occur on the node detecting the
error. This error log entry will contain the identifier of the failed disk. Follow the problem determination

and repair actions that are specified in your disk vendor problem determination guide. After performing problem
determination and repair, issue the mmchdisk command to bring the disk back up, as shown in the sketch that follows.

Unable to access disks


Access to the disk might be restricted due to incorrect disk specification or configuration failure during
disk subsystem initialization.
If you cannot open a disk, the specification of the disk may be incorrect. It is also possible that a
configuration failure may have occurred during disk subsystem initialization. For example, on Linux you
should consult /var/log/messages to determine if disk device configuration errors have occurred.

Feb 16 13:11:18 host123 kernel: SCSI device sdu: 35466240 512-byte hdwr sectors (18159 MB)
Feb 16 13:11:18 host123 kernel: sdu: I/O error: dev 41:40, sector 0
Feb 16 13:11:18 host123 kernel: unable to read partition table

On AIX, consult “Operating system error logs” on page 274 for hardware configuration error log entries.
Accessible disk devices will generate error log entries similar to this example for an SSA device:

--------------------------------------------------------------------------
LABEL: SSA_DEVICE_ERROR
IDENTIFIER: FE9E9357

Date/Time: Wed Sep 8 10:28:13 edt


Sequence Number: 54638
Machine Id: 000203334C00
Node Id: c154n09
Class: H
Type: PERM
Resource Name: pdisk23
Resource Class: pdisk
Resource Type: scsd
Location: USSA4B33-D3
VPD:
Manufacturer................IBM
Machine Type and Model......DRVC18B
Part Number.................09L1813
ROS Level and ID............0022
Serial Number...............6800D2A6HK
EC Level....................E32032
Device Specific.(Z2)........CUSHA022
Device Specific.(Z3)........09L1813
Device Specific.(Z4)........99168

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE

Failure Causes
DISK DRIVE

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ERROR CODE
2310 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------

or this one from GPFS:

---------------------------------------------------------------------------
LABEL: MMFS_DISKFAIL
IDENTIFIER: 9C6C05FA

Date/Time: Tue Aug 3 11:26:34 edt


Sequence Number: 55062
Machine Id: 000196364C00
Node Id: c154n01
Class: H
Type: PERM
Resource Name: mmfs
Resource Class: NONE

Resource Type: NONE
Location:

Description
DISK FAILURE

Probable Causes
STORAGE SUBSYSTEM
DISK

Failure Causes
STORAGE SUBSYSTEM
DISK

Recommended Actions
CHECK POWER
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE

Detail Data
EVENT CODE
1027755
VOLUME
fs3
RETURN CODE
19
PHYSICAL VOLUME
vp31n05
-----------------------------------------------------------------

Guarding against disk failures


Protection methods to guard against data loss due to disk media failure.
There are various ways to guard against the loss of data due to disk media failures. For example, the use
of a RAID controller, which masks disk failures with parity disks, or a twin-tailed disk, could prevent the
need for using these recovery steps.
GPFS offers a method of protection that is called replication, which overcomes disk failure at the expense
of extra disk space. GPFS allows replication of data and metadata. This means that up to three instances
of data, metadata, or both can be automatically created and maintained for any file in a GPFS file
system. If one instance becomes unavailable due to disk failure, another instance is used instead. You
can set different replication specifications for each file, or apply default settings that are specified at file
system creation. Refer to the File system replication parameters topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.

Disk connectivity failure and recovery


GPFS has certain error messages defined for local connection failures from NSD servers.
If a disk is defined to have a local connection and to be connected to defined NSD servers, and the local
connection fails, GPFS bypasses the broken local connection and uses the NSD servers to maintain disk
access. The following error message appears in the GPFS log:
6027-361 [E]
Local access to disk failed with EIO, switching to access the disk remotely.
This is the default behavior, and can be changed with the useNSDserver file system mount option. See
the NSD server considerations topic in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
For a file system using the default mount option useNSDserver=asneeded, disk access fails over from
local access to remote NSD access. Once local access is restored, GPFS detects this fact and switches
back to local access. The detection and switch over are not instantaneous, but occur at approximately five
minute intervals.
Note: In general, after fixing the path to a disk, you must run the mmnsddiscover command on the
server that lost the path to the NSD. (Until the mmnsddiscover command is run, the reconnected node
will see its local disks and start using them by itself, but it will not act as the NSD server.)

After that, you must run the command on all client nodes that need to access the NSD on that server; or
you can achieve the same effect with a single mmnsddiscover invocation if you utilize the -N option to
specify a node list that contains all the NSD servers and clients that need to rediscover paths.
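A sketch of the rediscovery step, assuming an NSD named gpfs1nsd, its server nsdserver1, and a client
node client1 (all hypothetical names):

mmnsddiscover -d gpfs1nsd -N nsdserver1,client1   # rediscover paths to one NSD on the listed nodes
mmnsddiscover -a -N all                           # or rediscover all NSDs on all nodes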

Partial disk failure


How to recover from a partial disk failure when you have chosen not to implement hardware protection
against media failures.
If the disk has only partially failed and you have chosen not to implement hardware protection against
media failures, the steps to restore your data depend on whether you have used replication. If you
have replicated neither your data nor metadata, you will need to issue the offline version of the mmfsck
command, and then restore the lost information from the backup media. If it is just the data which was
not replicated, you will need to restore the data from the backup media. There is no need to run the
mmfsck command if the metadata is intact.
If both your data and metadata have been replicated, implement these recovery actions:
1. Unmount the file system:

mmumount fs1 -a

2. Delete the disk from the file system:

mmdeldisk fs1 gpfs10nsd -c

3. If you are replacing the disk, add the new disk to the file system:

mmadddisk fs1 gpfs11nsd

4. Then restripe the file system:

mmrestripefs fs1 -b

Note: Ensure there is sufficient space elsewhere in your file system for the data to be stored by using
the mmdf command.

GPFS has declared NSDs built on top of AIX logical volumes as down


Earlier releases of GPFS allowed AIX logical volumes to be used in GPFS file systems. Using AIX logical
volumes in GPFS file systems is now discouraged as they are limited with regard to their clustering ability
and cross platform support.
Existing file systems using AIX logical volumes are however still supported, and this information might be
of use when working with those configurations.

Verify whether the logical volumes are properly defined


Logical volumes must be properly configured to map between the NSD and the underlying disks.
To verify the logical volume configuration, issue the following command:

mmlsnsd -m

The system displays any underlying physical device present on this node, which is backing the NSD. If the
underlying device is a logical volume, issue the following command to map from the logical volume to the
volume group.

lsvg -o | lsvg -i -l

The system displays a list of logical volumes and corresponding volume groups. Now, issue the lsvg
command for the volume group that contains the logical volume. For example:

lsvg gpfs1vg

The system displays information similar to the following example:

VOLUME GROUP: gpfs1vg VG IDENTIFIER: 000195600004c00000000ee60c66352


VG STATE: active PP SIZE: 16 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 542 (8672 megabytes)
MAX LVs: 256 FREE PPs: 0 (0 megabytes)
LVs: 1 USED PPs: 542 (8672 megabytes)
OPEN LVs: 1 QUORUM: 2
TOTAL PVs: 1 VG DESCRIPTORS: 2
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 1 AUTO ON: no
MAX PPs per PV: 1016 MAX PVs: 32
LTG size: 128 kilobyte(s) AUTO SYNC: no
HOT SPARE: no

Check the volume group on each node


All the disks in the GPFS cluster must be properly defined to all the nodes.
Make sure that all disks are properly defined to all nodes in the GPFS cluster:
1. Issue the AIX lspv command on all nodes in the GPFS cluster and save the output.
2. Compare the pvid and volume group fields for all GPFS volume groups.
Each volume group must have the same pvid and volume group name on each node. The hdisk name
for these disks may vary.
For example, to verify the volume group gpfs1vg on the five nodes in the GPFS cluster, for each node in
the cluster issue:

lspv | grep gpfs1vg

The system displays information similar to:

k145n01: hdisk3 00001351566acb07 gpfs1vg active


k145n02: hdisk3 00001351566acb07 gpfs1vg active
k145n03: hdisk5 00001351566acb07 gpfs1vg active
k145n04: hdisk5 00001351566acb07 gpfs1vg active
k145n05: hdisk7 00001351566acb07 gpfs1vg active

Here the output shows that on each of the five nodes the volume group gpfs1vg is the same physical disk
(has the same pvid). The hdisk numbers vary, but the fact that they may be called different hdisk names
on different nodes has been accounted for in the GPFS product. This is an example of a properly defined
volume group.
If any of the pvids were different for the same volume group, this would indicate that the same volume
group name has been used when creating volume groups on different physical volumes. This will not work
for GPFS. A volume group name can be used only for the same physical volume shared among nodes in
a cluster. For more information, refer to AIX in IBM Documentation and search for operating system and
device management.

Volume group varyon problems


Use the varyoffvg command for the volume group at all nodes to correct varyonvg issues at the volume
group layer.
If an NSD backed by an underlying logical volume will not come online to a node, it may be due to
varyonvg problems at the volume group layer. Issue the varyoffvg command for the volume group
at all nodes and restart GPFS. On startup, GPFS will varyon any underlying volume groups in proper
sequence.

Disk accessing commands fail to complete due to problems with
some non-IBM disks
Certain disk commands, such as mmcrfs, mmadddisk, mmrpldisk, mmmount and the operating system's
mount, might issue the varyonvg -u command if the NSD is backed by an AIX logical volume.
For some non-IBM disks, when many varyonvg -u commands are issued in parallel, some of the AIX
varyonvg -u invocations do not complete, causing the disk command to hang.
This situation is recognized by the GPFS disk command not completing after a long period of time, and the
persistence of the varyonvg processes as shown by the output of the ps -ef command on some of the
nodes of the cluster. In these cases, kill the varyonvg processes that were issued by the GPFS disk
command on the nodes of the cluster. This allows the GPFS disk command to complete. Before mounting
the affected file system on any node where a varyonvg process was killed, issue the varyonvg -u
command (varyonvg -u vgname) on the node to make the disk available to GPFS. Do this on each of the
nodes in question, one by one, until all of the GPFS volume groups are varied online.
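
For example, a minimal hedged sketch, assuming the hung processes belong to volume group gpfs1vg (the process ID is a placeholder):

ps -ef | grep varyonvg    #identify the hung varyonvg processes
kill <pid>                #kill the varyonvg processes that the GPFS disk command issued
varyonvg -u gpfs1vg       #make the disk available to GPFS again before mounting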

Disk media failure


Procedures to recover lost data in case of disk media failure.
Regardless of whether you have chosen additional hardware or replication to protect your data against
media failures, you first need to determine that the disk has completely failed. If the disk has completely
failed and it is not the path to the disk which has failed, follow the procedures defined by your disk vendor.
Otherwise:
1. Check on the states of the disks for the file system:

mmlsdisk fs1 -e

GPFS will mark disks down if there have been problems accessing the disk.
2. To prevent any I/O from going to the down disk, issue these commands immediately:

mmchdisk fs1 suspend -d gpfs1nsd


mmchdisk fs1 stop -d gpfs1nsd

Note: If there are any GPFS file systems with pending I/O to the down disk, the I/O will time out if the
system administrator does not stop it.
To see if there are any threads that have been waiting a long time for I/O to complete, on all nodes
issue:

mmfsadm dump waiters 10 | grep "I/O completion"

3. The next step is irreversible! Do not run this command unless data and metadata have been replicated.
This command scans file system metadata for disk addresses belonging to the disk in question, then
replaces them with a special "broken disk address" value, which might take a while.

CAUTION: Be extremely careful with using the -p option of mmdeldisk, because by design it
destroys references to data blocks, making affected blocks unavailable. This is a last-resort
tool, to be used when data loss might have already occurred, to salvage the remaining data, which means it cannot take any precautions. If you are not absolutely certain about the state of
the file system and the impact of running this command, do not attempt to run it without first
contacting the IBM Support Center.

mmdeldisk fs1 gpfs1n12 -p

4. Invoke the mmfileid command with the operand :BROKEN:

mmfileid fs1 -d :BROKEN

For more information, see “The mmfileid command” on page 327.



5. After the disk is properly repaired and available for use, you can add it back to the file system.
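
For example, a minimal hedged sketch, assuming the repaired disk is again available as NSD gpfs1nsd and the file system is fs1, as in the earlier steps:

mmadddisk fs1 gpfs1nsd
mmrestripefs fs1 -r    #optionally re-replicate and rebalance data onto the re-added disk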

Replica mismatches
IBM Storage Scale includes logic that ensures data and metadata replicas always have identical content.
A replica mismatch is a condition in which two or more replicas of a data or metadata block differ from
each other. Replica mismatches might happen for the following reasons:
• Corrupted log files that are found by IBM Storage Scale (replica writes are protected by IBM Storage Scale journals).
• Hardware issues such as a missed write or a redirected write that led to stale or corrupted disk blocks.
• Software bugs in administrative options for commands, such as mmchdisk start and mmrestripefs, that fail to sync the replicas.
When a replicated block has mismatched replicas, the wrong replica block can be read and cause further
data integrity issues or application errors.
If the mismatch is in a metadata block and is found by IBM Storage Scale to be corrupted, it is flagged
in the system log as an FSSTRUCT error. The next replica block is then read. While this action does not
disrupt file system operations, it leaves the metadata block with an insufficient number of good replicas.
A failure of the disk that contains the good replica can lead to data or metadata loss. Alternatively, if the
replica block that is read contains valid but stale metadata, then it can lead to further corruption of the
data and metadata. For instance, if the block belongs to the block allocation map system file, then reading
a stale replica block of this file means that IBM Storage Scale sees a stale block allocation state. This
issue might allow IBM Storage Scale to double allocate blocks, which leads to corruption due to block
overwrite. You can identify such a corruption by looking for the FSErrDeallocBlock FSSTRUCT error in the
system log, which is logged at the time of the block's deallocation.
If the mismatch is in a data block, then IBM Storage Scale cannot determine whether the replica that is
read is corrupted since it does not have enough context to validate user data. Hence, applications might
receive corrupted data even when a good replica of the data block is available in the file system.
For these reasons, it is important to be able to repair replica mismatches as soon as they are detected.
You can detect replica mismatches with any of the following methods:
• With an online replica compare:

mmrestripefs <fs> --check-conflicting-replicas

• With a read-only offline fsck:

mmfsck <fs> -nv [-c|-m]

• By comparing the replicas on a per file basis:

mmrestripefile -c --read-only <Filename>

Note: This option detects data and directory block replica mismatches only.
Any of these methods can be used when file reads return stale or invalid data, or if FSErrDeallocBlock
FSSTRUCT errors are in the system logs.
If replica mismatches are detected, then the next step is to make the replicas consistent. The replicas can
be made consistent by choosing a reference replica block to copy over the other replicas of the block. If
they are metadata block replica mismatches, the reference replica block is chosen by IBM Storage Scale.
However, if they are data block replica mismatches, the reference replica should be chosen by the file
owner.



Methods to repair metadata and data block replica mismatches
Use this information to learn about the methods that can be used to repair metadata and data block
replica mismatches.
You can repair metadata block replica mismatches with the -m option of offline fsck. Using an offline
fsck is a requirement because metadata blocks can have two valid but inconsistent replicas. If the
chosen reference replica block happens to be the stale one, it can cause inconsistencies for other
dependent metadata. Such resulting inconsistencies can be repaired reliably by offline fsck only. For more
information, see the mmfsck command in the IBM Storage Scale: Command and Programming Reference
Guide.
Metadata block replica mismatches must be repaired before you can attempt to repair data block replica
mismatches. Otherwise, IBM Storage Scale might read data blocks from the bad metadata block replicas
and cause file system corruption during repair.
After any metadata block replica mismatches are repaired, the file system can be brought online. Then,
you can run an online replica compare operation to find the data block replica mismatches with the
following command:

mmrestripefs <fs> --check-conflicting-replicas

This command will report replica mismatches with output similar to the following:


Scanning user file metadata …
Inode 9824 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 1:173184 3:135808
at block 207
Inode 10304 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 2:145792 1:145920
at block 1

You can choose from two methods to resolve the data block replica mismatches. Both options require the
data block replica repair feature to be enabled with the following command:

mmchconfig readReplicaRuleEnabled=yes -i

This configuration option can be enabled and disabled dynamically, so you do not need to restart the GPFS daemon. Enabling it might cause a small impact on file read performance. It is advised to turn off this configuration option after all of the data block replica mismatches are repaired, and also while performing operations such as restriping or rebalancing the file system. For more information about the two methods to resolve the data block replica mismatches, see “Repairing data block replica mismatches with the global replica selection rule” on page 417 and “Repairing data block replica mismatches with the file level replica selection rule” on page 418.

Repairing data block replica mismatches with the global replica selection rule
Follow this procedure if the data block replica mismatches are due to one or more bad disks.
You can confirm if one or more disks are bad by looking at the frequency of disks that contain mismatched
replicas in the online replica compare operation output. Follow the steps to exclude and repair the data
block replica mismatches.
1. To exclude the bad disks from being read, run the following command:

mmchconfig diskReadExclusionList=<nsd1;nsd2;...> -i

Setting this configuration option prevents the read of data blocks from the specified disks when the
disks have one of the following statuses: ready, suspended, or replacement. If all of the replicas of a
data block are on read-excluded disks, then the data block is fetched from the disk that was specified
earlier in the diskReadExclusionList.



Note: Setting this configuration option does not invalidate existing caches. So if a block of a read-
excluded disk is already cached, then the cached version is returned on block read. Writes to the
excluded disks are not blocked when the disks are available.
This configuration option works by marking the in-memory disk data structure with a flag. The status
and availability of such disks are preserved without any disk configuration changes.
This configuration option can be enabled and disabled dynamically, so you do not need to restart the
GPFS daemon.
2. Validate the files that are reported by the online replica compare operation by processing them
through their associated application. If the files can be read correctly, then the replica mismatches can
be repaired. Otherwise, adjust the diskReadExclusionList.
3. To repair the replica mismatches, run the following command:

mmrestripefile -c --inode-number <SnapPath/InodeNumber>

Where SnapPath is the path to the snapshot root directory, which contains the InodeNumber with
replica mismatches. If the replica mismatch is for a file in the active file system, then SnapPath would
be the path of the root directory of the active file system. For example:

mmrestripefile -c --inode-number /gpfs/fs1/.snapshots/snap2/11138


mmrestripefile -c --inode-number /gpfs/fs1/11138

Run this command on each of the inodes that were reported by the earlier online replica compare
operation.
4. To disable the diskReadExclusionList configuration option, run the following command:

mmchconfig diskReadExclusionList=DEFAULT -i

This method provides a fast way to exclude data block reads from disks with stale data. To exercise
more granular control over which data block replicas are read per file, see “Repairing data block replica
mismatches with the file level replica selection rule” on page 418.
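
The following hedged sketch ties these steps together. It assumes that nsd2 was identified as the bad disk and that the earlier online replica compare reported inodes 9824 and 10304 in the active file system mounted at /gpfs/fs1:

mmchconfig diskReadExclusionList=nsd2 -i
#validate the affected files through their applications, then repair them
mmrestripefile -c --inode-number /gpfs/fs1/9824
mmrestripefile -c --inode-number /gpfs/fs1/10304
mmchconfig diskReadExclusionList=DEFAULT -i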

Repairing data block replica mismatches with the file level replica selection
rule
Follow this procedure if you want to select the reference data block replica among mismatched replicas
on a per file basis.
This method allows for more granular control over which data block replicas are read for a file. Any
user who has permission to write on the file can use this method. Before you start, make sure that you
determine the list of the mismatched replicas of the file with one of the following commands:

mmrestripefs <fs> --check-conflicting-replicas #requires root privileges

mmrestripefile -c --read-only <Filename> #requires read permission on the file

You can figure out the correct data block replicas among the mismatched replicas of the file by setting
a file-specific replica selection rule. This rule is in the form of an extended attribute that is called
readReplicaRule, which is under the gpfs namespace. This rule causes a subsequent read of the file to
return the data block replicas as specified by the rule. You can then validate the file data by processing it
through an associated application.
Note: Setting this extended attribute invalidates existing caches of the file so that subsequent reads of
the file fetch the data from disk.
1. Set the gpfs.readReplicaRule extended attribute with one of the following methods:

mmchattr --set-attr gpfs.readReplicaRule=<"RuleString"> <FilePath>

setfattr -n gpfs.readReplicaRule -v <"RuleString"> <FilePath>



2. You can also set the extended attribute on the files in a directory tree by using a policy rule such as the
following example:

RULE EXTERNAL LIST 'files' EXEC ''


RULE LIST 'files' DIRECTORIES_PLUS ACTION(SetXattr('ATTR', 'ATTRVAL'))

Which can then be applied in a manner such as the following example:

mmapplypolicy <dir> -P <policyRuleFile> -I defer -f /tmp -L 0 -M ATTR="gpfs.readReplicaRule" \
   -M ATTRVAL="<RuleString>"

3. You can also set the extended attribute by using an inode number instead of the file name:

mmchattr --set-attr gpfs.readReplicaRule=<"RuleString"> --inode-number <SnapPath/InodeNumber>

Note: This action requires root privilege.


Where SnapPath is the path to the snapshot root directory, which contains the InodeNumber with
replica mismatches. If the replica mismatch is for a file in the active file system, then SnapPath would
be the path of the root directory of the active file system. For example:

mmchattr --set-attr gpfs.readReplicaRule=<"RuleString"> --inode-number /gpfs/fs1/.snapshots/snap2/11138

mmchattr --set-attr gpfs.readReplicaRule=<"RuleString"> --inode-number /gpfs/fs1/11138

The gpfs.readReplicaRule extended attribute can be set on any valid file (not a directory or soft link)
including clone parents, snapshot user files, and files with immutable or append-only flags. In most
cases, to set this extended attribute, you need permission to write on the file. However, if you own the
file, you can set the attribute even if you have only the permission to read the file. This attribute can be
set even on a read-only mounted file system.
This attribute is specific to a file, so it does not get copied during DMAPI backup, AFM, clone creation,
or snapshot copy-on-write operations. Similarly, this attribute cannot be restored from a snapshot.
If you do not have enough space to store the gpfs.readReplicaRule extended attribute, then you can
temporarily delete one or more of the existing user-defined extended attributes. After you repair the
replica mismatches in the file, you can delete the gpfs.readReplicaRule extended attribute and restore the
earlier user-defined attributes.
To save and restore all of the extended attributes for a file, run the following commands:

getfattr --absolute-names --dump -m "-" <FilePath> > /tmp/attr.save

setfattr --restore=/tmp/attr.save

Note: It is possible to filter out all of the replicas of a data block by using the gpfs.readReplicaRule
extended attribute. In such a case, the block read fails with I/O error.
Files with the gpfs.readReplicaRule extended attribute might experience a small impact on read
performance because of parsing the rule string for every data block that is read from the disk. Thus,
it is advised to delete the gpfs.readReplicaRule extended attribute after you repair the data block replica
mismatches.
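
For example, after the repair you might remove the attribute with either of the following commands (a hedged sketch; <FilePath> is the repaired file, and the option names should be confirmed against your mmchattr and setfattr versions):

mmchattr --delete-attr gpfs.readReplicaRule <FilePath>

setfattr -x gpfs.readReplicaRule <FilePath>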

Format of the gpfs.readReplicaRule string


Use this information to learn about the grammar of the gpfs.readReplicaRule string, which is in EBNF
(Extended Backus-Naur Form) notation.

rule
= block_sub_rule | file_sub_rule |
(block_sub_rule, "; ", file_sub_rule);

Note: The block_sub_rule applies to the specified blocks only; the file_sub_rule
applies to all of the blocks in the file; sub_rules are evaluated left to right and
the first sub_rule that matches a block is applied to the block and the rest of the
sub_rules are ignored.

block_sub_rule = ("b=", block_list, " r=", replica_index), {"; ", block_sub_rule};

block_list = (block_index | block_index_range), {",", block_list};


block_index = "0" | "1" | "2" | ...;
Note: block_index beyond last block is ignored.

block_index_range = block_index, ":", block_index;


replica_index = ("0" | "1" | "2" | "3" | "x"), {",",replica_index};
Note: "x" selects the last valid replica in the replica set; if a block has valid replicas,
but the replica_index does not resolve to a valid replica, then the block read returns
an error.

file_sub_rule = ("d=", disk_exclusion_list) | ("r=", replica_index)

Note: disk_exclusion_list discards matching disk addresses from the replica set.

disk_exclusion_list = (disk_num | disk_num_range), {",", disk_exclusion_list};


disk_num = "1" | "2" | ...;
disk_num_range = disk_num, ":", disk_num;
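
For illustration, the following hedged rule strings conform to this grammar (the block, replica, and disk numbers are hypothetical):

r=1                  #read replica index 1 for every block of the file
b=0,5 r=1; r=0       #blocks 0 and 5 from replica index 1, all other blocks from replica index 0
b=0:10 r=2,x; d=3    #blocks 0 through 10 from replica index 2, else the last valid replica; other blocks exclude disk 3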

When a data block of a file is read from disk, all of the replica disk addresses of the data block are
filtered by either an existing gpfs.readReplicaRule extended attribute or an existing diskReadExclusionList
configuration option. If both are present, then the gpfs.readReplicaRule extended attribute is evaluated
first. If it fails to match the block that is being read, then the diskReadExclusionList configuration option
is applied instead. The data block is then read using only the filtered set of replica disk addresses. If the
read fails due to an I/O error or because the filtered set of replica disk addresses is empty, then the error
is returned to the application.
For more information, see “Example of using the gpfs.readReplicaRule string” on page 420.

Example of using the gpfs.readReplicaRule string


The following example illustrates the application of gpfs.readReplicaRule during a file read.

mmchattr --set-attr gpfs.readReplicaRule="b=1:2 r=1,x; b=3 r=1; b=3 r=0; d=3,1" <filename>

The following data block disk address list is for a file with 2 DataReplicas and 3 MaxDataReplicas:

0: 1:3750002688 1: 2:4904091648 2: (null)


3: 3:4860305408 4: 1:3750002688 5: (null)
6: 2:3750002688 7: (null) 8: (null)
9: 3:3750002688 10: (null) 11: (null)
12: (null) 13: (null) 14: (null)

The replica rule picks the reference replica for each block that is read as follows:

block 0: The initial replica set is (1:3750002688, 2:4904091648).


There is no matching block_sub_rule. The file_sub_rule “d=3,1”
excludes disks 3 and 1 which leaves the replica set as (2:4904091648).
So the block will be read from DA 2:4904091648.

block 1: The initial replica set is (3:4860305408, 1:3750002688).


There is a matching block_sub_rule “b=1:2 r=1,x” which picks the replica
DA 1:3750002688 for read. If GPFS fails to read from this DA, it will not
try the other replica and returns a read error.

block 2: The initial replica set is (2:3750002688).


There is a matching block_sub_rule “b=1:2 r=1,x” which first picks replica index
1 from the set. Since there is no such replica, the next replica_index in the rule is
then evaluated, which is ‘x’, that is, the last valid replica in the replica set. This is
DA 2:3750002688.

block 3: The initial replica set is (3:3750002688).


There is a matching block_sub_rule “b=3 r=1” which picks replica index 1
from the set. Since there is no such replica, a read error is returned.
Note that the block_sub_rule “b=3 r=0” is not applied as only the first matching
block_sub_rule is applied.

block 4: This is a hole. So there is no replica rule evaluation done and


the read returns a block of zeroes.

If you have permission to read a file, you can check whether the gpfs.readReplicaRule extended attribute
is set in the file by one of the following methods:

mmlsattr --get-attr gpfs.readReplicaRule <FilePath>

mmlsattr -n gpfs.readReplicaRule <FilePath>

getfattr --absolute-names -n gpfs.readReplicaRule <FilePath>

Alternatively, a policy rule such as the following example can be used to show the gpfs.readReplicaRule
extended attribute:

DEFINE(DISPLAY_NULL,[COALESCE($1,'_NULL_')])
RULE EXTERNAL LIST 'files' EXEC ''
RULE LIST 'files' DIRECTORIES_PLUS SHOW(DISPLAY_NULL(XATTR('ATTR')))

The policy rule can be applied in the following way:

mmapplypolicy <dir> -P <policyRuleFile> -I defer -f /tmp -L 0 -M ATTR="gpfs.readReplicaRule"

The extended attribute can also be queried by using the inode number instead of the file name:
Note: This action requires root privilege.

mmlsattr --get-attr gpfs.readReplicaRule --inode-number <SnapPath/InodeNumber>

To verify the effects of a gpfs.readReplicaRule string, you can dump the distribution of disk numbers for
each replica of each block of the file by using any of the following commands:

mmlsattr -D <FilePath>

mmlsattr --dump-data-block-disk-numbers <FilePath>

Running one of those commands provides output similar to the following example:

Block Index Replica 0 Replica 1 Replica 2


----------- --------- --------- ---------
0 *1 2 -
1 3 *1 -
2 *2 - -
3 3 - - e
4 - - -

A disk number that is prefixed with an asterisk indicates the data block replica that is read from the disk
for that block. By default, the first valid data block replica is always returned on read. An e at the end of
a row indicates that the gpfs.readReplicaRule selected an invalid replica and hence the read of this block
returns an error. You can change which data block replica is read by changing the readReplicaPolicy global
configuration option, the diskReadExclusionList global configuration option, or the gpfs.readReplicaRule
extended attribute. Thus, dumping the disk distribution in this way is a good method to check the effects of these settings.



To dump the disk distribution with the inode number instead of the file name, run the following
commands:

mmlsattr -D --inode-number <SnapPath/InodeNumber>

mmlsattr --dump-data-block-disk-numbers --inode-number <SnapPath/InodeNumber>

Note: This action requires root privilege.


After the correct gpfs.readReplicaRule string is determined and the file data is verified to be valid, then the
data block replica mismatches in the file can be repaired with the following command:

mmrestripefile -c <Filename>

Note: To run this command, the file system must be mounted read/write. Additionally, you must have
permission to write on the file. This command is always audited to syslog for non-root users.
After the repair is completed, you can delete the gpfs.readReplicaRule extended attribute. Alternatively,
you can defer repair of the file and continue to do read and write operations on the file with the
gpfs.readReplicaRule extended attribute present.
The following steps summarize the workflow:
1. Identify the mismatched data block replicas in a file.
2. Ensure that the readReplicaRuleEnabled global configuration option is set to yes.
3. Write the gpfs.readReplicaRule extended attribute to select a replica index for each data block with
mismatched replicas.
4. Verify that the gpfs.readReplicaRule extended attribute selects the replicas as expected with the
mmlsattr -D command.
5. Validate the file by processing it through its associated application.
Note: The validation process should involve only reads of the file. Any attempt to write to the
blocks with mismatched replicas will overwrite all replicas. If the replica that is selected by the
gpfs.readReplicaRule extended attribute is incorrect, then writing to the block using the bad replica will
permanently corrupt the block.
6. If the file validation fails, then retry steps 3, 4, and 5 with a different replica index for the problem data blocks.
7. After the file passes validation, repair the data block replica mismatches in the file with the
mmrestripefile -c command.
8. Delete the gpfs.readReplicaRule extended attribute.

Replicated metadata and data


The course of action to be followed to recover lost files if you have replicated metadata and data and only disks in a single failure group have failed.
If you have replicated metadata and data and only disks in a single failure group have failed, everything
should still be running normally but with slightly degraded performance. You can determine the
replication values set for the file system by issuing the mmlsfs command. Proceed with the appropriate
course of action:
1. After the failed disk has been repaired, issue an mmadddisk command to add the disk to the file
system:

mmadddisk fs1 gpfs12nsd

You can rebalance the file system at the same time by issuing:

mmadddisk fs1 gpfs12nsd -r



Note: Rebalancing of files is an I/O intensive and time consuming operation, and is important only for
file systems with large files that are mostly invariant. In many cases, normal file update and creation
will rebalance your file system over time, without the cost of the rebalancing.
2. To re-replicate data that has only a single copy, issue:

mmrestripefs fs1 -r

Optionally, use the -b flag instead of the -r flag to rebalance across all disks.
Note: Rebalancing of files is an I/O intensive and time consuming operation, and is important only for
file systems with large files that are mostly invariant. In many cases, normal file update and creation
will rebalance your file system over time, without the cost of the rebalancing.
3. Optionally, check the file system for metadata inconsistencies by issuing the offline version of mmfsck:

mmfsck fs1

Even if mmfsck succeeds, errors might have occurred. Verify that no files were lost. If files containing user data were lost, you must restore the files from the backup media.
If mmfsck fails, sufficient metadata was lost and you need to re-create your file system and restore the data from backup media.

Replicated metadata only


Using replicated metadata for lost data recovery.
If you have only replicated metadata, you should be able to recover some, but not all, of the user data.
Recover any data to be kept using normal file operations or erase the file. If you read a file in block-size
chunks and get a failure return code and an EIO errno, that block of the file has been lost. The rest of the
file may have useful data to recover, or it can be erased.

Strict replication
Use the mmchfs -K no command to disable strict replication before you perform a disk action.
If data or metadata replication is enabled, and the status of an existing disk changes so that the disk
is no longer available for block allocation (if strict replication is enforced), you may receive an errno of
ENOSPC when you create or append data to an existing file. A disk becomes unavailable for new block
allocation if it is being deleted, replaced, or it has been suspended. If you need to delete, replace, or
suspend a disk, and you need to write new data while the disk is offline, you can disable strict replication
by issuing the mmchfs -K no command before you perform the disk action. However, data written while
replication is disabled will not be replicated properly. Therefore, after you perform the disk action, you
must re-enable strict replication by issuing the mmchfs -K command with the original value of the -K
option (always or whenpossible) and then run the mmrestripefs -r command. To determine if a
disk has strict replication enforced, issue the mmlsfs -K command.
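
A minimal hedged sketch of this sequence, assuming file system fs1, an original -K value of whenpossible, and a hypothetical disk gpfs3nsd that must be suspended:

mmlsfs fs1 -K                       #record the original strict replication setting
mmchfs fs1 -K no                    #disable strict replication
mmchdisk fs1 suspend -d gpfs3nsd    #perform the disk action
mmchfs fs1 -K whenpossible          #restore the original -K value
mmrestripefs fs1 -r                 #re-replicate data written while strict replication was disabled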
Note: A disk in a down state that has not been explicitly suspended is still available for block allocation,
and thus a spontaneous disk failure will not result in application I/O requests failing with ENOSPC. While
new blocks will be allocated on such a disk, nothing will actually be written to the disk until its availability
changes to up following an mmchdisk start command. Missing replica updates that took place while
the disk was down will be performed when mmchdisk start runs.

No replication
If no replication has been done and the system metadata has been lost, you might need to force the unmount yourself. Follow this course of action to salvage what data you can.
When there is no replication, the system metadata has been lost and the file system is basically
irrecoverable. You may be able to salvage some of the user data, but it will take work and time. A forced
unmount of the file system will probably already have occurred. If not, it probably will very soon if you try
to do any recovery work. You can manually force the unmount yourself:



1. Mount the file system in read-only mode (see “Read-only mode mount” on page 318). This will
bypass recovery errors and let you read whatever you can find. Directories may be lost and give errors,
and parts of files will be missing. Get what you can now, for all will soon be gone. On a single node,
issue:

mount -o ro /dev/fs1

2. If you read a file in block-size chunks and get an EIO return code, that block of the file has been lost. The rest of the file may have useful data to recover, or it can be erased. To save the file system parameters for re-creation of the file system, issue:

mmlsfs fs1 > fs1.saveparms

Note: This next step is irreversible!


To delete the file system, issue:

mmdelfs fs1

3. To repair the disks, see your disk vendor problem determination guide. Follow the problem
determination and repair actions specified.
4. Delete the affected NSDs. Issue:

mmdelnsd nsdname

The system displays output similar to this:

mmdelnsd: Processing disk nsdname


mmdelnsd: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

5. Create a disk descriptor (stanza) file for the disks to be used. This includes recreating NSDs for the new file system (see the example stanza after this list).
6. Recreate the file system with either different parameters or the same as you used before. Use the disk
descriptor file.
7. Restore lost data from backups.
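
Returning to step 5, the following is a minimal hedged sketch of a disk stanza file (device names, server names, and values are hypothetical) that can be passed to the mmcrnsd command:

%nsd:
  device=/dev/sdb
  nsd=gpfs10nsd
  servers=k145n01,k145n02
  usage=dataAndMetadata
  failureGroup=1

mmcrnsd -F /tmp/nsd.stanza

The same stanza file can then be used with the mmcrfs command when you re-create the file system in step 6.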

GPFS error messages for disk media failures


There are some GPFS error messages associated with disk media failures.
Disk media failures can be associated with these GPFS message numbers:
6027-418
Inconsistent file system quorum. readQuorum=value writeQuorum=value quorumSize=value
6027-482 [E]
Remount failed for device name: errnoDescription
6027-485
Perform mmchdisk for any disk failures and re-mount.
6027-636 [E]
Disk marked as stopped or offline.

Error numbers specific to GPFS application calls when disk failure occurs
There are certain error numbers associated with GPFS application calls when disk failure occurs.
When a disk failure has occurred, GPFS may report these error numbers in the operating system error log,
or return them to an application:
EOFFLINE = 208, Operation failed because a disk is offline
This error is most commonly returned when an attempt to open a disk fails. Since GPFS will attempt
to continue operation with failed disks, this will be returned when the disk is first needed to complete



a command or application request. If this return code occurs, check your disk for stopped states, and
check to determine if the network path exists.
To repair the disks, see your disk vendor problem determination guide. Follow the problem
determination and repair actions specified.
ENO_MGR = 212, The current file system manager failed and no new manager could be appointed.
This error usually occurs when a large number of disks are unavailable or when there has been a
major network failure. Run the mmlsdisk command to determine whether disks have failed. If disks
have failed, check the operating system error log on all nodes for indications of errors. Take corrective
action by issuing the mmchdisk command.
To repair the disks, see your disk vendor problem determination guide. Follow the problem
determination and repair actions specified.

Persistent Reserve errors


You can use Persistent Reserve (PR) to provide faster failover times between disks that support this
feature. PR allows the stripe group manager to "fence" disks during node failover by removing the
reservation keys for that node. In contrast, non-PR disk failovers cause the system to wait until the disk
lease expires.
GPFS allows file systems to have a mix of PR and non-PR disks. In this configuration, GPFS fences PR disks for node failures and recovery, and non-PR disks use disk leasing. If all of the disks are PR disks, disk leasing is not used, so recovery times improve.
GPFS uses the mmchconfig command to enable PR. Issuing this command with the appropriate
usePersistentReserve option configures disks automatically. If this command fails, the most likely
cause is either a hardware or device driver problem. Other PR-related errors will probably be seen as file
system unmounts that are related to disk reservation problems. This type of problem should be debugged
with existing trace tools.

Understanding Persistent Reserve


The AIX server displays the values of reserve_policy and PR_key_value for Persistent Reserve. Use the chdev command to set the values of reserve_policy and PR_key_value.
Note: While Persistent Reserve (PR) is supported on both AIX and Linux, reserve_policy is applicable only
to AIX.
Persistent Reserve refers to a set of Small Computer Systems Interface-3 (SCSI-3) standard commands
and command options. These PR commands and command options give SCSI initiators the ability to
establish, preempt, query, and reset a reservation policy with a specified target disk. The functions
provided by PR commands are a superset of current reserve and release mechanisms. These functions
are not compatible with legacy reserve and release mechanisms. Target disks can only support
reservations from either the legacy mechanisms or the current mechanisms.
Note: Attempting to mix Persistent Reserve commands with legacy reserve and release commands will
result in the target disk returning a reservation conflict error.
Persistent Reserve establishes an interface through a reserve_policy attribute for SCSI disks. You can
optionally use this attribute to specify the type of reservation that the device driver will establish before
accessing data on the disk. For devices that do not support the reserve_policy attribute, the drivers will
use the value of the reserve_lock attribute to determine the type of reservation to use for the disk. GPFS
supports four values for the reserve_policy attribute:
no_reserve
Specifies that no reservations are used on the disk.
single_path
Specifies that legacy reserve/release commands are used on the disk.
PR_exclusive
Specifies that Persistent Reserve is used to establish exclusive host access to the disk.



PR_shared
Specifies that Persistent Reserve is used to establish shared host access to the disk.
Persistent Reserve support affects both the parallel (scdisk) and SCSI-3 (scsidisk) disk device drivers and
configuration methods. When a device is opened (for example, when the varyonvg command opens the
underlying hdisks), the device driver checks the ODM for reserve_policy and PR_key_value and then opens
the device appropriately. For PR, each host attached to the shared disk must use unique registration key
values for reserve_policy and PR_key_value. On AIX, you can display the values assigned to reserve_policy
and PR_key_value by issuing:

lsattr -El hdiskx -a reserve_policy,PR_key_value

If needed, use the AIX chdev command to set reserve_policy and PR_key_value.
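
For example, a hedged sketch (the hdisk number is hypothetical, and GPFS normally manages these attributes for you, as described in the note that follows):

lsattr -El hdisk3 -a reserve_policy,PR_key_value    #display the current values
chdev -l hdisk3 -a reserve_policy=no_reserve        #example: set the reservation policy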
Note: GPFS manages reserve_policy and PR_key_value using reserve_policy=PR_shared when
Persistent Reserve support is enabled and reserve_policy=no_reserve when Persistent Reserve
is disabled.

Checking Persistent Reserve


Follow this course of action to determine the Persistent Reserve status in the cluster.
For Persistent Reserve to function properly, you must have PR enabled on all of the disks that are
PR-capable. To determine the PR status in the cluster:
1. Determine if PR is enabled on the cluster
a) Issue mmlsconfig
b) Check for usePersistentReserve=yes
2. Determine if PR is enabled for all disks on all nodes
a) Make sure that GPFS has been started and mounted on all of the nodes
b) Enable PR by issuing mmchconfig
c) Issue the command mmlsnsd -X and look for pr=yes on all the hdisk lines
Notes:
1. To view the keys that are currently registered on a disk, issue the following command from a node that
has access to the disk:

/usr/lpp/mmfs/bin/tsprreadkeys hdiskx

2. To check the AIX ODM status of a single disk on a node, issue the following command from a node that
has access to the disk:

lsattr -El hdiskx -a reserve_policy,PR_key_value

Clearing a leftover Persistent Reserve reservation


You can clear a leftover Persistent Reserve reservation.
Message number 6027-2202 indicates that a specified disk has a SCSI-3 PR reservation, which prevents
the mmcrnsd command from formatting it. The following example is specific to a Linux environment.
Output on AIX is similar but not identical.
Before trying to clear the PR reservation, use the following instructions to verify that the disk is
really intended for GPFS use. Note that in this example, the device name is specified without a prefix
(/dev/sdp is specified as sdp).
1. Display all the registration key values on the disk:

/usr/lpp/mmfs/bin/tsprreadkeys sdp

The system displays information similar to:



Registration keys for sdp
1. 00006d0000000001

If the registered key values all start with 0x00006d (which indicates that the PR registration was issued by GPFS), proceed to the next step to verify the SCSI-3 PR reservation type. Otherwise, contact your system administrator for information about clearing the disk state.
2. Display the reservation type on the disk:

/usr/lpp/mmfs/bin/tsprreadres sdp

The system displays information similar to:

yes:LU_SCOPE:WriteExclusive-AllRegistrants:0000000000000000

If the output indicates a PR reservation with type WriteExclusive-AllRegistrants, proceed to the following instructions for clearing the SCSI-3 PR reservation on the disk.
If the output does not indicate a PR reservation with this type, contact your system administrator for
information about clearing the disk state.
To clear the SCSI-3 PR reservation on the disk, follow these steps:
1. Choose a hex value (HexValue), for example 0x111abc, that is not in the output of the tsprreadkeys command that was run previously. Register the local node to the disk by entering the following command with the chosen HexValue:

/usr/lpp/mmfs/bin/tsprregister sdp 0x111abc

2. Verify that the specified HexValue has been registered to the disk:

/usr/lpp/mmfs/bin/tsprreadkeys sdp

The system displays information similar to:

Registration keys for sdp


1. 00006d0000000001
2. 0000000000111abc

3. Clear the SCSI-3 PR reservation on the disk:

/usr/lpp/mmfs/bin/tsprclear sdp 0x111abc

4. Verify that the PR registration has been cleared:

/usr/lpp/mmfs/bin/tsprreadkeys sdp

The system displays information similar to:

Registration keys for sdp

5. Verify that the reservation has been cleared:

/usr/lpp/mmfs/bin/tsprreadres sdp

The system displays information similar to:

no:::

The disk is now ready to use for creating an NSD.



Manually enabling or disabling Persistent Reserve
The PR status can be set manually with the help of the IBM Support Center.
Attention: Manually enabling or disabling Persistent Reserve should only be done under the
supervision of the IBM Support Center with GPFS stopped on the node.
The IBM Support Center will help you determine if the PR state is incorrect for a disk. If the PR state is
incorrect, you may be directed to correct the situation by manually enabling or disabling PR on that disk.

GPFS is not using the underlying multipath device


You can view the underlying disk device where I/O is performed on an NSD disk by using the mmlsdisk
command with the -M option.
The mmlsdisk command output might show unexpected results for multipath I/O devices. For example
if you issue this command:

mmlsdisk dmfs2 -M

The system displays information similar to:

Disk name IO performed on node Device Availability


------------ ----------------------- ----------------- ------------
m0001 localhost /dev/sdb up

The following command is available on Linux only.

# multipath -ll
mpathae (36005076304ffc0e50000000000000001) dm-30 IBM,2107900
[size=10G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=8][active]
\_ 1:0:5:1 sdhr 134:16 [active][ready]
\_ 1:0:4:1 sdgl 132:16 [active][ready]
\_ 1:0:1:1 sdff 130:16 [active][ready]
\_ 1:0:0:1 sddz 128:16 [active][ready]
\_ 0:0:7:1 sdct 70:16 [active][ready]
\_ 0:0:6:1 sdbn 68:16 [active][ready]
\_ 0:0:5:1 sdah 66:16 [active][ready]
\_ 0:0:4:1 sdb 8:16 [active][ready]

The mmlsdisk output shows that I/O for NSD m0001 is being performed on disk /dev/sdb, but it should
show that I/O is being performed on the device-mapper multipath (DMM) /dev/dm-30. Disk /dev/sdb is
one of eight paths of the DMM /dev/dm-30 as shown from the multipath command.
This problem could occur for the following reasons:
• The previously installed user exit /var/mmfs/etc/nsddevices is missing. To correct this, restore
user exit /var/mmfs/etc/nsddevices and restart GPFS.
• The multipath device type does not match the GPFS known device type. For a list of known device
types, see /usr/lpp/mmfs/bin/mmdevdiscover. After you have determined the device type for
your multipath device, use the mmchconfig command to change the NSD disk to a known device type
and then restart GPFS.
The following output shows that device type dm-30 is dmm:

/usr/lpp/mmfs/bin/mmdevdiscover | grep dm-30


dm-30 dmm

To change the NSD device type to a known device type, create a file that contains the NSD name and
device type pair (one per line) and issue this command:

mmchconfig updateNsdType=/tmp/filename



where the contents of /tmp/filename are:

m0001 dmm

The system displays information similar to:

mmchconfig: Command successfully completed


mmchconfig: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

Kernel panics with the message "GPFS deadman switch timer has
expired and there are still outstanding I/O requests"
This problem can be detected by an error log entry with a label of KERNEL_PANIC and an associated PANIC MESSAGES or PANIC STRING value.
For example:

GPFS Deadman Switch timer has expired, and there are still outstanding I/O requests

GPFS is designed to tolerate node failures through per-node metadata logging (journaling). The log file is
called the recovery log. In the event of a node failure, GPFS performs recovery by replaying the recovery
log for the failed node, thus restoring the file system to a consistent state and allowing other nodes to
continue working. Prior to replaying the recovery log, it is critical to ensure that the failed node has indeed
failed, as opposed to being active but unable to communicate with the rest of the cluster.
In the latter case, if the failed node has direct access (as opposed to accessing the disk with an NSD
server) to any disks that are a part of the GPFS file system, it is necessary to ensure that no I/O requests
submitted from this node complete once the recovery log replay has started. To accomplish this, GPFS
uses the disk lease mechanism. The disk leasing mechanism guarantees that a node does not submit any
more I/O requests once its disk lease has expired, and the surviving nodes use disk lease time out as a
guideline for starting recovery.
This situation is complicated by the possibility of 'hung I/O'. If an I/O request is submitted prior to the disk
lease expiration, but for some reason (for example, device driver malfunction) the I/O takes a long time
to complete, it is possible that it may complete after the start of the recovery log replay during recovery.
This situation would present a risk of file system corruption. In order to guard against such a contingency,
when I/O requests are being issued directly to the underlying disk device, GPFS initiates a kernel timer
that is referred to as the deadman switch timer. The deadman switch timer goes off when the
disk lease expires and checks whether there are any outstanding I/O requests. If any I/O is
pending, a kernel panic is initiated to prevent possible file system corruption.
Such a kernel panic is not an indication of a software defect in GPFS or the operating system kernel, but
rather it is a sign of one of the following:
1. Network problems (the node is unable to renew its disk lease).
2. Problems accessing the disk device (I/O requests take an abnormally long time to complete). See
“MMFS_LONGDISKIO” on page 276.

Chapter 26. GPUDirect Storage troubleshooting
The troubleshooting information for GPUDirect Storage (GDS) is primarily available in the NVIDIA
documentation. IBM Storage Scale also provides some troubleshooting options for GDS-related issues.
Run the NVIDIA GDS utility gdscheck -p before you run the GDS workloads to verify the setup. You
need Python3 installed on the node to run this utility. Verify the status of PCIe Access Control Services
(ACS) and PCIe Input/Output Memory Management Unit (IOMMU), as these components affect GDS
function and performance. The output of gdscheck -p must display the following status for the IOMMU and ACS components:

IOMMU disabled
ACS disabled

If you want to enable tracing of your CUDA application, adjust the following corresponding settings, which are available in /etc/cufile.json.
Note: These settings might impact GDS performance.
• Log level: Level of information to be logged.
• Log location: By default, the trace is written into the current working directory of the CUDA application.
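
For reference, a hedged sketch of the corresponding section in /etc/cufile.json; the key names follow the NVIDIA cuFile defaults and can differ between CUDA releases, and the log directory shown is an assumption:

"logging": {
    "dir": "/var/log/gds",
    "level": "TRACE"
}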
Troubleshooting information in NVIDIA documentation is available at GPUDirect Storage Troubleshooting.

Troubleshooting options available in IBM Storage Scale


GPUDirect Storage (GDS) in IBM Storage Scale is integrated with the system health monitoring.
You can use the mmhealth command to monitor the health status. Check for the following details in the
mmhealth command output:
• File system manager status: Examples - down and quorum.
• Network status: Examples - IB fabric broken and devices down.
• File system status: Examples - Not mounted and broken.
You can run the mmhealth command as shown in the following example:

# mmhealth node show

Node name: fscc-x36m3-32-hs


Node status: DEGRADED
Status Change: 19 days ago

Component Status Status Change Reasons


----------------------------------------------------------------------------
FILESYSMGR HEALTHY 17 days ago -
GPFS DEGRADED 19 days ago mmfsd_abort_warn
NETWORK HEALTHY 29 days ago -
FILESYSTEM HEALTHY 14 days ago -
GUI HEALTHY 29 days ago -
PERFMON HEALTHY 18 days ago -
THRESHOLD HEALTHY 29 days ago -

You can use the mmhealth node show GDS command to check the health status of the GDS
component. For more information about the various options that are available with mmhealth command,
see mmhealth command in IBM Storage Scale: Command and Programming Reference Guide.
Error recovery
CUDA retries failed GDS read and GDS write requests in the compatibility mode. As the retry is a regular
POSIX read() or write() system call, all GPFS limitations regarding error recovery apply in general.
Restriction counters



If performance is not as expected, this could indicate that one or more of the GDS restrictions for GDS
reads or writes have been encountered.
Each counter represents the number of GDS I/O requests that caused a fallback to the compatibility mode
for a particular reason.
In IBM Storage Scale 5.1.3, the mmdiag command is enhanced to print the diagnostic information for
GDS. The mmdiag --gds command displays a list of counters, representing the GDS operations returned
to CUDA due to a restriction. A restricted GDS operation is returned to the CUDA layer and retried in
compatibility mode. The following output shows the GDS restriction counters:

# mmdiag --gds

=== mmdiag: gds ===

GPU Direct Storage restriction counters:

file less than 4k 0


sparse file 0
snapshot file 0
clone file 0
encrypted file 0
memory mapped file 0
compressed file 0
append to file 0
increase file size 0
dioWanted fail 0
nsdServerDownlevel 0
nsdServerGdsRead 0
RDMA target port is down 0
RDMA initiator port is down 0
RDMA work request errors 0
no RDMA connection to NSD server (transient error) 0
no RDMA connection to NSD server (permanent error) 0

The following list describes the restriction counters:

file less than 4k
   GDS performs read on a file with a file size of less than 4096 bytes.
sparse file
   GDS performs read on a sparse section within a file.
snapshot file
   GDS performs read on a snapshot file.
clone file
   GDS performs read on a clone section within a file.
encrypted file
   GDS performs read on an encrypted file.
memory mapped file
   GDS performs read on a memory mapped file.
compressed file
   GDS performs read on a compressed file.
append to file
   A new block had to be allocated for appending data to a file.
increase file size
   The file size had to be increased, and a new block has been allocated.
dioWanted fail
   GDS performs read on a file where the internal function dioWanted failed.
nsdServerDownlevel
   GDS performs read on file data, which is stored on an NSD server that is running GPFS 5.1.1 or a previous version.
nsdServerGdsRead
   GDS performs read on file data, which is stored on a disk attached to the local GPFS node.
RDMA target port is down
   GDS performs read through an RDMA adapter port on a GDS client, which is in the down state.
RDMA initiator port is down
   GDS performs read through an RDMA adapter port on an NSD server, which is in the down state.
RDMA work request errors
   The RDMA operation for a GDS read request failed.
no RDMA connection to NSD server (transient error)
   Transient RDMA error.
no RDMA connection to NSD server (permanent error)
   Permanent RDMA error.

mmfslog
The GDS feature of IBM Storage Scale provides specific entries in the mmfs log file that indicate a successful initialization.

# grep "VERBS DC" mmfs.log.latest


2021-05-05_15:55:32.729-0400: [I] VERBS DC RDMA library libmlx5.so loaded.
2021-05-05_15:55:32.729-0400: [I] VERBS DC API loaded.
2021-05-05_15:55:32.986-0400: [I] VERBS DC API initialized.

If the IBM Storage Scale log file contains the following warning message:

[W] VERBS RDMA open error verbsPort <port> due to missing support for atomic operations for
device <device>

Check the description of the verbsRdmaWriteFlush configuration variable in the mmchconfig command
topic in IBM Storage Scale: Command and Programming Reference Guide for possible options.
Syslog
Detailed information about the NVIDIA driver registration and de-registration can be found in the syslog in
case of errors. The corresponding messages look similar to:

Apr 12 00:48:53 c73u34 kernel: ibm_scale_v1_register_nvfs_dma_ops()


Apr 12 00:49:14 c73u34 kernel: ibm_scale_v1_unregister_nvfs_dma_ops()

Traces
Specific GDS I/O traces can be generated by using the mmtracectl command. For more details, see
mmtracectl command in IBM Storage Scale: Command and Programming Reference Guide.
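
For example, a minimal hedged sketch of capturing a trace around a failing GDS workload (the node name is a placeholder; the trace classes and levels to use are workload-specific, so confirm them with the mmtracectl documentation or IBM Support):

mmtracectl --start -N <NodeName>
#reproduce the failing GDS I/O
mmtracectl --stop -N <NodeName>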
Support data
If all previous steps do not help and support needs to collect debug data, use the gpfs.snap command
to download all relevant files and diagnostic data to analyze the potential issues. For more details about
the various options that are available with the gpfs.snap command, see gpfs.snap command in IBM
Storage Scale: Command and Programming Reference Guide.

Common errors
1. RDMA is not enabled.
GPUDirect Storage (GDS) requires RDMA to be enabled. If RDMA is not enabled, an I/O error
(EIO=-5) occurs as shown in the following example:

# gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
Error: IO failed stopping traffic, fd :33 ret:-5 errno :1
io failed :ret :-5 errno :1, file offset :0, block size :1048576

When such an error occurs, verify that the system is configured correctly. For more information,
see the Configuring GPUDirect Storage for IBM Storage Scale topic in IBM Storage Scale RAID:
Administration.



Important: Ensure that the verbsRdma option is enabled (verbsRdma=enable); a configuration check sequence is sketched after this list.
2. RDMA device addresses are set incorrectly.
If the list of addresses of the RDMA devices in the /etc/cufile.json file is empty, the following
error occurs:

# gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
Error: IO failed stopping traffic, fd :27 ret:-5008 errno :17
io failed : GPUDirect Storage not supported on current file, file offset :0, block size
:1048576

When such an error occurs, verify that the system is configured correctly. For more information, see
the Configuring GPUDirect Storage for IBM Storage Scale topic in IBM Storage Scale: Administration
Guide.
Important: Ensure that the rdma_dev_addr_list configuration parameter has the correct value in
the /etc/cufile.json file.
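
For both errors, the following hedged sketch shows one way to check the relevant settings from a node that runs the GDS workload (the attribute and parameter names are taken from the text above; adapt them to your environment):

mmlsconfig verbsRdma verbsPorts             #confirm that RDMA is enabled and that ports are configured
grep rdma_dev_addr_list /etc/cufile.json    #confirm that the RDMA device address list is populated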



Chapter 27. Security issues
This topic describes some security issues that you might encounter while using IBM Storage Scale.

Encryption issues
The topics that follow provide solutions for problems that might be encountered while setting up or using
encryption.

Unable to add encryption policies


If the mmchpolicy command fails when you are trying to add encryption policies, perform the following
diagnostic steps:
1. Confirm that the gpfs.crypto and gpfs.gskit packages are installed.
2. Confirm that the file system is at GPFS 4.1 or later and the fast external attributes (--fastea) option
is enabled.
3. Examine the error messages that are logged in the mmfs.log.latest file, which is located at /var/adm/ras/mmfs.log.latest.
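
For checks 1 and 2 in the preceding list, a minimal hedged sketch, assuming an RPM-based node and a file system named fs1:

rpm -qa | grep -e gpfs.crypto -e gpfs.gskit    #confirm that the packages are installed
mmlsfs fs1 -V --fastea                         #confirm the file system version and the fast external attributes setting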

Receiving "Permission denied" message


If you experience a "Permission denied" failure while creating, opening, reading, or writing to a file,
perform the following diagnostic steps:
1. Confirm that the key server is operational and correctly set up and can be accessed through the
network.
2. Confirm that the /var/mmfs/etc/RKM.conf file is present on all nodes from which the file is
supposed to be accessed. The /var/mmfs/etc/RKM.conf file must contain entries for all the RKMs
needed to access the file.
3. Verify that the master keys needed by the file and the keys that are specified in the encryption policies
are present on the key server.
4. Examine the error messages in the /var/adm/ras/mmfs.log.latest file.

"Value too large" failure when creating a file


If you experience a "Value too large to be stored in data type" failure when creating a file, follow these
diagnostic steps.
1. Examine error messages in /var/adm/ras/mmfs.log.latest to confirm that the problem is
related to the extended attributes being too large for the inode. The size of the encryption extended
attribute is a function of the number of keys used to encrypt a file. If you encounter this issue, update
the encryption policy to reduce the number of keys needed to access any given file.
2. If the previous step does not solve the problem, create a new file system with a larger inode size.

Mount failure for a file system with encryption rules


If you experience a mount failure for a file system with encryption rules, follow these diagnostic steps.
1. Confirm that the gpfs.crypto and gpfs.gskit packages are installed.
2. Confirm that the /var/mmfs/etc/RKM.conf file is present on the node and that the content
in /var/mmfs/etc/RKM.conf is correct.
3. Examine the error messages in /var/adm/ras/mmfs.log.latest.



"Permission denied" failure of key rewrap
If you experience a "Permission denied" failure of a key rewrap, follow these diagnostic steps.

When mmapplypolicy is invoked to perform a key rewrap, the command may issue messages like the
following:
[E] Error on gpfs_enc_file_rewrap_key(/fs1m/sls/test4,KEY-d7bd45d8-9d8d-4b85-a803-e9b794ec0af2:hs21n56_new,KEY-40a0b68b-
c86d-4519-9e48-3714d3b71e20:js21n92)
Permission denied(13)

If you receive a message similar to this, follow these steps:


1. Check for syntax errors in the migration policy syntax.
2. Ensure that the new key is not already being used for the file.
3. Ensure that both the original and the new keys are retrievable.
4. Examine the error messages in /var/adm/ras/mmfs.log.latest for additional details.

Authentication issues
This topic describes the authentication issues that you might experience while using file and object
protocols.

File protocol authentication setup issues


When trying to enable Active Directory authentication for file protocols (SMB, NFS), the operation might fail
due to a timeout. In some cases, the DNS server returns multiple IP addresses that cannot all be queried
within the allotted timeout period, or IP addresses that belong to networks that the IBM Storage Scale
nodes cannot reach.
You can try the following workarounds to resolve this issue:
• Remove any invalid/unreachable IPs from the AD DNS.
If you removed any invalid/unreachable IPs, retry the mmuserauth service create command that
previously failed.
• You can also try to disable any adapters that might not be in use.
For example, on Windows 2008: Start -> Control Panel -> Network and Sharing Center -> Change
adapter settings -> Right-click the adapter that you are trying to disable and click Disable
If you disabled any adapters, retry the mmuserauth service create command that previously
failed.
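A sketch of the retried command for AD-based file authentication; every value shown is a placeholder, so
reuse the parameters from your original configuration attempt:

mmuserauth service create --data-access-method file --type ad \
  --servers myad.example.com --netbios-name specluster \
  --user-name administrator --idmap-role master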

Protocol authentication issues


You can use a set of GPFS commands to identify and rectify issues that are related to authentication
configurations.
To do basic authentication problem determination, perform the following steps:
1. Issue the mmces state show auth command to view the current state of authentication.
2. Issue the mmces events active auth command to see whether events are currently contributing
to make the state of the authentication component unhealthy.
3. Issue the mmuserauth service list command to view the details of the current authentication
configuration.
4. Issue the mmuserauth service check -N cesNodes --server-reachability command to
verify the state of the authentication configuration across the cluster.
5. Issue the mmuserauth service check -N cesNodes --rectify command to rectify the
authentication configuration.
Note: Server reachability cannot be rectified by using the --rectify parameter.

Authentication error events
This topic describes how to verify and resolve Authentication errors.
Following is a list of possible events that may cause a node to go into a failed state and possible solutions
for each of the issues. To determine what state a component is in, issue the mmces command.

SSSD/YPBIND process not running (sssd_down)


Cause
The SSSD or the YPBIND process is not running.
Determination
To check the current state of authentication, run the following command:

mmces state show auth

To check the active events for authentication, run the following command:

mmces events active auth

To check the current authentication configuration, run the following command:

mmuserauth service list

To check the current authentication configuration across the cluster, run the following command:

mmuserauth service check -N cesNodes --server-reachability

Solution
Rectify the configuration by running the following command:

mmuserauth service check -N cesNodes --rectify

Note: Server reachability cannot be rectified by using the --rectify flag.

Winbind process not running (wnbd_down)


Cause
The Winbind process is not running.
Determination
Run the same commands as recommended in the previous section, SSSD/YPBIND process not running
(sssd_down).
Solution
Follow the steps in the previous section, SSSD/YPBIND process not running (sssd_down). Then, run the
following commands:

mmces service stop smb -N <Node on which the problem exists>

mmces service start smb -N <Node on which the problem existed>

Nameserver issues related to AD authentication
If the Active Directory (AD) is configured as the authentication method, then each declared nameserver in
the /etc/resolv.conf file is checked for the required entries in the DNS.
The AD servers must have the following entries:
• _ldap._tcp.<Realm>
• _ldap._tcp.dc._msdcs.<Realm>
• _kerberos._tcp.<Realm>
• _kerberos._tcp.dc._msdcs.<Realm>
A missing configuration setting triggers one of the following events:
• dns_ldap_tcp_down
• dns_ldap_tcp_dc_msdcs_down
• dns_krb_tcp_down
• dns_krb_tcp_dc_msdcs_down
These events alert the user that if the AD-enabled nameservers fail, some services might continue to
work, but the AD-authenticated connections stop working.
If the /etc/resolv.conf file also contains non-AD nameservers, then a dns_query_fail event is
triggered.
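As a quick check, the required SRV records can be queried against each nameserver that is listed
in /etc/resolv.conf. This is a sketch in which example.com stands in for your AD realm and
<nameserver-ip> for the nameserver that is being tested:

dig +short SRV _ldap._tcp.example.com @<nameserver-ip>
dig +short SRV _ldap._tcp.dc._msdcs.example.com @<nameserver-ip>
dig +short SRV _kerberos._tcp.example.com @<nameserver-ip>
dig +short SRV _kerberos._tcp.dc._msdcs.example.com @<nameserver-ip>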

Authorization issues
You might receive an unexpected “access denied” error either for native access to file system or for using
the SMB or NFS protocols. Possible steps for troubleshooting the issue are described here.
Note: ACLs used in the object storage protocols are separate from the file system ACLs, and
troubleshooting in that area should be done differently. For more information, see “Object issues” on
page 462.

Verify authentication and ID mapping information


As a first step, verify that authentication and ID mapping are correctly configured. For more information,
see the Verifying the authentication services configured in the system topic in the IBM Storage Scale:
Administration Guide.

Verify authorization limitations


Ensure that Access Control Lists (ACLs) are configured as required by IBM Storage Scale. For more
information, see the Authorization limitation topic in the IBM Storage Scale: Administration Guide. Also,
check for more limitations of the NFSv4 ACLs stored in the file system. For more information, see the
GPFS exceptions and limitations to NFS V4 ACLs topic in the IBM Storage Scale: Administration Guide.

Verify stored ACL of file or directory


Read the native ACL stored in the file system by using this command:

mmgetacl -k native /path/to/file/or/directory

If the output does not report an NFSv4 ACL type in the first line, then consider changing the ACL
to the NFSv4 type. For more information on how to configure the file system for the recommended
NFSv4 ACL type for protocol usage, see the Authorizing file protocol users topic in the IBM Storage
Scale: Administration Guide. Also, review the ACL entries for permissions related to the observed “access
denied” issue.

Note: ACL entries are evaluated in the listed order for determining whether access is granted, and that
the evaluation stops when a “deny” entry is encountered. Also, check for entries that are flagged with
“InheritOnly”, since they do not apply to the permissions of the current file or directory.

Verify group memberships and ID mappings


Next review the group membership of the user and compare that to the permissions granted in the ACL.
If the cluster is configured with Active Directory authentication, then first have the user authenticate and
then check the group memberships of the user. With Active Directory, authentication is the only reliable
way to refresh the group memberships of the user if the cluster does not have the latest and complete list
of group memberships:

/usr/lpp/mmfs/bin/wbinfo -a 'domainname\username'
id 'domainname\username'

If the cluster is configured with a different authentication method, then query the group membership of
the user:

id 'username'

If the user is a member of many groups, compare the number of group memberships with the limitations
that are listed in the IBM Storage Scale FAQ. For more information, see https://www.ibm.com/docs/en/
STXKQY/gpfsclustersfaq.html.
If a group is missing, check the membership of the user in the missing group in the authentication server.
Also, check the ID mapping configuration for that group and check whether the group has an ID mapping
that is configured and if it is in the correct range. You can query the configured ID mapping ranges by
using this command:

/usr/lpp/mmfs/bin/mmuserauth service list

If the expected groups are missing in the output from the ID command and the authentication method
is Active Directory with trusted domains, check the types of the groups in Active Directory. Not all group
types can be used in all Active Directory domains.
If the access issue is sporadic, repeat the test on all protocol nodes. Since authentication and ID mapping
is handled locally on each protocol node, it might happen that a problem affects only one protocol node,
and hence only protocol connections that are handled on that protocol node are affected.

Verify SMB export ACL for SMB export


If the access issue occurs on an SMB export, consider that the SMB export ACL can also cause user
access to be denied. Query the current SMB export ACLs and review whether they are set up as expected
by using this command:

/usr/lpp/mmfs/bin/mmsmb exportacl list

Collect trace for debugging


Collect traces as a last step to determine the cause for authorization issues. When the access problem
occurs for a user using the SMB protocol, capture the SMB trace first while recreating the problem (the
parameter -c is used to specify the IP address of the SMB):

/usr/lpp/mmfs/bin/mmprotocoltrace start smb -c x.x.x.x

Re-create the access denied issue

/usr/lpp/mmfs/bin/mmprotocoltrace stop smb

For analyzing the trace, extract the trace and look for the error code NT_STATUS_ACCESS_DENIED in the
trace.

If the access issue occurs outside of SMB, collect a file system trace:

/usr/lpp/mmfs/bin/mmtracectl –start

Re-create the access denied issue

/usr/lpp/mmfs/bin/mmtracectl --stop

The IBM Security Lifecycle Manager prerequisites cannot be installed

This topic provides troubleshooting references and steps for resolving system errors when the IBM
Security Lifecycle Manager prerequisites cannot be installed.

Description
When the user tries to install the IBM Security Lifecycle Manager prerequisites, the system displays the
following error:

JVMJ9VM011W Unable to load j9dmp24: libstdc++.so.5: cannot open shared


object file: No such file or directory
JVMJ9VM011W Unable to load j9jit24: libstdc++.so.5: cannot open shared
object file: No such file or directory
JVMJ9VM011W Unable to load j9gc24: libstdc++.so.5: cannot open shared
object file: No such file or directory
JVMJ9VM011W Unable to load j9vrb24: libstdc++.so.5: cannot open shared
object file: No such file or directory

Cause
The system displays this error when the system packages are not upgraded.

Proposed workaround
• All system packages must be upgraded, except the kernel, which must stay at the 6.3 level for encryption
to work correctly.
• Update all packages excluding kernel:

yum update --exclude=kernel*

• Modify /etc/yum.conf so that the kernel and release packages remain excluded from updates:

[main]
exclude=kernel* redhat-release*

IBM Security Lifecycle Manager cannot be installed


This topic provides troubleshooting references and steps for resolving system errors when IBM Security
Lifecycle Manager cannot be installed.

Description
When the user tries to install IBM Security Lifecycle Manager, the system displays the following errors:

eclipse.buildId=unknownjava.fullversion=JRE 1.6.0 IBM J9 2.4 Linux x86-32


jvmxi3260sr9-20110203_74623 (JIT enabled, AOT enabled)J9VM -
20110203_074623JIT - r9_20101028_17488ifx3GC - 20101027_AABootLoader
constants: OS=linux, ARCH=x86, WS=gtk, NL=enFramework arguments: -toolId
install -accessRights admin input @osgi.install.area/install.xmlCommand-
line arguments: -os linux -ws gtk -arch x86 -toolId install -accessRights

admin input @osgi.install.area/install.xml!ENTRY com.ibm.cic.agent.ui 4 0
2013-07-09 14:11:47.692!MESSAGE Could not load SWT library.
Reasons:/home/tklm-v3/disk1/im/configuration/org.eclipse.osgi/bundles/207/1/
.cp/libswt-pi-gtk-4234.so (libgtk-x11-2.0.so.0: cannot open shared object
file: No such file or directory)
swt-pi-gtk (Not found in java.library.path)/root/.swt/lib/linux/x86/libswt-
pi-gtk-4234.so (libgtk-x11-2.0.so.0: cannot open shared object file: No
such file or directory)
/root/.swt/lib/linux/x86/libswt-pi-gtk.so (/root/.swt/lib/linux/x86/liblib
swt-pi-gtk.so.so:cannot open shared object file: No such file or directory)"

Cause
The system displays this error when the system packages are not upgraded.

Proposed workaround
• All system packages must be upgraded, except the kernel, which must stay at the 6.3 level for encryption
to work correctly.
• Run through the following checklist before installing IBM Security Lifecycle Manager:

Table 60. IBM Security Lifecycle Manager preinstallation checklist

System components                  Minimum values                     Recommended values
System memory (RAM)                4 GB                               4 GB
Processor speed                    Linux and Windows systems:         Linux and Windows systems:
                                   3.0 GHz single processor;          3.0 GHz dual processors;
                                   AIX and Sun Solaris systems:       AIX and Sun Solaris systems:
                                   1.5 GHz (2-way)                    1.5 GHz (4-way)
Disk space free for IBM Security   5 GB                               5 GB
Key Lifecycle Manager and
prerequisite products such as
DB2®
Disk space free in /tmp or         2 GB                               2 GB
C:\temp
Disk space free in /home           5 GB                               6 GB
directory for DB2
Disk space free in /var directory  512 MB on Linux and UNIX           512 MB on Linux and UNIX
for DB2                            operating systems                  operating systems

Chapter 28. Protocol issues
This topic describes the protocol-related issues (NFS, SMB, and Object) that you might come across while
using IBM Storage Scale.

NFS issues
This topic describes some of the possible problems that can be encountered when GPFS interacts with
NFS.
If you encounter server-side issues with NFS:
1. Identify which NFS server or CES node is being used.
2. Run and review output of mmhealth.
3. Check whether all required file systems are mounted on the node that is being used, including the CES
shared root.
4. Review /var/log/ganesha.log. Messages tagged as CRIT, MAJ, or EVENT are about the state of
the NFS server.
5. Use ganesha_stats utility to monitor the NFS performance.
This utility can capture statistics for the NFS server, for example, statistics for all NFS v3 and NFS v4
operations, per-export statistics, and authentication-related statistics. This utility is not cluster-aware and
provides information only about the NFS server that is running on the node where it is invoked.
When GPFS interacts with NFS, you can encounter the following problems:
• “NFS client with stale inode data” on page 443
• “NFSv4 ACL problems” on page 405

CES NFS failure due to network failure


This topic provides information on how to resolve a CES NFS failure caused by a network failure.
When a network failure occurs because a cable is disconnected, a switch fails, or an adapter fails, CES
NFS I/O operations will not complete. To resolve the failure, run the systemctl restart network
command on the CES node to which the IP is failing back (where the failure occurred). This clears the
client suspension and refreshes the network.

NFS client with stale inode data


The NFS client may have stale inode data due to caching. This topic describes the course of action to
follow to correct this issue.
For performance reasons, some NFS implementations cache file information on the client. Some of the
information (for example, file state information such as file size and timestamps) is not kept up-to-date in
this cache. The client may view stale inode data (on ls -l, for example) if exporting a GPFS file system
with NFS. If this is not acceptable for a given installation, caching can be turned off by mounting the file
system on the client using the appropriate operating system mount command option (for example, -o
noac on Linux NFS clients).
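For example, on a Linux NFS client, a sketch in which the CES IP address and mount point are placeholders
and /mnt/gpfs0/nfs_share1 is the exported path:

mount -t nfs -o noac 192.0.2.10:/mnt/gpfs0/nfs_share1 /mnt/data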
Turning off NFS caching results in extra file system operations to GPFS, and negatively affects its
performance.
The clocks of all nodes in the GPFS cluster must be synchronized. If this is not done, NFS access to the
data, as well as other GPFS file system operations, may be disrupted. NFS relies on metadata timestamps
to validate the local operating system cache. If the same directory is either NFS-exported from more than
one node, or is accessed with both the NFS and GPFS mount point, it is critical that clocks on all nodes

that access the file system (GPFS nodes and NFS clients) are constantly synchronized using appropriate
software (for example, NTP). Failure to do so may result in stale information seen on the NFS clients.

NFS mount issues


This section provides information on how to verify and resolve the different NFS mount error conditions.

Mount times out


PROBLEM
The user is trying to do an NFS mount and receives a timeout error.
DETERMINATION
When a timeout error occurs, perform the following steps.
1. Check whether the server is reachable by using either one or both of the following commands.

ping <server-ip>
ping <server-name>

The expected result is that the server responds.


2. Check whether portmapper, NFS, and mount daemons are running on the server.
a. On an IBM Storage Scale CES node, issue the following command.

mmces service list

The expected results are that the output indicates that the NFS service is running as in this
example:
Enabled services: SMB NFS
SMB is running, NFS is running
b. On the NFS server node, issue the following command:

rpcinfo -p

The expected result is that portmapper, mountd, and NFS are running as shown in the
following sample output.

program vers proto port service
100000 4 tcp 111 portmapper
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 53111 status
100024 1 tcp 58711 status
100003 3 udp 2049 nfs
100003 3 tcp 2049 nfs
100003 4 udp 2049 nfs
100003 4 tcp 2049 nfs
100005 1 udp 59149 mountd
100005 1 tcp 54013 mountd
100005 3 udp 59149 mountd
100005 3 tcp 54013 mountd
100021 4 udp 32823 nlockmgr
100021 4 tcp 33397 nlockmgr
100011 1 udp 36650 rquotad
100011 1 tcp 36673 rquotad
100011 2 udp 36650 rquotad
100011 2 tcp 36673 rquotad

3. Check whether the firewall is blocking NFS traffic on Linux systems by using the following
command on the NFS client and the NFS server.

iptables -L

Then, check whether any hosts or ports that are involved with the NFS connection are blocked
(denied).
If the client and the server are running in different subnets, then a firewall might be running on the
router.
4. Check whether the firewall is blocking NFS traffic on the client or router by using the appropriate
commands.

NFS mount fails with a "No such file or directory" error


PROBLEM
The user is trying to do an NFS mount on Linux and receives this message:
No such file or directory
DETERMINATION
Following might be the root causes of this error.
• Root cause #1 - Access type is none
An NFS export was created on the server without a specified access type. Therefore, for security
reasons, the default access is set to NONE and NFS mounting fails.

Solution
On the NFS server, specify an access type (for example, RW for Read and Write) for export. If
the export is already created, then you can change the access type by using the mmnfs export
change command. See the following example. The backslash (\) is a line continuation character.

mmnfs export change /mnt/gpfs0/nfs_share1 \


--nfschange "*(Access_Type=RW,Squash=NO_ROOT_SQUASH)"

Verification
Verify that the access type is specified for NFS export by using the mmnfs export list command
on the NFS server. For example,

mmnfs export list --nfsdefs /mnt/gpfs0/nfs_share1

The system displays an output similar to this sample output.

Path Delegations Clients Access Protocols Transports Squash Anonymous Anonymous SecType
PrivilegedPort Export Default Manage NFS_Commit
_Type _uid
_gid _id Delegation _Gids
-------------------------------------------------------------------------------------------------------
--------------------------------------------------
/mnt/gpfs0/ none * RW 3,4 TCP NO_ROOT -2 -2 KRB5
FALSE 2 none FALSE FALSE
_share1 _SQUASH

NONE indicates the root cause; the access type is none.


RO or RW indicates that the solution was successful.

• Root cause # 2 - Protocol version that is not supported by the server

Solution

On the NFS server, specify the protocol version that is needed by the client for export (for example,
3:4). If the export already exists, then you can change the export by using the mmnfs export
change command. For example,

mmnfs export change /mnt/gpfs0/nfs_share1 --nfschange "* (Protocols=3:4)"

Verification
Verify the protocols that are specified for the export by using the mmnfs export list command. For
example,

mmnfs export list --nfsdefs /mnt/gpfs0/nfs_share1

The system displays output similar to this:

Path Delegations Clients Access Protocols Transports Squash Anonymous Anonymous SecType PrivilegedPort Default Manage NFS_Commit
_Type _uid _gid Delegation _Gids
---------------------------------------------------------------------------------------------------------------------------------------------------------
/mnt/gpfs0/ none * RW 3,4 TCP NO_ROOT -2 -2 SYS FALSE none FALSE FALSE
nfs_share1 _SQUASH

NFSv4 client cannot mount NFS exports


PROBLEM
The NFS client cannot mount NFS exports. The mount command on the client either returns an error
or times out. Mounting the same export by using NFSv3 succeeds.
DETERMINATION
The export is hidden by a higher-level export. Try mounting the server root / and navigating through the
directories.
SOLUTION
You must not create nested exports, such as /path/to/folder and /path/to/folder/subfolder,
because these exports might lead to serious issues in data consistency. Remove the higher-level export
that prevents the NFSv4 client from descending through the NFSv4 virtual file system path. If
nested exports cannot be avoided, ensure that the export with the common path, called the
top-level export, has all the permissions for this NFSv4 client. Also, an NFSv4 client that mounts the
parent /path/to/folder export cannot see the child export subtree /path/to/folder/inside/
subfolder unless the same client is explicitly allowed to access the child export as well.
VERIFICATION
1. Ensure that the NFS server is running correctly on all of the CES nodes and that the CES IP address
used to mount is active in the CES cluster. Check the CES IP address and the NFS server status by
using the following command.

mmlscluster --ces
mmces service list -a

2. Ensure that the firewall allows NFS traffic to pass through. To allow the NFS traffic, the CES
NFS service must be configured with explicit NFS ports so that discrete firewall rules can be
established. Issue the following command on the client.

rpcinfo -t <CES_IP_ADDRESS> nfs

3. Verify that the NFS client is allowed to mount the export. In NFS terms, a definition exists for this
client for the export to be mounted. Check NFS export details by using the following command.

mmnfs export list --nfsdefs <NFS_EXPORT_PATH>

The system displays output similar to this sample output.

Path Delegations Clients Access_Type Protocols Transports Squash Anonymous_uid Anonymous_gid SecType PrivilegedPort
DefaultDelegations Manage_Gids NFS_Commit
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------
/mnt/gpfs0/nfs_share1 none * RW 3,4 TCP NO_ROOT_SQUASH -2 -2 SYS FALSE none
FALSE FALSE

Issue the following command on the NFSv3 client.

showmount -e <CES_IP_ADDRESS>

Mount the server virtual file-system root / on an NFSv4 client. Navigate through the virtual file
system to the export.
If you have a remote cluster environment with an owning cluster and an accessing cluster, and
the accessing cluster exports the file system of the owning cluster through CES NFS, IP failback
might occur before the remote file systems are mounted. This action can cause I/O failures with
existing CES NFS client mounts and new mount request failures. To avoid I/O failures, stop and
start CES NFS on the recovered node after you run the mmstartup and mmmount <remote FS>
commands. Stop and restart the CES NFS by using the following commands.

mmces service stop nfs


mmces service start nfs

NFS error events


This topic provides information on how to verify and resolve NFS errors.
Following is a list of possible events that might cause a node to go into a failed state and possible
solutions for each of the issues. To determine what state a component is in, run the mmces events
active nfs command.

NFS is not active (nfs_not_active)


Cause
Statistics query indicates that CES NFS is not responding.
Determination
Call the CES NFS statistics command with some delay and compare the NFS server timestamp, then
determine if the NFS operation counts are increasing. Run this command:

/usr/bin/ganesha_stats ; sleep 5 ; /usr/bin/ganesha_stats


Timestamp: Wed Apr 27 19:27:22 2016 34711407 nsecs
Total NFSv3 ops: 0
Total NFSv4.0 ops: 86449
Total NFSv4.1 ops: 0
Total NFSv4.2 ops: 0
Timestamp: Wed Apr 27 19:27:27 2016 87146242 nsecs
Total NFSv3 ops: 0
Total NFSv4.0 ops: 105271
Total NFSv4.1 ops: 0
Total NFSv4.2 ops: 0

Solution
Restart CES NFS on the local CES node by using commands mmces service stop nfs and mmces
service start nfs.

CES NFSD process not running (nfsd_down)


Cause
CES NFS server protocol is no longer running.

Determination
1. Check to see whether the CES NFS daemon is running:

ps -C gpfs.ganesha.nfsd

2. Check whether d-bus is alive. Run:

/usr/bin/ganesha_stats

If either CES NFS or d-bus is down, you will receive an error:

ERROR: Can't talk to ganesha service on d-bus. Looks like Ganesh is down.

Solution
Restart CES NFS on the local CES node by using commands mmces service stop nfs and mmces
service start nfs.

RPC statd process is not running (statd_down)


This applies only if NFS version 3 is enabled in the CES NFS configuration.
Cause
The rpc.statd process is no longer running.
Determination
Check rpc.statd by running:

ps -C rpc.statd

Solution
Restart CES NFS on the local CES node by using commands mmces service stop nfs and mmces
service start nfs.

Portmapper port 111 is not active (portmapper_down)


Cause
RPC call to port 111 failed or timed out.
Determination
Check portmapper output by running:

rpcinfo -n 111 -t localhost portmap


rpcinfo -t localhost nfs 3
rpcinfo -t localhost nfs 4

Solution
Check to see whether portmapper is running and if portmapper (rpcbind) is configured to automatically
start on system startup.
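On systemd-based distributions, a sketch of that check:

systemctl status rpcbind
systemctl enable --now rpcbind    # start rpcbind now and enable it at system startup if needed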

NFS client cannot mount NFS exports from all protocol nodes
Cause
The NFS client can mount NFS exports from some but not all protocol nodes, because the exports are not
seen when doing a showmount against those protocol nodes where this problem surfaces.
Determination

The error itself occurs on the NFS server side and is related to a Red Hat problem with netgroup caching,
which makes caching unreliable.
Solution
Disable caching of netgroups in nscd for AD values. For more information about how to disable nscd
caching, see the nscd.conf man page at https://linux.die.net/man/5/nscd.conf.
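A minimal sketch, assuming that your distribution's nscd supports a per-service cache switch (verify
against the man page referenced above):

# in /etc/nscd.conf, turn off the netgroup cache:
enable-cache netgroup no

# then restart the daemon:
systemctl restart nscd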

The rpc.statd service fails with a "Permission denied" error


Cause
The rpc.statd service fails with a "Permission denied" error when you attempt to create a directory
under the CNFS shared directory. The rpc.statd service runs as an rpcuser user that does not
have the read permission to the CNFS shared directory. Therefore, the user cannot create a file in a
subdirectory under the CNFS shared directory.
Determination
Check messages similar to the following, in the syslog file on server nodes where CNFS or KNFS is
enabled.

rpc.statd[<pid>]: Failed to insert: creating /var/lib/nfs/statd/sm/<client node>:Permission


denied
rpc.statd[<pid>]: STAT_FAIL to <server node> for SM_MON of <client node>
kernel:lockd: cannot monitor <client node>

Solution
On the CNFS shared directory, change the permission to 755 so that the shared directory is readable to all
users.

chmod 755 <CNFS shared directory>


systemctl restart nfs

You might need to reboot a node, if this problem persists after the NFS service restart.
For more information about NFS events, see “Events” on page 559.

NFS error scenarios


This topic provides information on how to verify and resolve NFS errors.

NFS client cannot access exported NFS data


Problem
The NFS client cannot access the exported data even though the export is mounted. This often results in
errors while writing data, creating files, or traversing the directory hierarchy (permission denied).
Determination
The error itself occurs on the NFS client side. Additionally, and based on the nature of the problem, the
server-side NFS logs can provide more details about the origin of the error.
Solution
There are multiple reasons for this problem:
• The ACL definition in the file system does not allow the requested operation.
• The export definition or the client definition of the export does not allow that operation (such as a "read
only" definition).
1. Verify the ACL definition of the export path in the file system. To check ACL definitions, run:

mmgetacl Path

2. Verify the definition of the export and the client (especially the access type). To check the NFS export
details, run:

mmnfs export list -n Path

3. Unmount and remount the file system on the NFS client:

umount <Path>
mount <mount_options> CES_IP_address:<export_path> <mount_point>

NFS client I/O temporarily stalled


Problem
The NFS client temporarily encounters stalled I/O or access requests to the export. The problem goes
away after a short time (about 1 minute.)
Determination
The error itself occurs on the NFS client side, but due to an action on the NFS server side. The server-side
NFS logs can provide more details about the origin of the error (such as a restart of the NFS server) along
with the CES logs (such as manual move of a CES IP or a failover condition).
Origin
During the grace period, the NFS server might temporarily suspend further access to the export from the
NFS client (depending on the type of request), because certain NFS operations are not allowed while the
grace period is active.
The NFS server enters the grace period due to any one of the following conditions:
1. An explicit restart triggered manually through the CLI by running: mmces service stop /
start ...
2. An explicit move of CES IPs manually through the CLI by running: mmces address move ...
3. A change in the definition of an existing export.
Note: Adding or removing NFS exports does not initiate a restart.
4. The creation of the first export.
5. A critical error condition that triggers CES failover, which in turn causes IP addresses to move.
6. A failback of CES IPs (depending on the setting of the address distribution policy).

Collecting diagnostic data for NFS


This topic describes the procedure for collecting diagnostic data for NFS services.
Diagnostic data can be generated by increasing the logging of the NFS server.
To change the logging temporarily on a single CES node without restarting the server, run the following the
command on the CES node where you want to enable the tracing:

ganesha_mgr set_log COMPONENT_ALL FULL_DEBUG

CES NFS log levels can be user adjusted to select the amount of logging by the server. Every increase in
log setting will add additional messages. So the default of "Event" will include messages tagged as EVENT,
WARN, CRIT, MAJ, FATAL, but will not show INFO, DEBUG, MID_DEBUG, FULL_DEBUG:

Table 61. CES NFS log levels

Log name      Description
NULL          No logging
FATAL         Only asserts are logged
MAJ           Only major events are logged
CRIT          Only critical events are logged where there is a malfunction, that is, for a single request
WARN          Events are logged that may be intended but may otherwise be harmful
EVENT         Default level, which includes some events that are expected during normal operation
              (that is, start, grace period)
INFO          Enhanced level
DEBUG         Further enhanced, which includes events relevant to problem determination
MID_DEBUG     Further enhanced, which includes some events for developers
FULL_DEBUG    Maximal logging, which is mainly used for development purposes

These levels can be applied to a single component or to all components.


Note: The ganesha_mgr command requires that the CES NFS server be active on the node where the
command is executed.
The CES NFS log file (default is /var/log/ganesha.log) will then receive many more updates, eventually
generating very large files or even filling up the disk.
To avoid issues with space usage, revert to the default logging by using the ganesha_mgr set_log
COMPONENT_ALL EVENT command, or reduce the set of components that use "FULL_DEBUG" to
a reasonable subset of server components, for example, by replacing "COMPONENT_ALL" with
"COMPONENT_DISPATCH".
Other possible components can be listed by using ganesha_mgr getall_logs. The ganesha_mgr
changes are not persistent. A server restart will reset these settings to the settings in the cluster
configuration as described in the mmnfs config list command.
Note: The mmnfs config list command shows the persisted log level for all CES nodes (default: EVENT).
Any log setting change made by using the mmnfs config change LOG_LEVEL command automatically
restarts the server, which might prevent you from finding the cause of the current issue. See the mmnfs
topic in IBM Storage Scale: Command and Programming Reference Guide for more information.
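For example, a sketch of persistently raising the log level and then checking it (note that this restarts the
CES NFS server on the CES nodes):

mmnfs config change LOG_LEVEL=DEBUG
mmnfs config list | grep -i log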

NFS startup warnings


The log file contains warnings that are related to parameters removed from Ganesha 2.7.
When Ganesha 2.7 is started and the gpfs.ganesha.main.conf file has NB_Worker,
Dispatch_Max_Reqs, and Dispatch_Max_Reqs_Xprt parameters, the following warnings are
reported in the ganesha.log file:
• config_errs_to_log :CONFIG :WARN :Config File (/var/mmfs/ces/nfs-config/gpfs.ganesha.main.conf:9):
Unknown parameter (Nb_Worker)
• config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:19): Unknown parameter
(Dispatch_Max_Reqs_Xprt)
• config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:20): Unknown parameter
(Dispatch_Max_Reqs)

You can ignore these warnings because Ganesha 2.7 does not support these parameters. Or you can
remove these parameters from the configuration file to avoid the recurrence of warnings at Ganesha 2.7
startup time.

Customizing failover behavior for unresponsive NFS


The NFS monitor raises a nfs_not_active error event if it detects a potential NFS hung situation.
This triggers a failover, which impacts the system's overall performance. However, the NFS might not
be hung in such a situation. It is possible that the NFS is fully working even if it does not react on the
monitor checks at that specific time. In these cases, it would be better to trigger a warning instead of a
failover. The NFS monitor can be configured to send an nfs_unresponsive warning event instead of the
nfs_not_active event if it detects a potential hung situation.
Important: The type of NFS workload that the cluster experiences determines whether the tunable setting
is needed.
The nfs_unresponsive event configuration can be done in the monitor configuration file /var/mmfs/
mmsysmon/mmsysmonitor.conf. In the mmsysmonitor.conf file, the nfs section contains the new
flag failoverunresponsivenfs.
Setting the failoverunresponsivenfs flag to false triggers the WARNING event,
nfs_unresponsive, if the NFS does not respond to NULL requests or has no measurable NFS operation
activity. Setting the warning event instead of an error event ensures that the NFS service is not
interrupted. This allows the system to avoid an unnecessary failover in case the monitor cycles detect
a healthy state again for the NFS later. However, if the NFS is hung, there is no automatic recovery even if
the NFS remains hung for a long time. It is the user's responsibility to check the system and to restart NFS
manually, if needed.
Setting the failoverunresponsivenfs flag to true triggers the ERROR event, nfs_not_active, if
the NFS does not respond to NULL requests or has no measurable NFS operation activity. Setting the error
event instead of a warning event ensures that a failover is triggered when the system detects a potential
NFS hung situation. However, if the flag is set to true, a failover might be triggered even if the NFS server
is not hung, but just overloaded.
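As a sketch, the relevant entry in /var/mmfs/mmsysmon/mmsysmonitor.conf might look like the
following; the exact stanza layout is an assumption, so verify it against the file installed on your system:

[nfs]
failoverunresponsivenfs = false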
Note: You must restart the system health monitor by using the mmsysmoncontrol restart command
to make the changes effective.

SMB issues
This topic describes SMB-related issues that you might come across while using the IBM Storage Scale
system.

Determining the health of integrated SMB server


There are some IBM Storage Scale commands to determine the health of the SMB server.
The following commands can be used to determine the health of SMB services:
• To check the overall CES cluster state, issue the following command:

mmlscluster --ces

The system displays output similar to this:


GPFS cluster information
========================
GPFS cluster name: boris.nsd001st001
GPFS cluster id: 3992680047366063927

Cluster Export Services global parameters


-----------------------------------------
Shared root directory: /gpfs/fs0
Enabled Services: NFS SMB
Log level: 2
Address distribution policy: even-coverage

Node Daemon node name IP address CES IP address list
-----------------------------------------------------------------------
4 prt001st001 172.31.132.1 10.18.24.25 10.18.24.32 10.18.24.34 10.18.24.36
9.11.102.89
5 prt002st001 172.31.132.2 9.11.102.90 10.18.24.19 10.18.24.21 10.18.24.23
10.18.24.30
6 prt003st001 172.31.132.3 10.18.24.38 10.18.24.39 10.18.24.41 10.18.24.42
9.11.102.43
7 prt004st001 172.31.132.4 9.11.102.37 10.18.24.26 10.18.24.28 10.18.24.18
10.18.24.44
8 prt005st001 172.31.132.5 9.11.102.36 10.18.24.17 10.18.24.33 10.18.24.35
10.18.24.37
9 prt006st001 172.31.132.6 9.11.102.41 10.18.24.24 10.18.24.20 10.18.24.22
10.18.24.40
10 prt007st001 172.31.132.7 9.11.102.42 10.18.24.31 10.18.24.27 10.18.24.29
10.18.24.43

This shows at a glance whether nodes are failed or whether they host public IP addresses. For
successful SMB operation at least one CES node must be HEALTHY and hosting at least one IP address.
• To show which services are enabled, issue the following command:

mmces service list

The system displays output similar to this:

Enabled services: NFS SMB


NFS is running, SMB is running

For successful SMB operation, SMB needs to be enabled and running.


• To determine the overall health state of SMB on all CES nodes, issue the following command:

mmces state show SMB -a

The system displays output similar to this:

NODE SMB
prt001st001 HEALTHY
prt002st001 HEALTHY
prt003st001 HEALTHY
prt004st001 HEALTHY
prt005st001 HEALTHY
prt006st001 HEALTHY
prt007st001 HEALTHY

• To show the reason for a currently active (failed) state on all nodes, issue the following command:

mmces events active SMB -a

The system displays output similar to this:

NODE COMPONENT EVENT NAME SEVERITY DETAILS

In this case nothing is listed because all nodes are healthy and so there are no active events. If a node
was unhealthy it would look similar to this:

NODE COMPONENT EVENT NAME SEVERITY DETAILS


prt001st001 SMB ctdb_down ERROR CTDB process not running
prt001st001 SMB smbd_down ERROR SMBD process not running

• To show the history of events generated by the monitoring framework, issue the following command:

mmces events list SMB

The system displays output similar to this:

NODE TIMESTAMP EVENT NAME SEVERITY DETAILS


prt001st001 2015-05-27 14:15:48.540577+07:07MST smbd_up INFO SMBD process now
running
prt001st001 2015-05-27 14:16:03.572012+07:07MST smbport_up INFO SMB port 445 is now
active
prt001st001 2015-05-27 14:28:19.306654+07:07MST ctdb_recovery WARNING CTDB Recovery
detected

prt001st001 2015-05-27 14:28:34.329090+07:07MST ctdb_recovered INFO CTDB Recovery
finished
prt001st001 2015-05-27 14:33:06.002599+07:07MST ctdb_recovery WARNING CTDB Recovery
detected
prt001st001 2015-05-27 14:33:19.619583+07:07MST ctdb_recovered INFO CTDB Recovery
finished
prt001st001 2015-05-27 14:43:50.331985+07:07MST ctdb_recovery WARNING CTDB Recovery
detected
prt001st001 2015-05-27 14:44:20.285768+07:07MST ctdb_recovered INFO CTDB Recovery
finished
prt001st001 2015-05-27 15:06:07.302641+07:07MST ctdb_recovery WARNING CTDB Recovery
detected
prt001st001 2015-05-27 15:06:21.609064+07:07MST ctdb_recovered INFO CTDB Recovery
finished
prt001st001 2015-05-27 22:19:31.773404+07:07MST ctdb_recovery WARNING CTDB Recovery
detected
prt001st001 2015-05-27 22:19:46.839876+07:07MST ctdb_recovered INFO CTDB Recovery
finished
prt001st001 2015-05-27 22:22:47.346001+07:07MST ctdb_recovery WARNING CTDB Recovery
detected
prt001st001 2015-05-27 22:23:02.050512+07:07MST ctdb_recovered INFO CTDB Recovery
finished

• To retrieve monitoring state from health monitoring component, issue the following command:

mmces state show

The system displays output similar to this:

NODE AUTH NETWORK NFS OBJECT SMB CES


prt001st001 DISABLED HEALTHY HEALTHY DISABLED DISABLED HEALTHY

• To check the monitor log, issue the following command:

grep smb /var/adm/ras/mmsysmonitor.log | head -n 10

The system displays output similar to this:


2016-04-27T03:37:12.2 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:37:27.2 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:37:42.3 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:37:57.2 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:38:12.4 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:38:27.2 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:38:42.5 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:38:57.2 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:39:12.2 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596
2016-04-27T03:39:27.6 prt2st1 I Monitor smb service LocalState:HEALTHY Events:0 Entities:0 -
Service.monitor:596

• The following logs can also be checked:

/var/adm/ras/*
/var/log/messages

File access failure from an SMB client with sharing conflict


The SMB protocol includes sharemodes, which are locks that can be held on the whole file while a SMB
client has the file open. The IBM Storage Scale file system also enforces these sharemode locks for other
access.
File systems that are exported with the CES SMB service need to be configured with the -D nfs4
flag to provide the support for SMB sharemodes. Otherwise the SMB server will report a sharing
violation on any file access, since the file system cannot grant the sharemode. This is also indicated
by a message in the /var/adm/ras/log.smbd log, similar to this: Samba requested GPFS
sharemode for /ibm/fs1/smb_sharemode/ALLOW_N but the GPFS file system is
not configured accordingly. Configure file system with mmchfs -D nfs4 or set
gpfs:sharemodes=no in Samba.
The other case is that SMB access to a file is denied with a sharing violation, because there is currently
concurrent access to the same file that prevents the sharemode from being granted. One way to debug
this issue would be to recreate the access while capturing a SMB protocol trace. For more information,
see “CES tracing and debug data collection” on page 287. If the trace contains the message "GPFS
share mode denied for ...", then concurrent access is denied due to the file being open in the file
system with a conflicting access. For more information, see the Multiprotocol export considerations topic
in the IBM Storage Scale: Administration Guide.

SMB client on Linux fails with an "NT status logon failure"


This topic describes how to verify and resolve an "NT status logon failure" on the SMB client on Linux.
Description
The user is trying to log on to the SMB client using AD authentication on Linux and receives this message:
NT STATUS LOGON FAILURE

Following are the root causes of this error.


Description of Root cause #1
The user is trying to log on to the SMB client using AD authentication on Linux and receives this message:
Password Invalid

Cause
The system did not recognize the specified password.
Verification
Verify the password by running the following command on an IBM Storage Scale protocol node:

/usr/lpp/mmfs/bin/wbinfo -a '<domain>\<user>'

The expected result is that the following messages display:


plaintext password authentication succeeded.
challenge/response password authentication succeeded.

If this message displays:


plaintext password authentication failed.
Could not authenticate user USER with plain text password
the password for that user was not specified correctly.
Resolution
To resolve the error, enter the correct password.
If you do not know the correct password, follow your IT procedures to request a new password.
Description of root cause # 2
The user is trying to log on to the SMB client using AD authentication on Linux and receives this message:

The Userid is not recognized

Cause

The system did not recognize the specified user ID.


Verification
Verify the user ID by running the following command on an IBM Storage Scale protocol node:

/usr/lpp/mmfs/bin/wbinfo -a '<domain>\<user>'

The expected result is that the following messages display:

plaintext password authentication succeeded.

challenge/response password authentication succeeded

If this message displays:

Could not authenticate user USER with challenge/response password

the specified user is not known by the system.


Resolution
To resolve the error, enter the correct userid.
If you think the correct user was specified, contact your IT System or AD Server administrator to get your
userid verified.

SMB client on Linux fails with the NT status password must change
error message
This topic describes how to verify and resolve an NT status password must change error on the
SMB client on Linux.
Description
The user is trying to access the SMB client on Linux and receives this error message:
NT_STATUS_PASSWORD_MUST_CHANGE

Cause
The specified password expired.
Verification
Verify the password by running the following command on an IBM Storage Scale protocol node:

/usr/lpp/mmfs/bin/wbinfo -a '<domain>\<user>'

The expected result is that the following messages display:


plaintext password authentication succeeded.
challenge/response password authentication succeeded.

If this message displays:


Could not authenticate user mzdom\aduser1 with challenge/response
the specified password probably expired.
Resolution
Log on to a Windows client, and when prompted, enter a new password. If the problem persists, ask the
AD administrator to unlock the account.

SMB mount issues


This topic describes how to verify and resolve SMB mount errors.
Possible SMB mount error conditions include:
• mount.cifs on Linux fails with mount error (13) "Permission denied"
• mount.cifs on Linux fails with mount error (127) "Key expired"
• Mount on Mac fails with an authentication error.

If you receive any of these errors, verify your authentication settings. For more information, see “Protocol
authentication issues” on page 436

Mount.Cifs on Linux fails with mount error (13) "Permission denied"


Description
The user is trying to mount CIFS on Linux and receives the following error message:
Permission Denied

The root causes for this error are the same as for “SMB client on Linux fails with an NT status logon
failure” on page 455.

Mount.Cifs on Linux fails with mount error (127) "Key has expired"
Description
The user is trying to access a CIFS share and receives the following error message:
key has expired

The root causes for this error are the same as for “SMB client on Linux fails with an NT status logon
failure” on page 455.

Mount on Mac fails with an authentication error


Description
The user is attempting a mount on a Mac and receives this error message:
mount_smbfs: server rejected the connection: Authentication error

The root causes for this error are the same as for “SMB client on Linux fails with an NT status logon
failure” on page 455.

Net use on Windows fails with "System error 86"


This topic describes how to verify and solve a "System error 86" when the user is attempting to access net
use on Windows.
Description
While accessing the network the following error message displays:
System error 86 has occurred.
The specified password is not correct.

Solution
The root causes for this error are the same as that for the failure of SMB client on Linux. For more
information on the root cause, see “SMB client on Linux fails with an NT status logon failure” on page 455.

Net use on Windows fails with "System error 59" for some users
This topic describes how to resolve a "System error 59" when some users attempt to access /usr/lpp/
mmfs/bin/net use on Windows.
Description:
Additional symptoms include
NT_STATUS_INVALID_PARAMETER
errors in the log.smbd file when net use command was invoked on the Windows client for the user with
this problem.
Solution:

Invalid idmapping entries in gencache might be the cause. To resolve the error, delete these entries
in gencache on all nodes. Run the following commands: /usr/lpp/mmfs/bin/net cache del
IDMAP/UID2SID/<UID> and net cache del IDMAP/SID2XID/<SID>. You can run the mmadquery
command to know the <UID> and the <SID>. Alternatively, you can find the <SID> from the log.smbd
file. See the following message
Could not convert sid <SID>: NT_STATUS_INVALID_PARAMETER
in the log.smbd file. Here, <SID> is the SID of the user.

Winbindd causes high CPU utilization


This topic describes the issues that can happen due to the winbindd component.
Cause
One possible reason is that winbind is not able to find domain controllers for a given domain.
NT_STATUS_NO_LOGON_SERVERS is seen in log file log.winbindd-dc-connect in that case. One
possible issue here is that the DNS does not provide this information. Usually the local DCs have to be
configured as DNS servers on the protocol nodes, as AD stores additional information for locating DCs in
the DNS.
Solution
The problem is also known to go away after upgrading to IBM Storage Scale 4.2.2 or higher versions.

SMB error events


This topic describes how to verify and resolve SMB errors.

CTDB process is not running (ctdb_down)


Cause
CTDB process is not running.
Determination
Check /var/log/messages for CTDB error messages or crashes.
Solution
Fix any obvious issues and run this command:

mmces service stop SMB


mmces service start SMB

CTDB recovery detected (ctdb_recovery)


Cause
CTDB status is stuck in Recovery mode for an extended amount of time.
Determination
If the service status is Degraded for a while, there is an issue. The service status should be Transient.
Check the logs for a possible issue.
Solution
Run:

mmces service stop SMB && mmces service start SMB

If still not fixed, run:

gpfs.snap

and contact IBM support.

CTDB state is not healthy (ctdb_state_down)


Determination
1. Check /var/log/messages for errors and correct any that you find.
2. Check CTDB status by running the ctdb status command.
3. Check the network connectivity.
Solution
After the error is resolved, the CTDB node should recover. If you have not resolved the error, restart SMB
by running this command:

mmces service stop SMB && mmces service start SMB

SMBD process not running


Determination
1. Check /var/log/messages and /var/adm/ras/log.smbd for errors and correct if found.
2. Restart by running this command:

mmces service stop SMB && mmces service start SMB

SMB port (?) is not active (smbport_down_)


Cause
The SMB port (?) is not listening for connections.
Determination
Check the network connectivity.
Solution
Restart by running:

mmces service stop SMB && mmces service start SMB

SMB access issues


This topic describes how to analyze and resolve SMB access issues.

Analyzing SMB access issues


The most common issue with ACLs is getting an unexpected Access Denied message. Check the
following points:
1. ID Mapping: Check whether a user that gets an Access Denied message, has proper ID mappings:
• Username <-> user SID <-> uid
• Primary group name <-> group SID <-> gid
• And the same for all other groups the user must be in.
2. Export ACLs: Check whether the share allows access for the logged in user by using the MMC tool or
the mmsmb exportacl command.
3. File system object ACLs: Use the Windows Explorer ACL dialog or use the mmgetacl command to
make sure the correct ACLs are in place on all components in the path.

4. Make sure that the READ_ATTR is set correctly on the folders to be traversed.
5. Make sure that even when the READ_NAMED and WRITE_NAMED are not enforced by the file system,
the SMB server enforces them.
6. Export settings: Check the export settings by running mmsmb export list --all so that export
options like read only = no or available = no do not restrict access.
7. Make sure your clients try to negotiate a supported protocol level.
8. For smbclient: Make sure the option -m SMB2 is used and supported by your version of smbclient
(smbclient -L localhost -U<user>%<password> -m SMB2).
9. Windows XP, Windows Server 2003 and older Windows versions are not supported because they
support only SMB1.
10. For the Linux kernel client, make sure you check the version option to use smb2.
Note: For known issues in the Linux kernel client, see the documentation for your Linux distribution.
If the root cause cannot be narrowed down, then the following steps help to do a detailed analysis.
1. Provide exact information about what happened.
2. Provide screen captures of Windows ACL dialogs.
3. Provide the output of mmgetacl for all files and folders that are related to the ACL or permission
problem.
4. You can force reconnect by stopping the smbd process that serves that connection.
5. Describe how the user mounted the export.
6. List all users and groups that are in the test along with their memberships.
7. Collect export information by running: mmsmb export list --all.
8. Provide the version of Windows used for each client.
9. Provide a Samba level 10 trace for the test by running the mmprotocoltrace command.
10. Provide IBM Storage Scale traces for the test by running the mmtracectl --start and --stop
commands.
11. Collect the network trace of the re-create by running the mmprotocoltrace command.

Resolving SMB access issues


Check the health status of all protocol nodes as incoming SMB connections can be handled by any
protocol node. The health status of all protocol nodes can be checked by using the following command:

mmhealth node show -N CesNodes

If GPFS, network, or file system are reported as DEGRADED, then investigate the issue and fix the problem.
In addition, you can also check the /var/adm/ras/log.smbd log file on all protocol nodes.
An entry of vfs_gpfs_connect: SMB share fs1, path /ibm/fs1 not in GPFS file
system. statfs magic: 0x58465342 in the log file indicates that the SMB share path does not
point to a GPFS file system or that the file system is not mounted. If the file system is not mounted, then
you must mount the file system again on the affected nodes.
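A sketch of checking and remounting; fs1 and the node name are placeholders:

mmlsmount fs1 -L                # list the nodes on which the file system is mounted
mmmount fs1 -N prt001st001      # mount it again on the affected protocol node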

Slow access to SMB caused by contended access to files or directories


This topic describes the reason behind the slow access to SMB server and the troubleshooting steps to
handle it.
If the access through the SMB server is slower than expected, then there might be an issue with the
highly contended access to the same file or directory through the SMB server. This happens because
of the internal record keeping process of the SMB server. The internal record keeping process requires
that the record for each open file or directory must be transferred to different protocol nodes for every
open and close operation, which at times, overloads the SMB server. This delay in access is experienced

in extreme cases, where many clients are opening and closing the same file or directory. However, note
that concurrent access to the same file or directory is handled correctly in the SMB server and it usually
causes no problems.
The following procedure can help track the files or directories behind the contended records, which CTDB tracks in the database statistics. When a "hot" record is detected, it is recorded in the database statistics and a message is printed to syslog.
When this message refers to the locking.tdb database, it can point to the problem of concurrent access to the same file or directory. The same reference might be seen in the ctdb dbstatistics output for locking.tdb:

# ctdb dbstatistics locking.tdb


DB Statistics locking.tdb
db_ro_delegations 0
db_ro_revokes 0
locks
num_calls 15
num_current 0
num_pending 0
num_failed 0
db_ro_delegations 0
hop_count_buckets: 139 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0
lock_buckets: 0 9 6 0 0 0 0 0 0 0 0 0 0 0 0 0
locks_latency MIN/AVG/MAX 0.002632/0.016132/0.061332 sec out of 15
vacuum_latency MIN/AVG/MAX 0.000408/0.003822/0.082142 sec out of 817
Num Hot Keys: 10
Count:1 Key: 6a4128e3ced4681b017c0600000000000000000000000000
Count:0 Key:
Count:0 Key:
Count:0 Key:
Count:0 Key:
Count:0 Key:
Count:0 Key:
Count:0 Key:
Count:0 Key:
Count:0 Key:

When ctdb points to a hot record in locking.tdb, then use the "net tdb locking" command to determine the
file behind this record:

# /usr/lpp/mmfs/bin/net tdb locking 6a4128e3ced4681b017c0600000000000000000000000000


Share path: /ibm/fs1/smbexport
Name: testfile
Number of share modes: 2

If this happens on the root directory of an SMB export, then a workaround can be to exclude that from
cross-node locking:

mmsmb export change smbexport --option fileid:algorithm=fsname_norootdir

Note: fsname_norootdir is set by default.


If this happens on files, the recommendation would be to access that SMB export only through one CES IP
address, so that the overhead of transferring the record between the nodes is avoided.
If the SMB export contains only subdirectories with home directories, where the subdirectory names match the user names, the recommended configuration is an SMB export that uses the %U substitution to automatically map each user to the corresponding home directory:

mmsmb export add smbexport /ibm/fs1/%U
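The detection steps that are described above can be combined into a short check. The following is a minimal sketch, assuming passwordless remote shell access for mmdsh; the key value is the one from the example output above:

/usr/lpp/mmfs/bin/mmdsh -N CesNodes "ctdb dbstatistics locking.tdb | grep -A 3 'Num Hot Keys'"   # hot records per node
/usr/lpp/mmfs/bin/net tdb locking 6a4128e3ced4681b017c0600000000000000000000000000              # resolve a hot key to a file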

CTDB issues
CTDB is a database layer for managing SMB and Active Directory specific information and provides it
consistently across all CES nodes.
CTDB requires network connections to TCP port 4379 between all CES nodes. Internally, CTDB elects
a recovery master among all available CTDB nodes. The elected node then acquires a lock on a
recovery lock file in the CES shared root file system to ensure that no other CES node tries to do

the same in the event of a network problem. The CTDB recovery lock was introduced with IBM Storage Scale 5.0.5.
If there is a problem with SMB or Active Directory integration or a specific CTDB problem is reported in
the health check, the following steps must be taken:
1. Check the status of CTDB on all CES nodes:

/usr/lpp/mmfs/bin/mmdsh -N CesNodes -f1 /usr/lpp/mmfs/bin/ctdb status

If a status is reported as DISCONNECTED, ensure that all the CES nodes are up and running and
network connections to TCP port 4379 are allowed.
If a status is reported as BANNED, check the log files.
2. Check the CTDB log files on all nodes:
CTDB logs to the standard syslog. The default syslog file name varies among the Linux distributions, for example:

/var/log/messages

/var/log/syslog

or the journalctl command can be used to show the system messages.


The following message sequence indicates that a node was unable to acquire the recovery lock:

ctdb-recoverd[28458]: Unable to take recovery lock - contention


ctdb-recoverd[28458]: Unable to take recovery lock
ctdb-recoverd[28458]: Abort recovery, ban this node for 300 seconds
ctdb-recoverd[28458]: Banning node 3 for 300 seconds

This usually indicates a communication problem between CTDB on different CES nodes. Check the
node local firewall settings, any network firewalls, and routing to ensure that connections to TCP port
4379 are possible between the CES nodes.
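Connectivity to port 4379 can be verified quickly from each CES node. The following is a minimal sketch, assuming that the nc and ss utilities are installed and that ces1 and ces2 are placeholders for the other CES node names:

ss -ltn | grep 4379                            # confirm that ctdbd is listening locally
for peer in ces1 ces2; do                      # test the TCP connection to each peer CES node
    nc -z -w 3 $peer 4379 && echo "$peer: port 4379 reachable" || echo "$peer: port 4379 blocked"
done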

smbd start issue


In a configuration where the current LDAP server is decommissioned, but the mmuserauth configuration was not changed to remove the service, smbd does not start because it cannot contact the decommissioned LDAP server. smbd starts only when it can reach the LDAP server. Authentication also cannot be removed because removing it requires a healthy Samba.
To solve this issue, complete the following steps:
1. Remove the LDAP configuration from the samba backend.

/usr/lpp/mmfs/bin/net conf delparm global 'passdb backend'

2. Start smbd.
3. After Samba is started, remove the authentication configuration and create the authentication again.

# mmuserauth service remove
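Before you remove and re-create the authentication in step 3, you can confirm that SMB is running again on all protocol nodes. A minimal sketch, assuming the SMB component name that mmces reports on your release:

/usr/lpp/mmfs/bin/mmces service list -a        # the SMB service must be listed as running
/usr/lpp/mmfs/bin/mmces state show SMB -a      # the SMB state must be HEALTHY on all CES nodes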

Object issues
The following information describes some of the Object-related issues that you might come across while
using IBM Storage Scale.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.

• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.

Getting started with troubleshooting object issues


Use the following checklists to troubleshoot object issues.

Checklist 1
This checklist must be referred to before using an object service.
1. Check the cluster state by running the mmgetstate -a command.
The cluster state must be Active.
2. Check the status of the CES IPs by running the mmlscluster --ces command.
The system displays all the CES nodes along with their assigned IP addresses.
3. Check the service states of the CES by running the mmces state show -a or mmhealth node
show ces -N cesnodes command.
The overall CES state and object service states must be Healthy.
4. Check the service listing of all the service states by running the mmces service list --verbose command.
5. Check the authentication status by running the mmuserauth service check command.
6. Check the object auth listing by running the source $HOME/openrc ; openstack user list
command.
The command lists the users in the OpenStack Keystone service.
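The checks in Checklist 1 can be run in sequence from any protocol node. The following is a minimal sketch; the openrc file location is the one created during object deployment and might differ on your system:

/usr/lpp/mmfs/bin/mmgetstate -a                          # cluster state must be active
/usr/lpp/mmfs/bin/mmlscluster --ces                      # CES nodes and their assigned IP addresses
/usr/lpp/mmfs/bin/mmhealth node show ces -N cesnodes     # overall CES and object service states
/usr/lpp/mmfs/bin/mmces service list --verbose           # detailed service listing
/usr/lpp/mmfs/bin/mmuserauth service check               # authentication status
source $HOME/openrc && openstack user list               # users known to the Keystone service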

Checklist 2
This checklist must be referred to before using the keystone service.
1. Check if object authentication has been configured by running the mmuserauth service list
--data-access-method object command.
2. Check the state of object authentication by running the mmces state show AUTH_OBJ -a
command.
3. Check if the protocol node is serving the CES IP by running the mmlscluster --ces command.
4. Check if the object_database_node tag is present on one of the CES IPs by running the mmces address list command.
5. Check if httpd is running on all the CES nodes and postgres is running on the node that has CES IP with
the object_database_node tag by running the mmces service list -v -a command.
6. Check if authentication configuration is correct on all nodes by running the mmuserauth service
check --data-access-method object -N cesNodes command.
7. If the mmuserauth service check reports an error, run the mmuserauth service check --data-
access-method object --rectify -N <node> command where node is the number of the node
on which the error is reported.
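Similarly, the checks in Checklist 2 can be run together. A minimal sketch:

/usr/lpp/mmfs/bin/mmuserauth service list --data-access-method object
/usr/lpp/mmfs/bin/mmces state show AUTH_OBJ -a
/usr/lpp/mmfs/bin/mmlscluster --ces
/usr/lpp/mmfs/bin/mmces address list | grep object_database_node
/usr/lpp/mmfs/bin/mmces service list -v -a
/usr/lpp/mmfs/bin/mmuserauth service check --data-access-method object -N cesNodes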

Authenticating the object service
Refer to the following troubleshooting references and steps for resolving system errors when you are
authenticating the object service.

Description
When you authenticate or run any create, update, or delete operation, the system displays one of the
following errors:

{"error": {"message": "An unexpected error prevented the server from fulfilling your request.",
"code": 500, "title": "Internal Server Error"}}

ERROR: openstack An unexpected error prevented the server from fulfilling your request.
(HTTP 500)(Request-ID: req-11399fd1-a601-4615-8f70-6ba275ec3cd6)

Cause
The system displays this error under one or all three of the following conditions:
• The authentication service is not running.
• The system is unable to reach the authentication server.
• The user credentials for Keystone have been changed or have expired.

Proposed workaround
• Finish the steps in Checklist 1.
• Make sure that the IP addresses of the Keystone endpoints are correct and reachable. If you are using a
local Keystone, check if the postgresql-obj service is running.
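The following minimal sketch shows how these checks can look on a protocol node. The Keystone port 35357 is taken from the examples in this chapter, and <keystone_host> is a placeholder; adjust both if your deployment differs:

openstack endpoint list                                                        # verify the Keystone endpoint addresses
curl -sk -o /dev/null -w "%{http_code}\n" https://<keystone_host>:35357/v3/    # a 2xx response means the identity API answers
/usr/lpp/mmfs/bin/mmces service list -v -a | grep -i postgres                  # postgresql-obj must run on the database node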

Authenticating or using the object service


Refer to the following troubleshooting references and steps for resolving system errors when
authenticating or using the object service.

Description
When the user is authenticating the object service or running the create, update, retrieve, and delete
operations, the system displays the following error:

Error: {"error": {"message": "The request you have made requires authentication.",
"code": 401, "title": "Unauthorized"}}

Cause
The system displays this error under one or both of the following conditions:
• The password, user ID, or service ID entered is incorrect.
• The token being used has expired.

Proposed workaround
• Check your user ID and password. The user IDs in the system can be viewed in the OpenStack user list.
• Check to make sure a valid service ID is provided in the /etc/swift/proxy-server.conf file, in the
filter:authtoken section. Also, check if the password for the service ID is still valid. The service ID
can be viewed in the OpenStack service, project, and endpoint lists.

Accessing resources
This topic provides troubleshooting references and steps for resolving system errors when you are
accessing resources.

Description
When an unauthorized user is accessing an object resource, the system displays the following error:
Error: Error: HTTP/1.1 403 Forbidden
Content-Length: 73 Content-Type: text/html; charset=UTF-8 X-Trans-Id:
tx90ad4ac8da9242068d111-0056a88ff0 Date: Wed, 27 Jan 09:37:52 GMT
<html><h1>Forbidden</h1><p>Access was denied to this resource.</p>

Cause
The system displays this error under one or all of the following conditions:
• The user is not authorized by the system to access the resources for a certain operation.
• The endpoint, authentication URL, service ID, keystone version, or API version is incorrect.

Proposed workaround
• To gain authorization and access the resources, contact your system administrator.
• Check your service ID. The service ID can be viewed in the OpenStack service, project, and endpoint
lists.

Connecting to the object services


Refer to the following troubleshooting references and steps for resolving system errors when you are
connecting to the object services.

Description
When the user is unable to connect to the object services, the system displays the following error:

curl: (7) Failed connect


to specscale.ibm.com:8080; No route to host

Cause
The system displays this error because of one or both of the following conditions:
• The firewall is running.
• The firewall rules are configured incorrectly.

Proposed workaround
Set up the firewall rules correctly in your system.
For more information, see Installation prerequisites in IBM Storage Scale: Concepts, Planning, and
Installation Guide.
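If firewalld is in use, the following minimal sketch shows how to check and open the object port. Port 8080 is taken from the error message above and might differ in your configuration:

firewall-cmd --list-ports                        # check whether 8080/tcp is already open
firewall-cmd --permanent --add-port=8080/tcp     # open the object proxy-server port
firewall-cmd --reload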

Creating a path
This topic provides troubleshooting references and steps for resolving system errors when you are
creating a path.

Description
While you perform a create, update, retrieve, or delete task, if you attempt to create a non-existent path
the system displays the following error:

Error: HTTP/1.1 404 Not Found


Content-Length: 70 Content-Type: text/html; charset=UTF-8 X-Trans-Id:
tx88ec3b783bc04b78b5608-0056a89b52 Date: Wed, 27 Jan 10:26:26
GMT <html><h1>Not Found</h1><p>The resource could not be found.</p></html>

Cause
The system displays this error because the path you are creating does not exist.

Proposed workaround
Recreate the object or the container before you perform the GET operation.

Constraints for creating objects and containers


Refer to the following constraints that need to be kept in mind while creating objects and containers.

Description
When the user is trying to create objects and containers for unified file and object access, the system
displays the 400 Bad request error.

Cause
The system displays this error under one or all five of the following conditions:
• The name of the container is longer than 255 characters.
• The name of the object is longer than 214 characters.
• The name of any container in the object hierarchy is longer than 214 characters.
• The path name of the object includes successive forward slashes (/).
• The name of the container and the object is a single period (.) or a double period (..).

Proposed workaround
Keep in mind the following constraints while creating objects and containers for unified file and object
access:
• Make the name of the container no more than 255 characters.
• Make the name of the object no more than 214 characters.
• Make the name of any container in the object hierarchy no more than 214 characters.
• Do not include multiple consecutive forward slashes (///) in the path name of the object.
• Do not make the name of the container or the object a single period (.) or a double period (..). However, a
single period or a double period can be part of the name of the container and the object.

The Bind password is used when the object authentication configuration has
expired
Refer to the following troubleshooting references and steps for resolving system errors when you use the
Bind password and the object authentication configuration has expired.

Description
When object is configured with the AD/LDAP authentication and the bind password is being used for LDAP
communication, the system displays the following error:
[root@SSClusterNode3 ~]# openstack user list
ERROR: openstack An unexpected error prevented the server from fulfilling
your request. (HTTP 500) (Request-ID: req-d2ca694a-31e3-46cc-98b2-93556571aa7d)
Authorization Failure. Authorization failed: An unexpected error prevented
the server from fulfilling your request. (HTTP 500) (Request-ID: req-d6ccba54-
baea-4a42-930e-e9576466de3c)

Cause
The system displays this error when the Bind password has been changed on the AD/LDAP server.

Proposed workaround
1. Get the new password from the AD or LDAP server.
2. Run the following command on any protocol node to update the password and restart Keystone:

mmobj config change --ccrfile keystone.conf --section ldap --property password --value
'<password>'

The value for <password> must be the new password that is obtained in Step 1.
Note: This command restarts Keystone on the protocol nodes.

The password used for running the keystone command has expired or is
incorrect
Refer to the following troubleshooting references and steps for resolving system errors when you are
using an expired or incorrect password for running the keystone command.

Description
When you try to run the keystone command by using a password that has expired or is incorrect, the
system displays the following error: [root@specscale ~]# openstack user list
ERROR: openstack The request you have made requires authentication. (HTTP 401)
(Request-ID: req-9e8d91b6-0ad4-42a8-b0d4-797a08150cea)

Cause
The system displays this error when you have changed the password but are still using the expired password to access Keystone.

Proposed workaround
Use the correct password to access Keystone.

The LDAP server is not reachable
Refer to the following troubleshooting references and steps for resolving system errors when attempting
to reach an LDAP server.

Description
When object authentication is configured with AD/LDAP and the user is trying to run the keystone
commands, the system displays the following error:[root@specscale ~]# openstack user list
ERROR: openstack An unexpected error prevented the server from fulfilling your
request. (HTTP 500) (Request-ID: req-d3fe863e-da1f-4792-86cf-bd2f4b526023)

Cause
The system displays this error under one or all of the following conditions:
• Network issues make the LDAP server unreachable.
• The system firewall is running so the LDAP server is not reachable.
• The LDAP server is shut down.
Note:
When the LDAP server is not reachable, the keystone logs can be viewed in the /var/log/keystone
directory.
The following example is an LDAP error found in /var/log/keystone/keystone.log:
/var/log/keystone/keystone.log:2016-01-28 14:21:00.663 25720 TRACE
keystone.common.wsgi result = func(*args,**kwargs)2016-01-28 14:21:00.663 25720
TRACE keystone.common.wsgi SERVER_DOWN: {'desc': "Can't contact LDAP server"}.

Proposed workaround
• Check your network settings.
• Configure your firewall correctly.
• Repair the LDAP server.
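Basic reachability checks from a protocol node can narrow down which of these conditions applies. A minimal sketch, where ldapserver.example.com and port 389 (or 636 for LDAPS) are placeholders for your directory server:

ping -c 3 ldapserver.example.com            # basic network reachability
nc -z -w 3 ldapserver.example.com 389       # TCP connectivity to the LDAP port
tail -20 /var/log/keystone/keystone.log     # look for SERVER_DOWN errors as shown above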

The TLS certificate has expired


Refer to the following troubleshooting references and steps for resolving system errors when the
Transport Layer Security (TLS) certificate expires.

Description
You might want to configure object authentication with Active Directory (AD) or Lightweight Directory
Access Protocol (LDAP) by using the TLS certificate for configuration. When you configure object
authentication with AD or LDAP, the system displays the following error:

[E] Failed to execute command


ldapsearchldap_start_tls: Connect error (-11)additional info: TLS error -8174:security library
: bad database.mmuserauth service create: Command failed.
Examine previous error messages to determine cause.

Cause
The system displays this error because the TLS certificate has expired.

Proposed workaround
1. Update the TLS certificate on the AD/LDAP server.
2. Rerun the command.

The TLS CACERT certificate has expired


This topic provides troubleshooting references and steps for resolving system errors when the Transport
Layer Security (TLS) CACERT certificate has expired.

Description
You can configure the system with Active Directory (AD) or Lightweight Directory Access Protocol (LDAP)
and TLS. When you configure the system this way:
• The TLS CACERT expires after configuration.
• The user is trying to run the keystone command.
The system displays the following error:

[root@specscale ~]# openstack user list


ERROR: openstack An unexpected error prevented the server from fulfilling your request.
(HTTP 500) (Request-ID: req-dfd63d79-39e5-4c4a-951d-44b72e8fd9ef)
Logfile /var/log/keystone/keystone.log2045-01-14 10:50:40.809 30518
TRACE keystone.common.wsgi CONNECT_ERROR:
{'info': "TLS error -8162:The certificate issuer's certificate has expired.
Check your system date and time.", 'desc': 'Connect error'}

Note:
The log files for this error can be viewed in /var/log/keystone/keystone.log.

Cause
The system displays this error because the TLS CACERT certificate has expired.

Proposed workaround
1. Obtain the updated TLS CACERT certificate on the system.
2. Rerun the object authentication command.
Note:
If you run the following command while doing the workaround steps, you might lose existing data:

--idmapdelete

The TLS certificate on the LDAP server has expired


This topic provides troubleshooting references and steps for resolving system errors when the Transport
Layer Security (TLS) certificate on the Lightweight Directory Access Protocol (LDAP) server expires.

Description
You can configure the system with AD or LDAP by using TLS. If the certificate on AD or LDAP expires, the
system displays the following error when the user is trying to run the Keystone commands:

[root@specscale ~]# openstack user list


ERROR: openstack An unexpected error prevented the server from fulfilling your request.
(HTTP 500) (Request-ID: req-5b3422a1-fc43-4210-b092-1201e38b8cd5)2017-05-08 22:08:35.443 30518
TRACE keystone.common.wsgi CONNECT_ERROR: {'info': 'TLS error -8157:Certificate extension not
found.',

'desc': 'Connect error'}
2017-05-08 22:08:35.443 30518 TRACE keystone.common.wsgi

Cause
The system displays this error because the TLS certificate on the LDAP server has expired.

Proposed workaround
Update the TLS certificate on the LDAP server.

The SSL certificate has expired


This topic provides troubleshooting references and steps for resolving system errors when the Secure
Sockets Layer (SSL) certificate has expired.

Description
When object authentication is configured with SSL and you try to run the authentication commands, the
system displays the following error:

[root@specscale ~]# openstack user list


ERROR: openstack SSL exception connecting to https://SSCluster:35357/v3/auth/tokens:
[Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate
verify failed

Cause
The system displays this error because the SSL certificate has expired. The user may have used the same
certificate earlier for keystone configuration, but now the certificate has expired.

Proposed workaround
1. Remove the object authentication.
2. Reconfigure the authentication with the new SSL certificate.
Note:
Do not run the following command during removing and reconfiguring the authentication:

mmuserauth service remove --data-access-method object --idmapdelete

Users are not listed in the OpenStack user list


Refer to the following troubleshooting references and steps for resolving system errors when the user is
not listed in the OpenStack user list.

Description
When the authentication type is Active Directory (AD) or Lightweight Directory Access Protocol (LDAP),
users are not listed in the OpenStack user list.

Cause
The system displays this error under one or both of the following conditions:
• Only the users under the specified user distinguished name (DN) are visible to Keystone.
• The users do not have the specified object class.

Proposed workaround
Change the object authentication, or modify the AD or LDAP entries for the users who do not have the specified object class.

The error code signature does not match when using the S3 protocol
Refer to the following troubleshooting references and steps for resolving system errors when the error
code signature does not match.

Description
When there is an error code signature mismatch, the system displays the following error:
<?xml version="1.0" encoding="UTF-8"?><Error> <Code>SignatureDoesNotMatch</
Code> <Message>The request signature we calculated does not match the
signature you provided. Check your key and signing method.</Message>
<RequestId>tx48ae6acd398044b5b1ebd-005637c767</RequestId></Error>

Cause
The system displays this error when the specified user ID does not exist, when the user ID does not have the defined credentials, or when no role is assigned to the account.

Proposed workaround
• For role assignments, review the output of these commands to identify the role assignment for the
affected user:
– openstack user list
– openstack role assignment list
– openstack role list
– openstack project list
• For credential issues, review the credentials assigned to that user ID:
– openstack credential list
– openstack credential show <ID>

The swift-object-info output does not display


Refer to the following troubleshooting references and steps for resolving errors when the Swift object command (swift-object-info) displays no output.

The swift-object-info command displays no output


On IBM Storage Scale clusters with SELinux enabled and enforced, the system displays no output for the
Swift object command (swift-object-info).

Cause
The default setting of the daemons_use_tty Boolean prevents the Swift object command output from being displayed.

Proposed workaround
Allow daemons to use TTY (teletypewriter) by running the following command:

setsebool -P daemons_use_tty 1

After this change, the output of the Swift object command (swift-object-info) displays correctly.
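You can verify the SELinux Boolean before and after the change. A minimal sketch:

getsebool daemons_use_tty          # shows daemons_use_tty --> off before the change
setsebool -P daemons_use_tty 1     # persistent change, as described above
getsebool daemons_use_tty          # must now show daemons_use_tty --> on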

Swift PUT returns the 202 error and S3 PUT returns the 500 error due to the
missing time synchronization
Refer to the following troubleshooting references and steps for resolving system errors when Swift PUT
returns the 202 error and S3 PUT returns the 500 error due to the missing time synchronization.

Description
The swift object servers require monotonically increasing timestamps on the PUT requests. If the time is not synchronized across all the nodes, the PUT request can be rejected, resulting in the object server returning a 409 status code that is turned into 202 in the proxy-server. When the s3api middleware
receives the 202 code, it returns a 500 to the client. When enabling DEBUG logging, the system displays
the following message:
From the object server:
Feb 9 14:41:09 prt001st001 object-server: 10.0.5.6
- - [09/Feb/2016:21:41:09 +0000] "PUT /z1device119/14886/
AUTH_bfd953e691c4481d8fa0249173870a56/mycontainers12/myobjects407"
From the proxy server:
Feb 9 14:14:10 prt003st001 proxy-server: Object PUT returning
202 for 409: 1455052450.83619 <= '409 (1455052458.12105)' (txn:
txf7611c330872416aabcc1-0056ba56a2) (client_ip:
If S3 is used, the following error is displayed from Swift3:
Feb 9 14:25:52 prt005st001 proxy-server: 500 Internal Server Error:
#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/
site-packages/swift3/middleware.py", line 81, in __call__#012 resp =
self.handle_request(req)#012 File "/usr/lib/python2.7/site-packages/swift3/
middleware.py", line 104, in handle_request#012 res = getattr(controller,
req.method)(req)#012 File "/usr/lib/python2.7/site-packages/swift3/controllers/
obj.py", line 97, in PUT#012 resp = req.get_response(self.app)#012
File "/usr/lib/python2.7/site-packages/swift3/request.py", line 825, in
get_response#012 headers, body, query)#012 File "/usr/lib/python2.7/site-
packages/swift3/request.py", line 805, in get_acl_response#012 app,
method, container, obj, headers, body, query)#012 File "/usr/lib/python2.7/
site-packages/swift3/request.py", line 669, in _get_response#012 raise
InternalError('unexpected status code %d' % status)#012InternalError: 500
Internal Server Error (txn: tx40d4ff7ca5b94b1bb6881-0056ba5960) (client_ip:
10.0.5.1) Feb 9 14:25:52 prt005st001 proxy-server: 500 Internal
Server Error: #012Traceback (most recent call last):#012 File "/usr/lib/
python2.7/site-packages/swift3/middleware.py", line 81, in __call__#012 resp
= self.handle_request(req)#012 File "/usr/lib/python2.7/site-packages/swift3/
middleware.py", line 104, in handle_request#012 res = getattr(controller,
req.method)(req)#012 File "/usr/lib/python2.7/site-packages/swift3/controllers/
obj.py", line 97, in PUT#012 resp = req.get_response(self.app)#012
File "/usr/lib/python2.7/site-packages/swift3/request.py", line 825, in
get_response#012 headers, body, query)#012 File "/usr/lib/python2.7/site-
packages/swift3/request.py", line 805, in get_acl_response#012 app,
method, container, obj, headers, body, query)#012 File "/usr/lib/python2.7/
site-packages/swift3/request.py", line 669, in _get_response#012 raise
InternalError('unexpected status code %d' % status)#012InternalError: 500

Internal Server Error (txn: tx40d4ff7ca5b94b1bb6881-0056ba5960) (client_ip:
10.0.5.1)

Cause
The system displays these errors when the time is not synchronized.

Proposed workaround
• To check whether this problem is occurring, run the following command:

mmdsh date

• Enable the NTPD service on all protocol nodes and synchronize the time from a Network Time Protocol
(NTP) server.
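A minimal sketch of the time check and fix, assuming that chronyd is the time service on your protocol nodes (substitute ntpd if that is what your distribution uses):

/usr/lpp/mmfs/bin/mmdsh -N CesNodes date                                       # timestamps should agree within a second
/usr/lpp/mmfs/bin/mmdsh -N CesNodes "systemctl enable --now chronyd"           # start and enable the time service
/usr/lpp/mmfs/bin/mmdsh -N CesNodes "chronyc tracking | grep 'System time'"    # verify the remaining offset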

Unable to generate the accurate container listing by performing a GET operation for unified file and object access container
Refer to the following information for troubleshooting references and steps for resolving system errors when your system is unable to generate an accurate container listing with a GET operation for unified file and object access containers.

Description
The accurate container listing for a unified file or an object access-enabled container is not displayed on
the system.

Cause
This error occurs under one or both of the following conditions:
• A longer time is taken to update and display the listing because the ibmobjectizer interval is too long.
• Objectization is not supported for the files that you create on the file system.

Proposed workaround
Tune the ibmobjectizer interval configuration by running the following command to set up the
objectization interval:

mmobj config change --ccrfile spectrum-scale-objectizer.conf \


--section DEFAULT --property objectization_interval --value 2400

This command sets an interval of 40 minutes between the completion of an objectization cycle and the
start of the next cycle.
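You can confirm the new interval afterward. A minimal sketch, assuming the mmobj config list form of the command on your release:

/usr/lpp/mmfs/bin/mmobj config list --ccrfile spectrum-scale-objectizer.conf \
    --section DEFAULT --property objectization_interval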

Fatal error in object configuration during deployment


Refer to the following troubleshooting references and steps for resolving unrecoverable system errors in
object configuration during deployment.

Description
When you enable object by using installation toolkit, the system displays the following error:

[ FATAL ] Required option 'endpoint_hostname' missing in section:


'object'. To set this, use: ./spectrumscale config object --endpoint

[ FATAL ] Invalid configuration for setting up Object Store.

Cause
The system displays this error when you enable object without the necessary parameters.

Proposed workaround
Run the spectrumscale config obj command with the mandatory arguments.
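A minimal sketch of the fix, run from the installer directory; cesnode.example.com is a placeholder for the endpoint host name, and the --precheck option is assumed to be available in your version of the installation toolkit:

./spectrumscale config object --endpoint cesnode.example.com   # supply the missing endpoint
./spectrumscale deploy --precheck                              # re-validate the configuration before deploying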

Object authentication configuration fatal error during deployment


Refer to the following troubleshooting references and steps for resolving unrecoverable system errors in
object authentication configuration during deployment.

Description
When you configure the authentication by using the installation toolkit, the system displays the following
error:
2016-02-16 13:48:07,799 [ FATAL ] <nodename> failure whilst: Configuring object
authentication (SS98)

Cause
The system displays this error under one or both of the following conditions:
• Only the users under the specified user DN are visible to Keystone.
• The users do not have the specified object class.

Proposed workaround
You can change the object authentication or modify the AD or LDAP for anyone who has the specified
object class.

Unrecoverable error in object authentication during deployment


Refer to the following troubleshooting references and steps for resolving unrecoverable errors in object
authentication during deployment.

Description
When the user configures authentication by using installation toolkit, the system displays the following
error:
02-16 13:48:07,799 [ FATAL ] <nodename> failure whilst: Configuring object
authentication (SS98)

Cause
The system displays this error under one or all three of the following conditions:
• IBM Storage Scale for the object storage program is running.
• Parameters that are provided in the configuration.txt and authconfig.txt files are incorrect.
• The system is unable to connect to the authentication server with the given credentials, or there are network issues.

Proposed workaround
1. Shut down IBM Storage Scale for the object storage program before continuing.
2. Check the connectivity of protocol nodes with the authentication server by using valid credentials.

3. Stop the service manually with the mmces service stop obj -a command.
4. Manually run the mmuserauth service create command to finish the authentication configuration
for object.
5. Fix the configuration.txt and authconfig.txt files and rerun the IBM Storage Scale
deployment with the spectrumscale deploy command.

Chapter 29. Disaster recovery issues
As with any type of problem or failure, obtain the GPFS log files (mmfs.log.*) from all nodes in the
cluster and, if available, the content of the internal dumps.
For more information, see:
• The Data mirroring and replication topic in the IBM Storage Scale: Administration Guide for detailed
information about GPFS disaster recovery
• “Creating a master GPFS log file” on page 258
• “Information to be collected before contacting the IBM Support Center” on page 555
The following two messages might appear in the GPFS log for active/active disaster recovery scenarios
with GPFS replication. The purpose of these messages is to record quorum override decisions that are
made after the loss of most of the disks:
6027-435 [N]
The file system descriptor quorum has been overridden.
6027-490 [N]
The descriptor replica on disk diskName has been excluded.
A message similar to these appears in the log on the file system manager node every time it reads the file system descriptor with an overridden quorum:

...
6027-435 [N] The file system descriptor quorum has been overridden.
6027-490 [N] The descriptor replica on disk gpfs23nsd has been excluded.
6027-490 [N] The descriptor replica on disk gpfs24nsd has been excluded.
...

For more information on node override, see the section on Quorum in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
For PPRC and FlashCopy®-based configurations, more problem determination information can be
collected from the ESS log file. This information and the appropriate ESS documentation must be referred to while working with various types of disk subsystem-related failures. For instance, if users are unable to
perform a PPRC failover (or failback) task successfully or unable to generate a FlashCopy of a disk volume,
they should consult the subsystem log and the appropriate ESS documentation. For more information, see
the following topics:
• IBM TotalStorage™ Enterprise Storage Server® Web Interface User's Guide (publibfp.boulder.ibm.com/
epubs/pdf/f2bui05.pdf).

Disaster recovery setup problems


The following setup problems might impact disaster recovery implementation:
1. Considerations of data integrity require proper setup of PPRC consistency groups in PPRC
environments. Additionally, when using the FlashCopy facility, make sure to suspend all I/O activity
before generating the FlashCopy image. See “Data integrity” on page 404.
2. In certain cases, it might not be possible to restore access to the file system even after relaxing
the node and disk quorums. For example, in a three failure group configuration, GPFS tolerates and
recovers from a complete loss of a single failure group (and the tiebreaker with a quorum override).
However, all disks in the remaining failure group must remain active and usable in order for the file
system to continue its operation. A subsequent loss of at least one of the disks in the remaining failure
group would render the file system unusable and trigger a forced unmount. In such situations, users
might still be able to perform a restricted mount and attempt to recover parts of their data from the
damaged file system. For more information on restricted mounts, see “Restricted mode mount” on
page 318.

3. When you issue mmfsctl syncFSconfig, you might get an error similar to the following:

mmfsctl: None of the nodes in the peer cluster can be reached

In such scenarios, check the network connectivity between the peer GPFS clusters and verify their
remote shell setup. This command requires full TCP/IP connectivity between the two sites, and all
nodes must be able to communicate by using ssh or rsh without the use of a password.
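A minimal sketch of the connectivity check, where peernode1 and peernode2 are placeholders for the contact nodes that are defined for the peer cluster:

# Verify password-less ssh access to each peer contact node
for n in peernode1 peernode2; do
    ssh -o BatchMode=yes $n date || echo "cannot reach $n"
done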

Other problems with disaster recovery


You might encounter the following issues that are related to disaster recovery in IBM Storage Scale:
1. Currently, users are advised to always specify the all option when they issue the mmfsctl syncFSconfig command, rather than the device name of one specific file system. Issuing this
command enables GPFS to detect and correctly resolve the configuration discrepancies that might
occur as a result of the manual administrative action in the target GPFS cluster to which the
configuration is imported.
2. The optional SpecFile parameter to the mmfsctl syncFSconfig command that is specified with the -S flag must be a fully qualified path name that defines the location of the spec data file on nodes in the target
cluster. It is not the local path name to the file on the node from which the mmfsctl command is
issued. A copy of this file must be available at the provided path name on all peer contact nodes that
are defined in the RemoteNodesFile.

Chapter 30. Performance issues
The performance issues might occur because of the system components or configuration or maintenance
issues.

Issues caused by the low-level system components


This section discusses the issues caused by the low-level system components used in the IBM Storage
Scale cluster.

Suboptimal performance due to high utilization of the system level components
In some cases, the CPU or memory utilization on an IBM Storage Scale node is higher than 90%. Such
heavy utilization can adversely impact the system performance as it affects the cycles allocated to the
IBM Storage Scale daemon service.

Problem identification
On the node, issue an Operating System command such as top or dstat to verify whether the system
level resource utilization is higher than 90%. The following example shows the sample output for the
dstat command:
# dstat 1 10

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--


usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 100 0 0 0|7308k 9236k| 0 0 | 0 0 | 812 3691
0 0 100 0 0 0| 0 0 |3977B 1038B| 0 0 | 183 317
1 2 98 0 0 0| 0 0 |2541B 446B| 0 0 | 809 586
0 1 99 0 0 0| 0 0 |4252B 346B| 0 0 | 427 405
0 0 100 0 0 0| 0 0 |3880B 346B| 0 0 | 196 349
0 0 100 0 0 0| 0 0 |3594B 446B| 0 0 | 173 320
1 1 98 0 0 0| 0 0 |3969B 446B| 0 0 | 692 662
0 0 100 0 0 0| 0 116k|3120B 346B| 0 0 | 189 312
0 0 100 0 0 0| 0 0 |3050B 346B| 0 0 | 209 342
0 0 100 0 0 0| 0 4096B|4555B 346B| 0 0 | 256 376
0 0 100 0 0 0| 0 0 |3232B 346B| 0 0 | 187 340

Problem resolution and verification


If the system level resource utilization is high, determine the process or application that contributes to the
performance issue and take appropriate action to minimize the utilization to an acceptable level.

Suboptimal performance due to long IBM Storage Scale waiters


Low-level system issues, like slow disks, or slow network, might cause long GPFS waiters. These long
waiters cause performance degradation. You can use the mmdiag --waiters command to display the
mmfsd threads waiting for events. This information can help resolve deadlocks and improve the system
performance.

Problem identification
On the node, issue the mmdiag --waiters command to check whether any long waiters are present.
The following example shows long waiters that are contributed by the slow disk, dm-14:

#mmdiag --waiters

0x7FF074003530 waiting 25.103752000 seconds, WritebehindWorkerThread: for I/O completion on disk dm-14
0x7FF088002580 waiting 30.025134000 seconds, WritebehindWorkerThread: for I/O completion on disk dm-14

Problem resolution and verification


Resolve any system-level or software issues that exist. When you verify that no system or software issues are present, issue the mmdiag --waiters command again to verify whether any long waiters exist. One possible reason for long waiters, among many, can be that the Samba lock directory has been configured to be located in GPFS.
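To get a cluster-wide view, the waiters from every node can be collected in one pass. A minimal sketch, assuming passwordless remote shell access for mmdsh; because mmdsh prefixes each output line with the node name, the wait time appears in the fourth column:

/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmdiag --waiters" | sort -k4 -rn | head -20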

Suboptimal performance due to networking issues caused by faulty system components
The system might face networking issues, like significant network packet drops or packet errors, due to
faulty system components like NIC, drivers, cables and network switch ports. This can impact the stability
and the quality of the GPFS communication between the nodes, degrading the system performance.

Problem identification and verification


If IBM Storage Scale is configured over TCP/IP network interfaces like 10GigE or 40GigE, you can use the netstat -in and ifconfig <GPFS_iface> commands to confirm whether any significant TX/RX packet errors or drops are happening.
In the following example, 152326889 TX packets are dropped for the networking interface corresponding to the ib0 device:
# netstat -in

Kernel Interface table


Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
ib0 65520 157606763073 0 0 0 165453186948 0 152326889 0 BMRU

#ifconfig ib0

ib0 Link encap:InfiniBand HWaddr


80:00:00:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.1.100 Bcast:192.168.1.255
Mask:255.255.255.0
inet6 addr: fe80::f652:1403:10:bb72/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:157606763073 errors:0 dropped:0 overruns:0 frame:0
TX packets:165453186948 errors:0 dropped:152326889 overruns:0
carrier:0

Problem resolution and verification


Resolve low-level networking issues like a bad NIC cable or an improper driver setting. If possible, shut down GPFS on the node with networking issues until the low-level networking problem is resolved. This is done so that GPFS operations on other nodes are not impacted. Issue the netstat -in command to verify that the networking issues are resolved. Issue the mmstartup command to start GPFS on the node again.
Monitor the network interface to ensure that it is operating optimally.
In the following example, no packet errors or drops corresponding to the ib0 network interface exist.
# netstat -in

Kernel Interface table


Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
ib0 65520 0 313534358 0 0 0 301875166 0 0 0 BMRU

#ifconfig ib0

ib0 Link encap:InfiniBand HWaddr


80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.168.3.17 Bcast:10.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::211:7500:78:a42a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:313534450 errors:0 dropped:0 overruns:0 frame:0
TX packets:301875212 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:241364128830 (224.7 GiB) TX bytes:197540627923 (183.9 GiB)
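Driver-level counters can also confirm that the interface is now clean. A minimal sketch, assuming that ethtool is installed and that ib0 is the interface in question:

ethtool -S ib0 | grep -iE 'err|drop' | grep -v ': 0$'     # prints only the non-zero error and drop counters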

Issues caused by the suboptimal setup or configuration of the IBM Storage Scale cluster
This section discusses the issues caused due to the suboptimal setup or configuration of the IBM Storage
Scale cluster.

Suboptimal performance due to unbalanced architecture and improper system level settings
The system performance depends on the IBM Storage Scale cluster architecture components like
servers, network, storage, disks, topology, and balance-factor. The performance is also dependent on
the performance of the low-level components like network, node, and storage subsystems that make up
the IBM Storage Scale cluster.

Problem identification
Verify whether all the layers of the IBM Storage Scale cluster are sized properly to meet the necessary
performance requirements. The things to be considered in the IBM Storage Scale cluster include:
• The servers
• The network connectivity and the number of connections between the NSD client and servers
• The I/O connectivity and number of connections between the servers to the storage controller or
subsystem
• The storage controller
• The disk type and the number of disks in the storage subsystem
In addition, get the optimal values for the low-level system components used in the IBM Storage Scale
stack from the vendor, and verify whether these components are set to their optimal value. The low-level
components must be tuned according to the vendor specifications for better performance.

Problem resolution and verification


It is recommended that the customer involve an IBM Storage Scale architect during the setup to ensure that the underlying layers of the IBM Storage Scale cluster are capable of delivering the necessary I/O performance for the expected I/O workload.

Suboptimal performance due to low values assigned to IBM Storage Scale configuration parameters
Most GPFS configuration parameters have default values. For example, the pagepool attribute defaults
to either one-third of the physical memory on the node or 1 GiB (whichever is smaller), maxMBpS
defaults to 2048 and maxFilesToCache defaults to 4000. However, if the IBM Storage Scale configuration
parameters are explicitly set to values lower than their default values by the user, it can impact the I/O
performance.

Problem identification
On the GPFS node, issue the mmdiag --config command to display and verify the values of the GPFS
configuration parameters. Check whether these values match the optimal values set for IBM Storage
Scale system configuration. For more information on optimal values for configuration parameter see
Tuning Parameters.

Problem resolution and verification


Issue the mmchconfig Attribute=value -i command to set the configuration parameters to their
optimal values based on the best practices followed for an IBM Storage Scale system configuration.
You might need to restart GPFS for certain configuration parameter values to take effect. Issue the
mmshutdown command, followed by the mmstartup command to restart GPFS. Issue the mmdiag --
config command to verify the configuration changes and updates.
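Because mmdiag --config marks values that differ from their defaults with an exclamation mark, a quick way to review only the non-default settings is the following minimal sketch:

/usr/lpp/mmfs/bin/mmdiag --config | grep '^ *!'     # lists only parameters that are not at their default value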

Suboptimal performance due to new nodes with default parameter values added to the cluster
When new nodes are added to the IBM Storage Scale cluster, ensure that the GPFS configuration
parameter values on the new nodes are not set to default values, unless explicitly set so by the user
based on the GPFS node class. Instead, the GPFS configuration parameter values on the new nodes must
be similar to the values of the existing nodes of similar type for optimal performance. The necessary
system level component settings, like BIOS, network and others on the new nodes, also need to match
the system level component settings of the existing nodes.

Problem identification
The mmlsconfig command can be used to display and verify the configuration values for an IBM Storage
Scale cluster.
Issue the mmdiag --config command on the newly added GPFS nodes to verify whether the
configuration parameter values for the new nodes are same as values for the existing nodes. If the
newly added nodes have special roles or higher capability, then the configuration values must be adjusted
accordingly.
Certain applications like SAS benefit from a larger GPFS page pool. The GPFS page pool is used to cache
user file data and file system metadata. The default size of the GPFS page pool is 1 GiB in GPFS version
3.5 and higher. For SAS application, a minimum of 4 GiB page pool size is recommended. When new SAS
application nodes are added to the IBM Storage Scale cluster, ensure that the pagepool attribute is set
to at least 4 GiB. If left to its default value, the pagepool attribute is set to 1 GiB. This negatively impacts
the application performance.

Problem resolution and verification


Issue the mmchconfig Attribute=value -N <new_nodes> -i command to set the configuration
parameters to either their optimal values based on the best practices, or values similar to the existing
nodes. It might be necessary to restart the GPFS daemon for the values to take effect. Issue the
mmshutdown command, followed by the mmstartup command to restart the GPFS daemon. Verify
the changes by running the mmlsconfig on a node that is part of the GPFS cluster, and the mmdiag
--config command on the new nodes.
The following sample output shows that the value for the pagepool attribute on the existing application
nodes c25m3n03-ib and c25m3n04-ib is set to 2G.
Note: Here, Application node refers to NSD or SAN GPFS client nodes where applications are executed. These nodes have the GPFS RPMs installed for good performance.

#mmlsconfig

[c25m3n03-ib,c25m3n04-ib]
pagepool 2G

If you add new application nodes c25m3n05-ib and c25m3n06-ib to the cluster, the pagepool
attribute and other GPFS parameter values for the new node must be set according to the corresponding
parameter values for the existing nodes c25m3n03-ib and c25m3n04-ib. Therefore, the pagepool
attribute on these new nodes must also be set to 2G by using the mmchconfig command.

mmchconfig pagepool=2G -N c25m3n05-ib,c25m3n06-ib -i

Note: The -i option specifies that the changes take effect immediately and are permanent. This option is
valid only for certain attributes. For more information on block allocation, see the mmchconfig command
in the IBM Storage Scale: Command and Programming Reference Guide.

Issue the mmlsconfig command to verify whether all the nodes have similar values. The following
sample output shows that all the nodes have pagepool attribute set to 2G:

[c25m3n03-ib,c25m3n04-ib,c25m3n05-ib,c25m3n06-ib]
pagepool 2G

Note: If the pagepool attribute is set to a custom value (2G for this example), then the pagepool attribute value is listed when you issue the mmlsconfig command. If the pagepool attribute is set to the default value (1G), then the value is listed when you issue the mmlsconfig pagepool command.
On the new node, issue the mmdiag --config command to verify that the new values are in effect. The
sample output displays that the pagepool attribute value has been effectively set to 2G for the nodes
c25m3n03-ib, c25m3n04-ib,c25m3n05-ib, c25m3n06-ib:

! pagepool 2147483648

Note: The exclamation mark (!) in front of the parameter denotes that the value of this parameter was set by the user, and is not the default value for the parameter.

Suboptimal performance due to low value assigned to QoSIO operation classes
If the Quality of Service for I/O (QoSIO) feature is enabled on the file system, verify whether any of the storage pools are assigned low values for the other and maintenance classes. Assigning low values for the other and maintenance classes can impact the performance when I/O is performed on that specific storage pool.

Problem identification
On the GPFS node, issue the mmlsqos <fs> command and check the other and maintenance class
settings. In the sample output below, the maintenance class IOPS for datapool1 storage pool is set
to 200 IOPS, and the other class IOPS for datapool2 storage pool is set to 400 IOPS. This IOPS value
might be low for an environment with high performing storage subsystem.
# mmlsqos gpfs1b

QOS config:: enabled --


pool=*,other=inf,maintenance=inf:pool=datapool1,other=inf,maintenance=200Iops:pool=datapool2,oth
er=400Iops,maintenance=inf
QOS values::
pool=system,other=inf,maintenance=inf:pool=datapool1,other=inf,maintenance=200Iops:pool=datapool
2,other=400Iops,maintenance=inf
QOS status:: throttling active, monitoring
active

Problem resolution and verification
On the GPFS node, issue the mmchqos command to change the QoS values for a storage pool in the file
system. Issue the mmlsqos command to verify whether the changes are reflected in the QoS settings.
For example, if the IOPS corresponding to datapool2 other class must be set to unlimited then issue
the following command.
mmchqos gpfs1b --enable pool=datapool2,other=unlimited
Issue the # mmlsqos gpfs1b command to verify whether the change is reflected.
# mmlsqos gpfs1b
QOS config:: enabled -- pool=*,other=inf,maintenance=inf:pool=datapool1,other=inf,maintenance=200Iops:pool=datapool2,other=inf,maintenance=inf
QOS values:: pool=system,other=inf,maintenance=inf:pool=datapool1,other=inf,maintenance=200Iops:pool=datapool2,other=inf,maintenance=inf
QOS status:: throttling active, monitoring active

Suboptimal performance due to improper mapping of the file system NSDs to the NSD servers
The NSDs in a file system need to be optimally assigned to the NSD servers so that the client I/O is
equally distributed across all the NSD servers. For example, consider a file system with 10 NSDs and
2 NSD servers. The NSD-to-server mapping must be done in such a way that each server acts as the
primary server for 5 of the NSDs in the file system. If the NSD-to-server mapping is unbalanced, it can
result in hot spots in one or more of the NSD servers. Presence of hot spots within a system can cause
performance degradation.

Problem identification
Issue the mmlsnsd command, and verify that the primary NSD server allocated to a file system is evenly
distributed.
Note: The primary server is the first server listed under the NSD server column for a particular file
system.
On the NSD client, issue the mmlsdisk <fs> -m command to ensure that the NSD client I/O is
distributed evenly across all the NSD servers.
In the following sample output, all the NSDs are assigned to the same primary server c80f1m5n03ib0.
# mmlsnsd

File system Disk name NSD servers


---------------------------------------------------------------------------
gpfs2 Perf2a_NSD01 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD02 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD03 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD04 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD05 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD06 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD07 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD08 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD09 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD10 c80f1m5n03ib0,c80f1m5n02ib0

In this case, all the NSD client I/O for the gpfs2 file system are processed by the single NSD server
c80f1m5n03ib0, instead of being equally distributed across both the NSD servers c80f1m5n02ib0 and
c80f1m5n03ib0. This can be verified by issuing the mmlsdisk <fs> -m command on the NSD client,
as shown in the following sample output:
# mmlsdisk gpfs2 -m

Disk name IO performed on node Device Availability


------------ ----------------------- ----------------- ------------
Perf2a_NSD01 c80f1m5n03ib0 - up
Perf2a_NSD02 c80f1m5n03ib0 - up
Perf2a_NSD03 c80f1m5n03ib0 - up
Perf2a_NSD04 c80f1m5n03ib0 - up

Perf2a_NSD05 c80f1m5n03ib0 - up
Perf2a_NSD06 c80f1m5n03ib0 - up
Perf2a_NSD07 c80f1m5n03ib0 - up
Perf2a_NSD08 c80f1m5n03ib0 - up
Perf2a_NSD09 c80f1m5n03ib0 - up
Perf2a_NSD10 c80f1m5n03ib0 - up

Problem resolution and verification


If the NSD-to-primary mapping is unbalanced, issue the mmchnsd command to balance the NSD
distribution across the NSD servers. Issue the mmlsnsd command or the mmlsdisk <fs> -m command
on the NSD client to ensure that the NSD distribution across the servers is balanced.
In the following sample output, there are 10 NSDs in the gpfs2 file system. The NSDs are evenly
distributed between the two servers, such that both servers, c80f1m5n03ib0 and c80f1m5n02ib0, act as primary servers for 5 NSDs each.
# mmlsnsd

File system Disk name NSD servers


---------------------------------------------------------------------------
gpfs2 Perf2a_NSD01 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD02 c80f1m5n02ib0,c80f1m5n03ib0
gpfs2 Perf2a_NSD03 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD04 c80f1m5n02ib0,c80f1m5n03ib0
gpfs2 Perf2a_NSD05 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD06 c80f1m5n02ib0,c80f1m5n03ib0
gpfs2 Perf2a_NSD07 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD08 c80f1m5n02ib0,c80f1m5n03ib0
gpfs2 Perf2a_NSD09 c80f1m5n03ib0,c80f1m5n02ib0
gpfs2 Perf2a_NSD10 c80f1m5n02ib0,c80f1m5n03ib0

The NSD client I/O is also evenly distributed across the two NSD servers, as seen in the following sample
output:
# mmlsdisk gpfs2 -m

Disk name IO performed on node Device Availability


------------ ----------------------- ----------------- ------------
Perf2a_NSD01 c80f1m5n03ib0 - up
Perf2a_NSD02 c80f1m5n02ib0 - up
Perf2a_NSD03 c80f1m5n03ib0 - up
Perf2a_NSD04 c80f1m5n02ib0 - up
Perf2a_NSD05 c80f1m5n03ib0 - up
Perf2a_NSD06 c80f1m5n02ib0 - up
Perf2a_NSD07 c80f1m5n03ib0 - up
Perf2a_NSD08 c80f1m5n02ib0 - up
Perf2a_NSD09 c80f1m5n03ib0 - up
Perf2a_NSD10 c80f1m5n02ib0 - up
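A minimal sketch of the rebalancing step itself, using the NSD and server names from the example above and assuming the mmchnsd disk-descriptor syntax DiskName:ServerList; the new primary server is listed first in each descriptor. Check the mmchnsd documentation for your release to confirm whether the file system must be unmounted first:

/usr/lpp/mmfs/bin/mmchnsd "Perf2a_NSD02:c80f1m5n02ib0,c80f1m5n03ib0;Perf2a_NSD04:c80f1m5n02ib0,c80f1m5n03ib0"
/usr/lpp/mmfs/bin/mmlsnsd        # confirm the new primary assignments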

Suboptimal performance due to incompatible file system block allocation type
In some cases, proof-of-concept (POC) is done on a smaller setup that consists of clusters with eight or
fewer nodes and file system with eight or fewer disks. When the necessary performance requirements are
met, the production file system is deployed on a larger cluster and storage setup. It is possible that on
a larger cluster, the file system performance per NSD is less compared to the smaller POC setup, even if all the cluster and storage components are healthy and performing optimally. In such cases, it is likely that the file system was configured with the default cluster block allocation type during the smaller POC setup and the larger file system setup is configured with the scatter block allocation type.

Problem identification
Issue the mmlsfs command to verify the block allocation type that is in effect on the smaller and larger
setup file system.
In the sample output below, the Block allocation type for the gpfs2 file system is set to scatter.

# mmlsfs gpfs2 | grep 'Block allocation type'

-j scatter Block allocation type

Problem resolution and verification


layoutMap={scatter|cluster} specifies the block allocation map type. When allocating blocks for a
file, GPFS first uses a round robin algorithm to spread the data across all disks in the storage pool. After a
disk is selected, the location of the data block on the disk is determined by the block allocation map type.
For cluster block allocation map type, GPFS attempts to allocate blocks in clusters. Blocks that belong
to a particular file are kept adjacent to each other within each cluster. For scatter block allocation map
type, the location of the block is chosen randomly. For production setup, where performance consistency
throughout the lifetime of the file system is paramount, scatter block allocation type is recommended.
The IBM Storage Scale storage I/O performance sizing also needs to be performed by using the scatter
block allocation.
The cluster allocation method might provide better disk performance for some disk subsystems in
relatively small installations. However, the benefits of clustered block allocation diminish when the
number of nodes in the cluster or the number of disks in a file system increases, or when the file system’s
free space becomes fragmented. The cluster allocation is the default allocation method for GPFS clusters
with eight or fewer nodes and for file systems with eight or fewer disks.
The scatter allocation method provides more consistent file system performance by averaging out
performance variations. This is because, for many disk subsystems, the location of the data relative
to the disk edge has a substantial effect on performance. This allocation method is appropriate in
most cases and is the default allocation type for GPFS clusters with more than eight nodes or file systems
with more than eight disks.
The block allocation map type cannot be changed after the storage pool is created. For more information
on block allocation, see the mmcrfs command in the IBM Storage Scale: Command and Programming
Reference Guide.
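Because the allocation type cannot be changed later, moving to scatter allocation requires re-creating the file system with the -j scatter option. The following is a minimal sketch in which the device name and the NSD stanza file are placeholders:

# Create the file system with scatter allocation explicitly, instead of
# relying on the size-based default
mmcrfs gpfs3 -F /tmp/nsd.stanza -j scatter

# Confirm the allocation type that is in effect
mmlsfs gpfs3 | grep 'Block allocation type'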
Attention: Scatter block allocation is recommended for a production setup where performance
consistency is paramount throughout the lifetime of the file system. However, in FPO
environments (Hadoop or Big Data), cluster block allocation is recommended.

Issues caused by the unhealthy state of the components used


This section discusses the issues that are caused by the unhealthy state of the components that are used in
the IBM Storage Scale stack.

Suboptimal performance due to failover of NSDs to secondary server - NSD server failure
In a shared storage configuration, failure of an NSD server might result in the failover of its NSDs to
the secondary server, if the secondary server is active. This can reduce the total number of NSD servers
actively serving the file system, which in turn impacts the file system's performance.

Problem identification
In IBM Storage Scale, the system-defined node class “nsdnodes” contains all the NSD server nodes in
the IBM Storage Scale cluster. Issue the mmgetstate -N nsdnodes command to verify the state of the
GPFS daemon. The GPFS file system performance might degrade if one or more NSD servers are in the
down or arbitrating or unknown state.
The following example displays two nodes: one in the active state and the other in the down state.

# mmgetstate -N nsdnodes

Node number Node name GPFS state


------------------------------------------
1 c25m3n07-ib active
2 c25m3n08-ib down

Problem resolution and verification


Resolve any system-level or software issues that exist. For example, confirm that the NSD servers have no
network connectivity problems, or that the GPFS portability modules are correctly built for the kernel
that is running. Also, perform necessary low-level tests to ensure that both the NSD server and the
communication to the node are healthy and stable.
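For example, the following checks are one possible sketch of such low-level tests; the node name is a placeholder:

# Basic network reachability and remote shell access to the NSD server
ping -c 3 c25m3n08-ib
ssh c25m3n08-ib uname -r

# Verify that the GPFS kernel modules are loaded; rebuild the portability
# layer if the kernel was updated
ssh c25m3n08-ib 'lsmod | grep mmfs'
ssh c25m3n08-ib /usr/lpp/mmfs/bin/mmbuildgpl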
Verify that no system or software issues exist, and start GPFS on the NSD server by using the mmstartup
-N <NSD_server_to_revive> command. Use the mmgetstate -N nsdnodes command to verify
that the GPFS daemon is in the active state as shown:
# mmgetstate -N nsdnodes

Node number Node name GPFS state


-----------------------------------------
1 c25m3n07-ib active
2 c25m3n08-ib active

Suboptimal performance due to failover of NSDs to secondary server - Disk connectivity failure
In a shared storage configuration, disk connectivity failure on an NSD server might result in failover of its
NSDs to the secondary server, if the secondary server is active. This can reduce the total number of NSD
servers actively serving the file system, which in turn impacts the overall performance of the file system.

Problem identification
The mmlsnsd command displays information about the currently defined disks in a cluster. In the
following sample output, the NSD client is configured to perform file system I/O on the primary NSD
server c25m3n07-ib for odd-numbered NSDs like DMD_NSD01, DMD_NSD03. In this case, c25m3n08-ib
acts as a secondary server.
The NSD client is configured to perform file system I/O on the NSD server c25m3n08-ib for even-
numbered NSDs like DMD_NSD02,DMD_NSD04. In this case, c25m3n08-ib is the primary server, while
c25m3n07-ib acts as the secondary server.
Issue the mmlsnsd command to display the NSD server information for the disks in a file system. The
following sample output shows the various disks in the gpfs1b file system and the NSD servers that are
supposed to act as primary and secondary servers for these disks.
# mmlsnsd

File system Disk name NSD servers


---------------------------------------------------------------------------
gpfs1b DMD_NSD01 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD02 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD03 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD04 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD05 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD06 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD07 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD08 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD09 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD10 c25m3n08-ib,c25m3n07-ib

However, the mmlsdisk <fsdevice> -m command that is issued on the NSD client indicates that the
NSD client is currently performing all the file system I/O on a single NSD server, c25m3n07-ib.

# mmlsdisk <fsdevice> -m

Disk name IO performed on node Device Availability


------------ ----------------------- ----------------- ------------
DMD_NSD01 c25m3n07-ib - up
DMD_NSD02 c25m3n07-ib - up
DMD_NSD03 c25m3n07-ib - up
DMD_NSD04 c25m3n07-ib - up
DMD_NSD05 c25m3n07-ib - up
DMD_NSD06 c25m3n07-ib - up
DMD_NSD07 c25m3n07-ib - up
DMD_NSD08 c25m3n07-ib - up
DMD_NSD09 c25m3n07-ib - up
DMD_NSD10 c25m3n07-ib - up

Problem resolution and verification


Resolve any system-level or disk-level software issues that exist, for example, storage connectivity issues
on the NSD server or driver issues. Rediscover the NSD disk paths by using the mmnsddiscover -a -N
all command. On the NSD client, issue the mmlsnsd command to obtain the primary NSD server that is
configured for each NSD that belongs to the file system. The echo "NSD-Name Primary-NSD-Server";
mmlsnsd | grep <fsdevice> | awk command that is shown in the following example parses the output
that is generated by the mmlsnsd command and displays the primary NSD server for each of the NSDs.
Then, perform file I/O on the NSD client and issue the mmlsdisk <fs> -m command to verify that the NSD
client is performing file system I/O by using all the configured NSD servers.
# echo "NSD-Name Primary-NSD-Server"; mmlsnsd | grep <gpfs1b> | awk -F ','
'{print $1}' | awk '{print $2 " " $3}'

NSD-Name Primary-NSD-Server
DMD_NSD01 c25m3n07-ib
DMD_NSD02 c25m3n08-ib
DMD_NSD03 c25m3n07-ib
DMD_NSD04 c25m3n08-ib
DMD_NSD05 c25m3n07-ib
DMD_NSD06 c25m3n08-ib
DMD_NSD07 c25m3n07-ib
DMD_NSD08 c25m3n08-ib
DMD_NSD09 c25m3n07-ib
DMD_NSD10 c25m3n08-ib

Suboptimal performance due to file system being fully utilized


As a file system nears full utilization, it becomes difficult to find free space for new blocks. This impacts
the performance of the write, append, and create operations.

Problem identification
On the GPFS node, issue the mmdf <fs> command to determine the available space.
# mmdf gpfs1b

disk disk size failure holds holds free KB free KB


name in KB group metadata data in full blocks in
fragments
--------------- ------------- -------- ---------- -------- --------------------
-------------------
Disks in storage pool: system (Maximum disk size allowed is 18 TB)
DMD_NSD01 1756094464 101 Yes Yes 1732298752 ( 99%)
18688 ( 0%)
DMD_NSD09 1756094464 101 Yes Yes 1732296704 ( 99%)
13440 ( 0%)
DMD_NSD03 1756094464 101 Yes Yes 1732304896 ( 99%)
17728 ( 0%)
DMD_NSD07 1756094464 101 Yes Yes 1732300800 ( 99%)
14272 ( 0%)

DMD_NSD05 1756094464 101 Yes Yes 1732298752 ( 99%)
13632 ( 0%)
DMD_NSD06 1756094464 102 Yes Yes 1732300800 ( 99%)
13632 ( 0%)
DMD_NSD04 1756094464 102 Yes Yes 1732300800 ( 99%)
15360 ( 0%)
DMD_NSD08 1756094464 102 Yes Yes 1732294656 ( 99%)
13504 ( 0%)
DMD_NSD02 1756094464 102 Yes Yes 1732302848 ( 99%)
18688 ( 0%)
DMD_NSD10 1756094464 102 Yes Yes 1732304896 ( 99%)
18560 ( 0%)
------------- --------------------
-------------------
(pool total) 17560944640 17323003904 ( 99%)
157504 ( 0%)

============= ====================
===================
(total) 17560944640 17323003904 ( 99%)
157504 ( 0%)

Inode Information
-----------------
Number of used inodes: 4048
Number of free inodes: 497712
Number of allocated inodes: 501760
Maximum number of inodes: 17149440

The UNIX df command can also be used to determine the use percentage (Use%) of a file system. The
following sample output displays a file system with 2% capacity used.
# df -h

Filesystem Size Used Avail Use% Mounted on


/dev/gpfs1b 17T 227G 17T 2% /mnt/gpfs1b

Problem resolution and verification


Use the mmadddisk command to add new disks or NSDs to increase the GPFS file system capacity. You
can also delete unnecessary files from the file system by using the rm command in UNIX environments to
free up space.
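For example, the following sketch adds two new NSDs that are described in a stanza file; the device names, NSD names, server names, and file name are placeholders:

# Hypothetical stanza file that describes the new disks
cat > /tmp/new_disks.stanza <<'EOF'
%nsd: device=/dev/sdx nsd=DMD_NSD11 servers=c25m3n07-ib,c25m3n08-ib usage=dataAndMetadata
%nsd: device=/dev/sdy nsd=DMD_NSD12 servers=c25m3n08-ib,c25m3n07-ib usage=dataAndMetadata
EOF

# Create the NSDs, add them to the file system, and verify the new capacity
mmcrnsd -F /tmp/new_disks.stanza
mmadddisk gpfs1b -F /tmp/new_disks.stanza
mmdf gpfs1b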
In the sample output below, the df -h and mmdf commands show the file system use percentage to be
around 2%. This indicates that the file system has sufficient capacity available.
# df -h

Filesystem Size Used Avail Use% Mounted on


/dev/gpfs1b 17T 211G 17T 2% /mnt/gpfs1b

# mmdf gpfs1b

disk disk size failure holds holds free KB


free KB
name in KB group metadata data in full blocks
in fragments
--------------- ------------- -------- ----------- ------ -------------------
-------------------
Disks in storage pool: system (Maximum disk size allowed is 18 TB)
DMD_NSD01 1756094464 101 Yes Yes 1734092800 ( 99%)
12992 ( 0%)
DMD_NSD09 1756094464 101 Yes Yes 1734094848 ( 99%)
14592 ( 0%)
DMD_NSD03 1756094464 101 Yes Yes 1734045696 ( 99%)
15360 ( 0%)
DMD_NSD07 1756094464 101 Yes Yes 1734043648 ( 99%)
10944 ( 0%)
DMD_NSD05 1756094464 101 Yes Yes 1734053888 ( 99%)
11584 ( 0%)
DMD_NSD06 1756094464 102 Yes Yes 1734103040 ( 99%)
11584 ( 0%)

DMD_NSD04 1756094464 102 Yes Yes 1734096896 ( 99%)
10048 ( 0%)
DMD_NSD08 1756094464 102 Yes Yes 1734053888 ( 99%)
14592 ( 0%)
DMD_NSD02 1756094464 102 Yes Yes 1734092800 ( 99%)
13504 ( 0%)
DMD_NSD10 1756094464 102 Yes Yes 1734062080 ( 99%)
13632 ( 0%)
------------- --------------------
-------------------
(pool total) 17560944640 17340739584 ( 99%)
128832 ( 0%)

============= ====================
===================
(total) 17560944640 17340739584
( 99%) 128832 ( 0%)

Inode Information
-----------------
Number of used inodes: 4075
Number of free inodes: 497685
Number of allocated inodes: 501760
Maximum number of inodes: 17149440

CAUTION: Exercise extreme caution when you delete files. Ensure that the files are no longer
required for any purpose or are backed up before you delete them.

Suboptimal performance due to VERBS RDMA being inactive


IBM Storage Scale for Linux supports InfiniBand Remote Direct Memory Access (RDMA) by using the
Verbs API for data transfer between an NSD client and the NSD server. If InfiniBand VERBS RDMA is
enabled on the IBM Storage Scale cluster and there is a drop in the file system performance, then verify
whether the NSD client nodes are using VERBS RDMA for communication to the NSD server nodes. If the
nodes are not using RDMA, then the communication switches to using the GPFS node’s TCP/IP interface,
which can cause performance degradation.
Note:
• For information about tuning RDMA parameters, see the topic RDMA tuning in the IBM Storage Scale:
Administration Guide.
• In IBM Storage Scale 5.0.4 and later, the GPFS daemon startup service waits for a specified
time period for the RDMA ports on a node to become active. You can adjust the length of
the timeout period and choose the action that the startup service takes if the timeout expires.
For more information, see the descriptions of the verbsPortsWaitTimeout attribute and the
verbsRdmaFailBackTCPIfNotAvailable attribute in the help topic mmchconfig command in the
IBM Storage Scale: Command and Programming Reference Guide.

Problem identification
Issue the mmlsconfig | grep verbsRdma command to verify whether VERBS RDMA is enabled on the
IBM Storage Scale cluster.
# mmlsconfig | grep verbsRdma

verbsRdma enable

If VERBS RDMA is enabled, check whether the status of VERBS RDMA on a node is Started by running
the mmfsadm test verbs status command.
# mmfsadm test verbs status

VERBS RDMA status: started

The following sample output shows the various disks in the gpfs1b file system and the NSD servers that
are supposed to act as primary and secondary servers for these disks.

# mmlsnsd

File system Disk name NSD servers


---------------------------------------------------------------------------
gpfs1b DMD_NSD01 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD02 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD03 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD04 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD05 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD06 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD07 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD08 c25m3n08-ib,c25m3n07-ib
gpfs1b DMD_NSD09 c25m3n07-ib,c25m3n08-ib
gpfs1b DMD_NSD10 c25m3n08-ib,c25m3n07-ib

Issue the mmfsadm test verbs conn command to verify whether the NSD client node is
communicating with all the NSD servers that use VERBS RDMA. In the following sample output, the
NSD client node has VERBS RDMA communication active on only one of the two NSD servers.
# mmfsadm test verbs conn

RDMA Connections between nodes:


destination idx cook sta cli peak cli RD cli WR cli RD KB cli WR KB srv wait serv RD serv WR serv RD
KB serv WR KB vrecv vsend vrecv KB vsend KB
----------- --- --- --- --- --- ------ -------- --------- --------- --- --- -------- --------
----------- ----------- ------- ----- --------- --------
c25m3n07-ib 1 2 RTS 0 24 198 16395 12369 34360606 0 0 0 0
0 0 0 0 0 0

Problem resolution
Resolve any low-level InfiniBand RDMA issues, such as loose InfiniBand cables or InfiniBand fabric issues.
When the low-level RDMA issues are resolved, issue system commands like ibstat or ibv_devinfo to
verify whether the InfiniBand port state is active. The following sample shows the output of an ibstat
command. In this sample, the port state for Port 1 is Active, while that for Port 2 is Down.
# ibstat

CA 'mlx5_0'
CA type: MT4113
Number of ports: 2
Firmware version: 10.100.6440
Hardware version: 0
Node GUID: 0xe41d2d03001fa210
System image GUID: 0xe41d2d03001fa210
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 29
LMC: 0
SM lid: 1
Capability mask: 0x26516848
Port GUID: 0xe41d2d03001fa210
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x26516848
Port GUID: 0xe41d2d03001fa218
Link layer: InfiniBand

Restart GPFS on the node and check whether the status of VERBS RDMA on a node is Started by running
the mmfsadm test verbs status command.
In the following sample output, the NSD client (c25m3n03-ib) and the two NSD servers all show VERBS
RDMA status as started.

# mmdsh -N nsdnodes,c25m3n03-ib '/usr/lpp/mmfs/bin/mmfsadm test verbs status'

c25m3n03-ib: VERBS RDMA status: started


c25m3n07-ib: VERBS RDMA status: started
c25m3n08-ib: VERBS RDMA status: started

Perform a large I/O activity on the NSD client, and issue the mmfsadm test verbs conn command to
verify whether the NSD client node is communicating with all the NSD servers that use VERBS RDMA.
In the following sample output, the NSD client node has VERBS RDMA communication active on all the
active NSD servers.
# mmfsadm test verbs conn

RDMA Connections between nodes:


destination idx cook sta cli peak cli RD cli WR cli RD KB cli WR KB srv wait serv RD serv WR serv RD KB
serv WR KB vrecv vsend vrecv KB vsend KB
------------ --- --- --- --- --- ------- ------- --------- --------- --- --- ------- ------- -----------
----------- ----- ------ --------- ---------
c25m3n08-ib 0 3 RTS 0 13 8193 8205 17179930 17181212 0 0 0 0
0 0 0 0 0 0
c25m3n07-ib 1 2 RTS 0 14 8192 8206 17179869 17182162 0 0 0 0
0 0 0 0 0 0

Issues caused by the use of configurations or commands related to maintenance and operation
This section describes the issues that are caused by the use of configurations or commands related to
maintenance and operation.

Suboptimal performance due to maintenance commands in progress


When in progress, long-running GPFS maintenance operations like mmrestripefs, mmapplypolicy,
mmadddisk, and mmdeldisk consume some percentage of the system resources. Significant
consumption of the system resources can impact the I/O performance of the application.

Problem identification
Check the GPFS log file /var/adm/ras/mmfs.log.latest on the file system manager node (identified by
using the mmlsmgr command) to verify whether any GPFS maintenance operations are in progress.
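For example, the following sketch identifies the file system manager node and then searches its GPFS log for maintenance commands; the grep pattern simply matches the Command entries that are shown in the sample below:

# Identify the file system manager nodes
mmlsmgr

# On the manager node, list recently logged maintenance commands
grep 'Command: mm' /var/adm/ras/mmfs.log.latest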
The following sample output shows that the mmrestripefs operation was initiated on Jan 19 at
14:32:41, and the operation was successfully completed at 14:45:42. The I/O performance of the
application is impacted during this time frame due to the execution of the mmrestripefs command.

Tue Jan 19 14:32:41.625 2016: [I] Command: mmrestripefs /dev/gpfs2 -r -N all


Tue Jan 19 14:45:42.975 2016: [I] Command: successful mmrestripefs /dev/gpfs2 -r -N all

Problem resolution and verification


The Quality of Service (QoS) feature for I/O operations is used to allocate appropriate maintenance IOPS
to reduce the impact of the maintenance operation on the application. In the following sample output, the
file system consists of a single storage pool – the default ‘system’ pool. The QoS feature is disabled and
inactive.

# mmlsqos gpfs1a
QOS config:: disabled
QOS status:: throttling inactive, monitoring inactive

You can use the mmchqos command to allocate appropriate maintenance IOPS to the IBM Storage Scale
system. For example, consider that the storage system has 100 K IOPS. If you want to allocate 1000
IOPS to the long running GPFS maintenance operations for the system storage pool, use the mmchqos
command to enable the QoS feature, and allocate the IOPS as shown:

# mmchqos gpfs1a --enable pool=system,maintenance=1000IOPS

Adjusted QOS Class specification: pool=system,other=inf,maintenance=1000Iops


QOS configuration has been installed and broadcast to all nodes.

Verify the QoS setting and values on a file system by using the mmlsqos command.
# mmlsqos gpfs1a

QOS config:: enabled -- pool=system,other=inf,maintenance=1000Iops
QOS status:: throttling active, monitoring active

Note: Allocating a small share of IOPS, for example 1000 IOPS, to the long running GPFS maintenance
operations can increase the maintenance command execution times. So, depending on the operation's
needs, the IOPS assigned to the ‘other’ and ‘maintenance’ class must be adjusted by using the mmchqos
command. This balances the application as well as the I/O requirements for the GPFS maintenance
operation.
For more information on setting the QoS for I/O operations, see the mmlsqos command section in the
IBM Storage Scale: Command and Programming Reference Guide and Setting the Quality of Service for I/O
operations (QoS) section in the IBM Storage Scale: Administration Guide.

Suboptimal performance due to frequent invocation or execution of maintenance commands
When the GPFS maintenance operations like mmbackup, mmapplypolicy, mmdf, mmcrsnapshot,
mmdelsnapshot, and others are in progress, they can consume some percentage of system resources.
This can impact the I/O performance of applications. If these maintenance operations are scheduled
frequently, for example within every few seconds or minutes, the performance impact can be significant,
unless the I/O subsystem is sized adequately to handle both the application and the maintenance
operation I/O load.

Problem identification
Check the GPFS log file /var/adm/ras/mmfs.log.latest on the file system manager node (identified by using the mmlsmgr command)
to verify whether any GPFS maintenance operations are being invoked frequently by a cron job or other
cluster management software like Nagios.
In the sample output below, the mmdf command is being invoked periodically every 3-4 seconds.

Tue Jan 19 15:13:47.389 2016: [I] Command: mmdf /dev/gpfs2


Tue Jan 19 15:13:47.518 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:13:51.109 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:13:51.211 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:13:54.816 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:13:54.905 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:13:58.481 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:13:58.576 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:14:02.164 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:14:02.253 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:14:05.850 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:14:05.945 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:14:09.536 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:14:09.636 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:14:13.210 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:14:13.299 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:14:16.886 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:14:16.976 2016: [I] Command: successful mmdf /dev/gpfs2
Tue Jan 19 15:14:20.557 2016: [I] Command: mmdf /dev/gpfs2
Tue Jan 19 15:14:20.645 2016: [I] Command: successful mmdf /dev/gpfs2

Problem resolution and verification


Adjust the frequency of the GPFS maintenance operations so that they do not impact the application's
performance. The I/O subsystem must be designed in such a way that it is able to handle both the
application and the maintenance operation I/O load.

You can also use the mmchqos command to allocate appropriate maintenance IOPS, which can reduce the
impact of the maintenance operations on the application.

Suboptimal performance when tracing is active on a cluster


Tracing is usually enabled on the IBM Storage Scale cluster for troubleshooting purposes. However,
running a trace on a node might cause performance degradation.

Problem identification
Issue the mmlsconfig command and verify whether GPFS tracing is configured. The following sample
output displays a cluster in which tracing is configured:
# mmlsconfig | grep trace

trace all 4 tm 2 thread 1 mutex 1 vnode 2 ksvfs 3 klockl 2 io 3 pgalloc 1 mb 1 lock 2 fsck 3
tracedevOverwriteBufferSize 1073741824
tracedevWriteMode overwrite 268435456

Issue the ps -aux | grep lxtrace | grep mmfs command to determine whether the GPFS tracing
process is running on a node. The following sample output shows that the GPFS tracing process is running on
the node:
# ps -aux | grep lxtrace | grep mmfs

root 19178 0.0 0.0 20536 128 ? Ss 14:06 0:00


/usr/lpp/mmfs/bin/lxtrace-3.10.0-229.el7.x86_64 on
/tmp/mmfs/lxtrace.trc.c80f1m5n08ib0 --overwrite-mode --buffer-size
268435456

Problem resolution and verification


When the traces have met their purpose and are no longer needed, use one of the following commands to
stop the tracing on all nodes:
• Use this command to stop tracing:
mmtracectl --stop -N all
• Use this command to clear all the trace setting variables and stop the tracing:
mmtracectl --off -N all

Suboptimal performance due to replication settings being set to 2 or 3


The file system write performance depends on the write performance of the storage volumes and their
RAID configuration. However, if the back-end storage write performance is on par with its read
performance, but the file system write performance is only 50% (half) or 33% (one-third) of the read
performance, check whether file system replication is enabled.

Problem identification
When file system replication is enabled and set to 2, the effective write performance becomes 50% of the
raw write performance, since for every write operation, there are two internal write operations due to
replication. Similarly, when file system replication is enabled and set to 3, the effective write performance
becomes approximately 33% of the raw write performance, since for every write operation, there are
three internal write operations.
Issue the mmlsfs command and verify the default number of metadata and data replicas enabled on the
file system. In the following sample output the metadata and data replication on the file system is set to
2:

# mmlsfs <fs> | grep replica | grep -i default

-m 2 Default number of metadata replicas


-r 2 Default number of data replicas

Issue the mmlsattr command to check whether replication is enabled at the file level:
# mmlsattr -L largefile.foo | grep replication

metadata replication: 2 max 2


data replication: 2 max 2

Problem resolution and verification


A GPFS placement policy can be used to set the replication factor to one for temporary files or non-critical
datasets, for example, log files that can be re-created if necessary.
Follow these steps to set the replication value for log files to 1:
1. Create a placement_policy.txt file by using the following rule:

rule 'non-replicate-log-files' SET POOL 'SNCdata' REPLICATE (1) where lower(NAME) like
'%.log'
rule 'default' SET POOL 'SNCdata'

2. Install the placement policy on the file system by using the following command:

mmchpolicy <fs> placement_policy.txt

Note: You can test the placement policy before installing it by using the following command:

mmchpolicy <fs> placement_policy.txt -I test

3. Remount the file system on all the nodes for the policy to take effect by issuing the following commands:
• mmumount <fs> -N all
• mmmount <fs> -N all
4. Issue the mmlspolicy <fs> -L command to verify whether the output is as shown:

rule 'non-replicate-log-files' SET POOL 'SNCdata' REPLICATE (1) where lower(NAME) like
'%.log'
rule 'default' SET POOL 'SNCdata'

Suboptimal performance due to updates made on a file system or fileset with snapshot
If a file is modified after its snapshot creation, the system can face performance degradation due to the
copy-on-write property enforced on updated data files.

Problem identification
Updating a file that has a snapshot might create unnecessary load on a system because each application
update or write operation goes through the following steps:
1. Read the original data block pertaining to the file region that must be updated.
2. Write the data block that was read in step 1 to the corresponding snapshot location.
3. Perform the application write or update operation on the desired file region.

Issue the mmlssnapshot command to verify whether the snapshot was created before the file data update
operation.
In the following sample output, the gpfs2 file system contains a snapshot.
# mmlssnapshot gpfs2

Snapshots in file system gpfs2:


Directory SnapId Status Created
snap1 2 Valid Mon Jan 25 12:42:30 2016

Problem resolution and verification


Use the mmdelsnapshot command to delete the file system snapshot, if it is no longer necessary.
For more information on the mmdelsnapshot command, see the mmdelsnapshot command in the IBM
Storage Scale: Command and Programming Reference Guide.
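For example, based on the sample output above, the following sketch deletes the snap1 snapshot of the gpfs2 file system and verifies that no snapshots remain:

# Delete the snapshot that is no longer needed, then verify
mmdelsnapshot gpfs2 snap1
mmlssnapshot gpfs2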

Delays and deadlocks


The first item to check when a file system appears hung is the condition of the networks including the
network used to access the disks.
Look for increasing numbers of dropped packets on all nodes by issuing:
• The netstat -D command on an AIX node.
• The ifconfig interfacename command, where interfacename is the name of the interface that is being
used by GPFS for communication (see the Linux example after this list).
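On Linux nodes, the following sketch shows one way to inspect the per-interface drop counters; the interface name is a placeholder:

# Look at the RX/TX "dropped" counters for the interface that GPFS uses
ip -s link show eth0

# Equivalent check where ifconfig is available
ifconfig eth0 | grep -i drop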
When using subnets (see the Using remote access with public and private IP addresses topic in the IBM
Storage Scale: Administration Guide), different interfaces might be in use for intra-cluster and intercluster
communication. The presence of a hang or dropped packet condition indicates a network support issue
that should be pursued first. Contact your local network administrator for problem determination for your
specific network configuration.
If file system processes appear to stop making progress, there may be a system resource problem or an
internal deadlock within GPFS.
Note: A deadlock can occur if user exit scripts that will be called by the mmaddcallback facility are
placed in a GPFS file system. The scripts should be placed in a local file system, so that the scripts are
accessible even when the networks fail.
To debug a deadlock, do the following:
1. Check how full your file system is by issuing the mmdf command. If the mmdf command does not
respond, contact the IBM Support Center. Otherwise, the system displays information similar to:

disk disk size failure holds holds free KB free


KB
name in KB group metadata data in full blocks in
fragments
--------------- ------------- -------- -------- ----- --------------------
-------------------
Disks in storage pool: system (Maximum disk size allowed is 1.1 TB)
dm2 140095488 1 yes yes 136434304 ( 97%) 278232
( 0%)
dm4 140095488 1 yes yes 136318016 ( 97%) 287442
( 0%)
dm5 140095488 4000 yes yes 133382400 ( 95%) 386018
( 0%)
dm0nsd 140095488 4005 yes yes 134701696 ( 96%) 456188
( 0%)
dm1nsd 140095488 4006 yes yes 133650560 ( 95%) 492698
( 0%)
dm15 140095488 4006 yes yes 140093376 (100%) 62
( 0%)
------------- --------------------
-------------------
(pool total) 840572928 814580352 ( 97%) 1900640
( 0%)

============= ====================
===================
(total) 840572928 814580352 ( 97%) 1900640
( 0%)

Inode Information
-----------------
Number of used inodes: 4244
Number of free inodes: 157036
Number of allocated inodes: 161280
Maximum number of inodes: 512000

GPFS operations that involve allocation of data and metadata blocks (that is, file creation and writes)
will slow down significantly if the number of free blocks drops below 5% of the total number. Free
up some space by deleting some files or snapshots (keeping in mind that deleting a file will not
necessarily result in any disk space being freed up when snapshots are present). Another possible
cause of a performance loss is the lack of free inodes. Issue the mmchfs command to increase the
number of inodes for the file system so there is at least a minimum of 5% free. If the file system is
approaching these limits, you may notice the following error messages:
6027-533 [W]
Inode space inodeSpace in file system fileSystem is approaching the limit for the maximum number
of inodes.
operating system error log entry
Jul 19 12:51:49 node1 mmfs: Error=MMFS_SYSTEM_WARNING, ID=0x4DC797C6, Tag=3690419:
File system warning. Volume fs1. Reason: File system fs1 is approaching the limit for the maximum
number of inodes/files.
2. If automated deadlock detection and deadlock data collection are enabled, look in the latest GPFS log
file to determine if the system detected the deadlock and collected the appropriate debug data. Look
in /var/adm/ras/mmfs.log.latest for messages similar to the following:
Thu Feb 13 14:58:09.524 2014: [A] Deadlock detected: 2014-02-13 14:52:59: waiting 309.888 seconds on node
p7fbn12: SyncHandlerThread 65327: on LkObjCondvar, reason 'waiting for RO lock'
Thu Feb 13 14:58:09.525 2014: [I] Forwarding debug data collection request to cluster manager p7fbn11 of
cluster cluster1.gpfs.net
Thu Feb 13 14:58:09.524 2014: [I] Calling User Exit Script gpfsDebugDataCollection: event
deadlockDebugData,
Async command /usr/lpp/mmfs/bin/mmcommon.
Thu Feb 13 14:58:10.625 2014: [N] sdrServ: Received deadlock notification from 192.168.117.21
Thu Feb 13 14:58:10.626 2014: [N] GPFS will attempt to collect debug data on this node.
mmtrace: move /tmp/mmfs/lxtrace.trc.p7fbn12.recycle.cpu0
/tmp/mmfs/trcfile.140213.14.58.10.deadlock.p7fbn12.recycle.cpu0
mmtrace: formatting /tmp/mmfs/trcfile.140213.14.58.10.deadlock.p7fbn12.recycle to
/tmp/mmfs/trcrpt.140213.14.58.10.deadlock.p7fbn12.gz

This example shows that deadlock debug data was automatically collected in /tmp/mmfs. If deadlock
debug data was not automatically collected, it would need to be manually collected.
To determine which nodes have the longest waiting threads, issue this command on each node:

/usr/lpp/mmfs/bin/mmdiag --waiters waitTimeInSeconds

For all nodes that have threads waiting longer than waitTimeInSeconds seconds, issue:

mmfsadm dump all

Notes:
a. Each node can potentially dump more than 200 MB of data.
b. Run the mmfsadm dump all command only on nodes that you are sure the threads are really
hung. An mmfsadm dump all command can follow pointers that are changing and cause the node
to crash.
3. If the deadlock situation cannot be corrected, follow the instructions in “Additional information to
collect for delays and deadlocks” on page 556, then contact the IBM Support Center.

Failures using the mmpmon command
The mmpmon command manages performance monitoring and displays performance information.
The mmpmon command is thoroughly documented in “Monitoring I/O performance with the mmpmon
command” on page 59 and the mmpmon command page in the IBM Storage Scale: Command and
Programming Reference Guide. Before proceeding with mmpmon problem determination, review all of this
material to ensure that you are using the mmpmon command correctly.

Setup problems using mmpmon


The issues associated with the setup of the mmpmon command and the limitations of this command.
Remember these points when using the mmpmon command:
• You must have root authority.
• The GPFS daemon must be active.
• The input file must contain valid input requests, one per line (see the sample input file after this list).
When an incorrect request is detected by mmpmon, it issues an error message and terminates.
Input requests that appear in the input file before the first incorrect request are processed by mmpmon.
• Do not alter the input file while mmpmon is running.
• Output from mmpmon is sent to standard output (STDOUT) and errors are sent to standard error (STDERR).
• Up to five instances of mmpmon may run on a given node concurrently. See “Monitoring I/O performance
with the mmpmon command” on page 59. For the limitations regarding concurrent usage of mmpmon,
see “Running mmpmon concurrently from multiple users on the same node” on page 62.
• The mmpmon command does not support:
– Monitoring read requests without monitoring writes, or the other way around.
– Choosing which file systems to monitor.
– Monitoring on a per-disk basis.
– Specifying different size or latency ranges for reads and writes.
– Specifying different latency values for a given size range.
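For example, the following sketch shows a small input file with one request per line and a single invocation of mmpmon; the file name is arbitrary:

# Hypothetical input file with valid requests, one per line
cat > /tmp/mmpmon.in <<'EOF'
ver
fs_io_s
io_s
EOF

# Run mmpmon once with that input file (requires root and an active GPFS daemon)
/usr/lpp/mmfs/bin/mmpmon -i /tmp/mmpmon.in -r 1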

Incorrect output from mmpmon


The analysis of incorrect output from the mmpmon command.
If the output from mmpmon is incorrect, such as zero counters when you know that I/O activity is taking
place, consider these points:
1. Someone may have issued the reset or rhist reset requests.
2. Counters may have wrapped due to a large amount of I/O activity, or running mmpmon for an extended
period of time. For a discussion of counter sizes and counter wrapping, see Counter sizes and counter
wrapping section in “Monitoring I/O performance with the mmpmon command” on page 59.
3. See the Other information about mmpmon output section in “Monitoring I/O performance with the
mmpmon command” on page 59. This section gives specific instances where mmpmon output may be
different than what was expected.

Abnormal termination or hang in mmpmon


The course of action that must be followed if the mmpmon command hangs or terminates.
If mmpmon hangs, perform these steps:
1. Ensure that sufficient time has elapsed to cover the mmpmon timeout value. It is controlled using the -t
flag on the mmpmon command.
2. Issue the ps command to find the PID for mmpmon.

3. Issue the kill command to terminate this PID.
4. Try the function again.
5. If the problem persists, issue this command:

mmfsadm dump eventsExporter

6. Copy the output of mmfsadm to a safe location.


7. Follow the procedures in “Information to be collected before contacting the IBM Support Center” on
page 555, and then contact the IBM Support Center.
If mmpmon terminates abnormally, perform these steps:
1. Determine if the GPFS daemon has failed, and if so, restart it.
2. Review your invocation of mmpmon, and verify the input.
3. Try the function again.
4. If the problem persists, follow the procedures in “Information to be collected before contacting the
IBM Support Center” on page 555, and then contact the IBM Support Center.

Tracing the mmpmon command


The course of action to be followed if the mmpmon command does not perform as expected.
When the mmpmon command does not work properly, there are two trace classes used to determine the
cause of the problem. Use these only when requested by the IBM Support Center.
eventsExporter
Reports attempts to connect and whether or not they were successful.
mmpmon
Shows the command string that came in to the mmpmon command, and whether it was successful or
not.
Note: Do not use the perfmon trace class of the GPFS trace to diagnose mmpmon problems. This trace
event does not provide the necessary data.

Chapter 31. GUI and monitoring issues
The topics listed in this section provide the list of most frequent and important issues reported with the
IBM Storage Scale GUI.

Understanding GUI support matrix and limitations


It is important to understand the supported versions and limitations to analyze whether you are facing a
real issue in the system.
The IBM Storage Scale FAQ in IBM Documentation contains the GUI support matrix. The
IBM Storage Scale FAQ is available at https://www.ibm.com/docs/en/spectrum-scale?topic=STXKQY/
gpfsclustersfaq.html.
To know more about GUI limitations, see GUI limitations in IBM Storage Scale: Administration Guide.

Known GUI issues


There are a set of known issues in the GUI. Most of the problems get fixed when you upgrade the GUI to
the version that contains the fix. The following topics cover some of the examples for the most frequent
GUI issues and their resolutions.
Related concepts
“Monitoring system health by using IBM Storage Scale GUI” on page 1
The IBM Storage Scale system provides background monitoring capabilities to check the health of a
cluster and each node of the cluster, including all the services that are hosted on a node. You can view
the system health states or corresponding events for the selected health state on the individual pages,
widgets or panels of the IBM Storage Scale GUI. You can also view system health details by issuing the
mmhealth command options like mmhealth cluster show, mmhealth node show, or other similar
options.
“Collecting diagnostic data through GUI” on page 294
IBM Support might ask you to collect logs, trace files, and dump files from the system to help them
resolve a problem. You can perform this task from the management GUI or by using the gpfs.snap
command. Use the Support > Diagnostic Data page in the IBM Storage Scale GUI to collect details of the
issues reported in the system.

GUI fails to start


This issue is primarily caused by a database issue. In ideal scenarios, the service script automatically
initializes and starts PostgreSQL. However, in rare cases, the database might be either inconsistent or
corrupted.
If the PostgreSQL database is corrupted, it might be because of the following reasons:
• The additional (non-distro) postgreSQL package is installed and it occupies the port 5432.
• Details that are stored in the /etc/hosts file are corrupted so that "localhost" is not listed as the first
item for the IP 127.0.0.1.
• An incompatible schema exists in the database from a previous release.
• The GUI version is downgraded to an older version.
If the GUI logs show any of the database errors, try the following steps:
1. Issue systemctl stop gpfsgui to stop GUI services.
2. Issue 'su postgres -c 'psql -d postgres -c "DROP SCHEMA FSCC CASCADE"''.
3. If the previous step does not help, issue 'rm -rf /var/lib/pgsql/data'.
4. Issue systemctl start gpfsgui to start GUI.

If the problem still persists, it might be because of a corrupted GUI installation, missing GUI dependency,
or some other unknown issue. In this scenario, you can remove and reinstall the GUI rpm. For more
information on how to install and uninstall GUI rpms, see Manually installing IBM Storage Scale
management GUI in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
You can collect the logs that are available in the /var/log/cnlog/mgtsrv folder to investigate further.
You can also use the gpfs.snap command as shown in the following example to collect logs and dumps
in case of a GUI issue:

gpfs.snap -N GUI_MGMT_SERVERS

Collecting logs and dumps through the gpfs.snap command also collects the GPFS logs. So, manually
getting the logs from the /var/log/cnlog/mgtsrv folder is quicker and provides only the data that is
required to investigate the details of the GUI issue.

GUI fails to restart after upgrade of all other GPFS packages on SLES 15 SP3
This issue might occur as a result of version conflicts in PostgreSQL packages.
Try the following procedure to resolve version conflicts and successfully restart the GUI.
1. Delete all PostgreSQL packages from the node.
2. Install the PostgreSQL modules by specifying the specific version. For example,

# zypper install postgresql-13-8.30.noarch

# zypper install postgresql-contrib-13-8.30.noarch

3. Issue the following command to check that the installed packages do not have any conflict in versions.
For example,

# rpm -qa | grep postgresql

If there is a version conflict in the installed packages, the output displays the details as shown in the
following example:

postgresql13-contrib-13.6-5.25.1.s390x
postgresql13-server-13.6-5.25.1.s390x
postgresql-server-14-10.6.2.noarch
postgresql13-13.6-5.25.1.s390x
postgresql-contrib-14-10.6.2.noarch
postgresql-14-10.6.2.noarch

In the following example, the output indicates that there is no version conflict:

postgresql13-contrib-13.6-5.25.1.s390x
postgresql13-server-13.6-5.25.1.s390x
postgresql-server-13-8.30.noarch
postgresql13-13.6-5.25.1.s390x
postgresql-contrib-13-8.30.noarch
postgresql-13-8.30.noarch

4. If the GUI service fails to validate the checkpoint record, issue the following command to reset
the transaction logs.

# pg_resetwal -f /usr/local/var/postgres/

Note: If the pg_resetwal: error: cannot be executed by "root" error occurs, switch to the
postgres user before you issue the command. For example,

su postgres

5. Issue the following commands to clean up the data, initialize the database and start the PostgreSQL
service.

# rm -rf /var/lib/pgsql/data
# su postgres -c 'initdb -D /var/lib/pgsql/data'
#systemctl start postgresql

6. Issue the following command to check if the service status is in running state.

# systemctl status postgresql

7. Issue the following command to restart the GUI service.

# systemctl restart gpfsgui

GUI fails to start after manual installation or upgrade on Ubuntu nodes
The IBM Storage Scale management GUI startup might fail after manual installation or upgrade on Ubuntu
18.04 nodes.
The following error message is displayed.

Job for gpfsgui.service failed because the control process exited with error code

When you upgrade from Ubuntu 16.x to Ubuntu 18.x, the PostgreSQL database server might be running
with cluster versions (9.x and 10.x) with the default set to version 9.x. In this scenario, after manual
installation or upgrade of the IBM Storage Scale management GUI, the GUI restarts once before it starts
to run successfully. Systemd reports a startup error of the gpfsgui.service unit. The IBM Storage
Scale GUI clears the database and creates it on the PostgreSQL 10.x instance. This error message can be
ignored.

GUI login page does not open


The management GUI is accessible through the following URL after the installation: https://<ip or
host name>.
If the GUI login page does not open, try out the following:
1. Issue the following command to verify the status:

systemctl status gpfsgui

2. Check the status of java components by issuing the following command:

netstat -lnp | grep java

There can be more lines in the output as given in the following example. The GUI does a self-check on
443 and is automatically redirected to 47443:

tcp6 0 0 :::47443 :::* LISTEN 22363/java
tcp6 0 0 127.0.0.1:4444 :::* LISTEN 22363/java
tcp6 0 0 :::47080 :::* LISTEN 22363/java

Note:
• The IBM Storage Scale GUI WebSphere® Java process no longer runs as root but as a user named
scalemgmt. The GUI process now runs on port 47443 and 47080 and uses iptables rules to forward
port 443 to 47443 and 80 to 47080.
• The port 4444 is used by the GUI CLI to interact with the GUI back-end service. Other ports that are
listed here are used by Java internally.

If you find that the port 47443 is not opened by WebSphere Liberty, restart the GUI service by issuing
the systemctl restart gpfsgui command. The GUI uses the default HTTPS port 443. If some
other application or process listens to this port, it causes a port conflict and the GUI does not work.

GUI performance monitoring issues


The sensor gets the performance data for the collector. The collector application that is called pmcollector
runs on every GUI node to display the performance details in the GUI. A sensor application is running on
every node of the system.
If GUI is not displaying the performance data, the following might be the reasons:
1. Collectors are not enabled
2. Sensors are not enabled
3. NTP failure

Collectors are not enabled


Do the following to verify whether collectors are working properly:
1. Issue systemctl status pmcollector on the GUI node to confirm that the collector is running.
2. If collector service is not started already, start the collector on the GUI nodes by issuing
the systemctl restart pmcollector command. Depending on the system requirement, the
pmcollector service can be configured to be run on the nodes other than GUI nodes. You need to verify
the status of pmcollector service on all nodes where collector is configured.
3. If you cannot start the service, verify its log file that is located at /var/log/zimon/
ZIMonCollector.log to see whether it logs any other details of the issues related to the collector
service status.
4. Use a sample CLI query to test if data collection works properly. For example:

mmperfmon query cpu_user

Note: After migrating from release 4.2.0.x or later to 4.2.1 or later, you might see the pmcollector service
critical error on GUI nodes. In this case, restart the pmcollector service by running the systemctl
restart pmcollector command on all GUI nodes.

Sensors are not enabled or not correctly configured


The following table lists sensors that are used to get the performance data for each resource type:

Table 62. Sensors available for each resource type

Resource type              Sensor name           Candidate nodes
Network                    Network               All
System Resources           CPU                   All
                           Load
                           Memory
NSD Server                 GPFSNSDDisk           NSD Server nodes
IBM Storage Scale Client   GPFSFilesystem        IBM Storage Scale Client nodes
                           GPFSVFS
                           GPFSFilesystemAPI
NFS                        NFSIO                 Protocol nodes running NFS service
SMB                        SMBStats              Protocol nodes running SMB service
                           SMBGlobalStats
CTDB                       CTDBStats             Protocol nodes running SMB service
Object                     SwiftAccount          Protocol nodes running Object service
                           SwiftContainer
                           SwiftObject
                           SwiftProxy
Transparent Cloud Tiering  MCStoreGPFSStats      Cloud gateway nodes
                           MCStoreIcstoreStats
                           MCStoreLWEStats
Capacity                   DiskFree              All nodes
                           GPFSFilesetQuota      Only a single node
                           GPFSDiskCap           Only a single node

The IBM Storage Scale GUI lists all sensors in the Services > Performance Monitoring > Sensors page.
You can use this view to enable sensors and set appropriate periods and restrictions for sensors. If the
configured values are different from recommended values, such sensors are highlighted with a warning
symbol.
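For example, the following sketch reviews the sensor configuration and adjusts one sensor period from the CLI; the sensor name and the period value are illustrative only:

# Show the current sensor configuration
mmperfmon config show

# Example only: set the GPFSDiskCap sensor to collect once per day (86400 seconds)
mmperfmon config update GPFSDiskCap.period=86400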
You can query the data displayed in the performance charts through CLI as well. For more information
on how to query performance data displayed in GUI, see “Querying performance data shown in the GUI
through CLI” on page 167.

NTP failure
The performance monitoring fails if the clock is not properly synchronized in the cluster. Issue the ntpq
-c peers command to verify the NTP state.
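A sketch of the check; the chronyc alternative assumes that the cluster uses chrony instead of ntpd:

# On nodes running ntpd
ntpq -c peers

# On nodes running chronyd
chronyc sources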

GUI Dashboard fails to respond


The system log displays the error as shown.

mgtsrv-system-log shows this error:


2020-12-26T21:12:21 >ZiMONUtility.runQuery:150< FINE: WARNING: PERF_SLOWQUERY -
the following query took at least 10000ms. to execute (executed at 1609045931):
get -j metrics max(df_free) group_by mountPoint now bucket_size 86400
2020-12-26T21:12:49 >MMEvent.handle:32< FINE: Received mmantras event:
fullMessage:2020.12.26 21.12.49 cliaudit nodeIP:172.16.0.11 GUI-CLI root SYSTEM [EXIT, CHANGE]
'mmaudit all list -Y' RC=1 pid=48922
eventTime:2020.12.26 21.12.49
eventName:cliaudit
nodeIP:172.16.0.11
Event Handler:com.ibm.fscc.gpfs.events.CliAuditEvent
2020-12-26T21:14:24 >ServerCompDAO.getStorageServerForComponent:494< FINE:
No StorageServer found for Component: GenericComponent [clusterId=10525186418337620125,
compType=SERVER, componentId=10, partNumber=5148-21L, serialNumber=78816BA,
name=5148-21L-78816BA, gpfsNodeId=3, displayId=null]

This can occur owing to a large number of keys being unnecessarily collected. You can check the total
number of keys by issuing the following command:

mmperfmon query --list keys | wc -l

To resolve, you need to delete the obsolete or expired keys by issuing the following command.

mmperfmon delete --expiredKeys

If there are a large number of keys that are queued for deletion, the command may fail to respond. As an
alternative, issue the following command.

mmsysmonc -w 3600 perfkeys delete --expiredkeys

The command waits up to one hour for the processing. If you use Docker or something similar which
creates short lived network devices or mount points, those entities can be ignored by using a filter as
shown:

mmperfmon config update Network.filter="netdev_name=veth.*|docker.*|flannel.*|cali.*|cbr.*"


mmperfmon config update DiskFree.filter="mountPoint=/var/lib/docker.*|/foo.*"

Related concepts
“Performance monitoring using IBM Storage Scale GUI” on page 158
The IBM Storage Scale GUI provides a graphical representation of the status and historical trends of the
key performance indicators. The manner in which information is displayed on the GUI, helps Users to
make quick and effective decisions, easily.
“Performance issues” on page 479
The performance issues might occur because of the system components or configuration or maintenance
issues.

GUI is showing “Server was unable to process the request” error


The GUI might not respond on user actions or it might show “Server was unable to process the request”
error. This might be because of an issue in the JavaScript layer, which runs on the browser. JavaScript
errors are not collected in the diagnostic data. The IBM Support might need the JavaScript error details to
troubleshoot this issue.
The location where the JavaScript console can be accessed depends on the web browser.
• For Google Chrome: Select menu item Tools > Javascript Console.
• For Mozilla Firefox: Select Web Console from the Web Developer submenu in the Firefox Menu.

GUI is displaying outdated information


The IBM Storage Scale GUI caches configuration data in an SQL database. Refresh tasks update the
cached information. Many refresh tasks are invoked by events when the configuration is changed in the
cluster. In those cases, the GUI pages reflect changes in a minute. For certain types of data, events are
not raised to invoke the refresh tasks. In such cases, the system must poll the data on a regular interval to
reflect up-to-date information in the GUI pages. All the refresh tasks run on a schedule. The system also
polls the data frequently even for those tasks that are triggered by events.
If the GUI shows stale data and the user does not want to wait until the next run of the refresh task, you can
run those refresh tasks manually as shown in the following example:

/usr/lpp/mmfs/gui/cli/runtask <task_name>

Note: Many file system-related tasks require the corresponding file system to be mounted on the GUI node
to collect data.
For some stale GUI events, to complete the recovery procedure you must run the following commands on
every GUI node that is configured for the cluster:
1. First, run this command: systemctl restart gpfsgui

2. After the command completes, run this command: /usr/lpp/mmfs/gui/cli/lshealth --reset
The following table lists the details of the available GUI refresh tasks.
Table 63. GUI refresh tasks

Refresh task   Frequency   Collected information   Prerequisite - file system must be mounted   Invoked by event   CLI commands used

AFM_FILESET_STATE 60 The AFM fileset status Yes Any event for mmafmctl getstate -Y
component AFM

AFM_NODE_MAPPING 720 The AFM target map definitions No On execution of mmafmconfig show -Y
mmafmconfig

ALTER_HOST_NAME 12 h Host names and IP addresses in mmremote networkinfo


Monitor > Nodes page

CALLBACK 6h Checks and registers callbacks Yes mmlscallback and


that are used by GUI mmaddcallback

CALLHOME 1h Call home configuration data No On execution mmcallhome capability


of mmcallhome list -Y, mmcallhome
command info list -Y, mmcallhome
proxy list -Y, and
mmcallhome group list
-Y

CALLHOME_STATUS 1m Callhome status information No mmcallhome status list


-Y

CES_ADDRESS 1h CES IP addresses in Monitor > Yes mmces node list


Nodes page

CES_STATE 10 min CES state in Monitor > Nodes Yes


mmces state show -N
cesNodes

mmces events active -N


cesNodes (used for the
information field)

CES_SERVICE_STATE 1h CES service state in Monitor > Yes mmces service list -N
Nodes page cesNodes -Y

CES_USER_AUTH_ SERVICE 1h Not displayed Yes mmuserauth service list


-Y

CHECK_FIRMWARE 6h Monitor > Events page Checks whether the reported


firmware is up to date

CLUSTER_CONFIG 1h List of nodes and node classes in Yes mmsdrquery and


Monitoring > Nodes mmlsnodeclass

CONNECTION_ STATUS 10 min Connections status in Monitoring Nodes reachable through SSH
> Nodes page

DAEMON_ CONFIGURATION 1h Not displayed Yes mmlsconfig

DF 1h Not directly displayed; used to Yes Yes df, df -i, mmlspool


generate low space events

DIGEST_NOTIFICATION_ Once a day Sends daily event reports if No


at 04:15AM configured
TASK

DISK_USAGE 3:00 AM Disk usage information. Not Yes mmdf, mmsdrquery


directly displayed; used to (mmlsnsd and mmremote
generate low space events getdisksize for non-GNR-
NSDs that is not assigned to the
file system)

DISKS 1h NSD list in Monitoring > NSDs Yes mmsqrquery, mmlsnsd, and
mmlsdisk

FILESYSTEM_MOUNT 1h Mount state in Files > File Yes mmlsmount


Systems

FILESYSTEMS 1h List of file systems in Files > File Yes Yes mmsdrquery, mmlsfs, and
Systems mmlssnapdir

Chapter 31. GUI and monitoring issues 507


Table 63. GUI refresh tasks (continued)

Prerequisite -
File system
must be
Refresh task Frequency Collected information mounted Invoked by event CLI commands used

GUI_CONFIG_CHECK 12 h Checks that cluster configuration Yes mmsdrquery, mmgetstate,


is compatible with GUI and getent
requirements

HEALTH_STATES 10 min Health events in Monitoring > Yes mmhealth node show
Events {component}

-v -N {nodes} -Y

mmhealth node eventlog


-Y

HOST_STATES 1h GPFS state in Monitoring > Yes mmgetstate


Nodes

HOST_STATES_CLIENTS 3h Information about GPFS clients No On quorum-related mmgetstate -N


events 'clientLicense' -Y

LOG_REMOVER 6h Deletes aged database entries No

MASTER_GUI_ELECTION 1m Checks if all GUIs in the cluster No HTTP call to other GUIs
are running and elects a new
master GUI if needed.

MOUNT_CONFIG 12 h Mount configuration No On execution of any Internal commands


mm*fs command

NFS_EXPORTS 1h Exports in Protocols > NFS Yes mmcesservice list and


Exports mmcesnfslsexport

NFS_EXPORTS_ DEFAULTS 1h Not displayed Yes mmcesservice list and


mmcesnfslscfg

NFS_SERVICE 1h NFS settings in Settings > NFS Yes mmcesservice list and
Service mmcesnfslscfg

NODECLASS 6h Node classes in Monitor>Nodes Yes mmlsnodeclass

NODE_LICENSE 6h Node license information No On execution mmlslicense -Y


of mmchlicense
command

OBJECT_STORAGE_ POLICY 6h Storage policies of containers in Yes mmobj policy list


Object > Accounts

OS_DETECT 6h Information about operating Yes mmremote nodeinfo


system, cpu architecture,
hardware vendor, type, serial in
Monitoring > Nodes

PM_MONITOR 10 min Checks if the performance systemctl status


collector is up and running and pmcollector and zimon
also checks the CPU data for query
each node

PM_SENSORS 6h Performance monitoring sensor No On execution of mmperfmon config show


configuration mmperfmon

PM_TOPOLOGY 1h Performance data topology No perfmon query

POLICIES 1h Policies in Files > Information Yes Yes mmlspolicy


Lifecycle

QUOTA 2:15 AM Quotas in Files > Quota Yes Yes mmrepquota and
mmlsdefaultquota
Fileset capacity in Monitoring >
Capacity

QUOTA_MAIL Once a day Sends daily quota reports if No


at 05:00 AM configured

RDMA_INTERFACES 12 h Information about the RDMA No mmremote ibinfo


interfaces

REMOTE_CLUSTER 10 m General information about No REST API call to remote GUIs


remote clusters

REMOTE_CONFIG 1h Not displayed Yes mmauth, gets and parses


sdr file

508 IBM Storage Scale 5.1.9: Problem Determination Guide


Table 63. GUI refresh tasks (continued)

Prerequisite -
File system
must be
Refresh task Frequency Collected information mounted Invoked by event CLI commands used

REMOTE_FILESETS 1h Information about filesets of No REST API call to remote GUIs


remote clusters

REMOTE_GPFS_CONFIG 3h The GPFS configuration of No REST API call to remote GUIs


remote clusters

REMOTE_HEALTH 15 m The health states of remote No REST API call to remote GUIs
clusters
_STATES

SMB_GLOBALS 1h SMB settings in Settings > SMB Yes mmcessmblsconfig


Service

SMB_SHARES 1h Shares in Protocols > SMB Yes mmcessmblsexport


Shares

SNAPSHOTS 1h Snapshots in Files > Snapshots Yes Yes mmlssnapshot

SNAPSHOT_MANAGER 1m Creates scheduled snapshots mmcrsnapshot and


mmdelsnapshot

SYSTEMUTIL_DF 1h Used to generate warnings if Checks local disk space of node


nodes run out of local disk space

STORAGE_POOL 1h Pool properties in Files > File Yes mmlspool <device> all
Systems -L -Y

TCT_ACCOUNT 1h Information about TCT accounts No On execution of mmcloudgateway account


mmcloudgateway list
command

TCT_CLOUD_SERVICE 1h Information about TCT cloud No On execution of mmcloudgateway service


services mmcloudgateway status
command

TCT_NODECLASS 1h Information about TCT node No On execution of mmlsnodeclass


classes mmchnode command

THRESHOLDS 3h Information about the configured No On execution mmhealth thres


thresholds of mmhealth
thresholds

Capacity information is not available in GUI pages


The IBM Storage Scale management GUI does not display the capacity information on various GUI pages
if GPFSDiskCap and GPFSFilesetQuota sensors are disabled and quota is disabled on the file system.
The following table provides the solution for the capacity data display issues in the corresponding GUI
pages.

Table 64. Troubleshooting details for capacity data display issues in GUI

GUI page: Files > File Systems and Storage > Pools
Solution: Verify whether the GPFSPool sensor is enabled on at least one node and ensure that the file system is mounted on this node. The health subsystem might have enabled this sensor already. The default period for the GPFSPool sensor is 300 seconds (5 minutes).

GUI page: Files > Filesets does not display fileset capacity details
Solution: In this case, the quota is not enabled for the file system that hosts this fileset. Go to the Files > Quotas page and enable quotas for the corresponding file system. By default, the quotas are disabled for all file systems (a command-line alternative follows this table).

GUI page: Monitoring > Statistics
Solution: Verify whether the GPFSDiskCap and GPFSFilesetQuota sensors are enabled and quota is enabled for the file systems. For more information on how to enable performance monitoring sensors, see “Configuring performance monitoring options in GUI” on page 161.
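As a hedged command-line alternative to the GUI steps in the previous table, quotas can also be enabled with mmchfs and the sensor configuration can be reviewed with mmperfmon. The file system name fs1 is an assumption used only for illustration:

/usr/lpp/mmfs/bin/mmchfs fs1 -Q yes          # enable quota enforcement on the file system
/usr/lpp/mmfs/bin/mmperfmon config show      # check whether the GPFSPool, GPFSDiskCap, and GPFSFilesetQuota sensors are configured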

GUI automatically logs off the users when using Google Chrome or
Mozilla Firefox
If the GUI is accessed through Google Chrome or Mozilla Firefox browsers and the tab is in the
background for more than 30 minutes, the users get logged out of the GUI.
This issue is reported if no timeout is specified on the Services > GUI > Preferences page. If a timeout
was specified, the GUI session expires when there is no user activity for that period of time, regardless of
the active browser tab.
Note: This issue is reported on Google Chrome version 57 or later, and Mozilla Firefox 58 or later.



Chapter 32. AFM issues
The following table lists the common questions in AFM.

Table 65. Common questions in AFM with their resolution

Question: How do I flush requeued messages?
Answer: Sometimes, requests in the AFM messages queue on the gateway node get requeued because of errors at the home cluster. For example, if space is not available at the home cluster to perform a new write, a write message that is queued is not successful and gets requeued. The administrator views the failed message being requeued on the Primary gateway. Add more space to the home cluster and run mmafmctl resumeRequeued so that the requeued messages are executed at home again (a command sketch follows this table). If mmafmctl resumeRequeued is not run by an administrator, AFM executes the message in the regular order of message executions from the cache cluster to the home cluster.
Running the mmfsadm saferdump afm all command on the gateway node displays the queued messages. The requeued messages are displayed in the dumps. An example:

   c12c4apv13.gpfs.net: Normal Queue: (listed by execution order)
   (state: Active) c12c4apv13.gpfs.net: Write [612457.552962]
   requeued file3 (43 @ 293) chunks 0 bytes 0 0

Question: Why is a fileset in the Unmounted or Disconnected state when parallel I/O is set up?
Answer: Filesets that are using a mapping target go to the Disconnected mode if the NFS server of the Primary gateway is unreachable, even if NFS servers of all participating gateways are reachable. The NFS server of the Primary gateway must be checked to fix this problem.

Question: How do I activate an inactive fileset?
Answer: The mmafmctl prefetch command without options, where prefetch statistics are procured, activates an inactive fileset.

Question: How do I reactivate a fileset in the Dropped state?
Answer: The mmafmctl prefetch command without options, where prefetch statistics are procured, activates a fileset in a dropped state.

Question: How to clean unmount the home filesystem if there are caches using GPFS protocol as backend?
Answer: To have a clean unmount of the home filesystem, the filesystem must first be unmounted on the cache cluster where it is remotely mounted, and then the home filesystem must be unmounted. Unmounting the remote file system from all nodes in the cluster might not be possible until the relevant cache cluster is unlinked or the local file system is unmounted.
Force unmount, shutdown, or crash of the remote cluster results in panic of the remote filesystem at the cache cluster and the queue is dropped. The next access to the fileset runs the recovery. However, this should not affect the cache cluster.

Question: What should be done if the df command hangs on the cache cluster?
Answer: On RHEL 7.0 or later, df does not support hidden NFS mounts. As AFM uses regular NFS mounts on the gateway nodes, this change causes commands like df to hang if the secondary gets disconnected.
The following workaround can be used that allows NFS mounts to continue to be hidden: Remove the /etc/mtab symlink, create a new file /etc/mtab, and copy /proc/mounts to the /etc/mtab file during startup. In this solution, the mtab file might go out of synchronization with /proc/mounts.

Question: What happens when the hard quota is reached in an AFM cache?
Answer: Like any filesystem that reaches the hard quota limit, requests fail with E_NO_SPACE.

Question: When are inodes deleted from the cache?
Answer: After an inode is allocated, it is never deleted. The space remains allocated and the inodes are re-used.

Question: If inode quotas are set on the cache, what happens when the inode quotas are reached?
Answer: Attempts to create new files fail, but cache eviction is not triggered. Cache eviction is triggered only when the block quota is reached, not the inode quotas.

Question: How can the cache use more inodes than the home?
Answer: One way is for file deletions. If a file is renamed at the home site, the file in cache is deleted and created again in cache. This results in the file being assigned a different inode number at the cache site. Also, if a cache fileset is LU mode or SW mode, then there can be changes made at the cache that cause it to be bigger than the home.

Question: Why does a fileset go to Unmounted state even if home is accessible on the cache cluster?
Answer: Sometimes, it is possible that the same home is used by multiple clusters; one set of filesets doing a quiesce turns the home unresponsive to the second cluster's filesets, which show home as unmounted.

Question: What could be the impact of not running the mmafmconfig command despite having a GPFS home?
Answer: Sparse file support is not present even if home is GPFS. Recovery and many AFM functions do not work. Crashes can happen for readdir or lookup, if the backend is using NSD protocol and the remote mount is not available at the gateway node.

Question: What should be done if there are cluster wide waiters but everything looks normal, such as home is accessible from gateway nodes and applications are in progress on the cache fileset?
Answer: This can happen when the application is producing requests at a faster pace. Check iohist to check disk rates.

Question: Read seems to be stuck/inflight for a long time. What should be done?
Answer: Restart NFS at home to see if the error resolves. Check the status of the fileset using the mmafmctl getstate command to see if your fileset is in unmounted state.

Question: The mmfs.log shows errors during read such as error 233.
Answer: These are temporary issues during read: Tue Feb 16 03:32:40.300 2016: [E] AFM: Read file system fs1 fileset newSanity-160216-020201-KNFS-TC8-SW file IDs [58195972.58251658.-1.-1,R] name file-3G remote error 233. These go away automatically and the read should be successful.

Question: Can the home have different sub-directories exported using unique FSIDs, while the parent directory is also exported using an FSID?
Answer: This is not a recommended configuration.

Question: I have a non-GPFS home, I have applications running in cache, and some requests are requeued with the following error: SetXAttr file system fs1 fileset sw_gpfs file IDs [-1.1067121.-1.-1,N] name local error 124
Answer: mmafmconfig is not set up at home. Running the mmafmconfig command at home and relinking the cache should resolve this issue.

Question: During the failover process, some gateway nodes might show error 233 in mmfs.log.
Answer: This error is harmless. The failover completes successfully.

Question: Resync fails with No buffer space available error, but mmdiag --memory shows that memory is available.
Answer: Increase afmHardMemThreshold.

Question: How can I change the mode of a fileset?
Answer: The mode of an AFM client cache fileset cannot be changed from local-update mode to any other mode; however, it can be changed from read-only to single-writer (and vice versa), and from either read-only or single-writer to local-update. Complete the following steps to change the mode:
1. Ensure that the fileset status is active and that the gateway is available.
2. Unmount the file system.
3. Unlink the fileset.
4. Run the mmchfileset command to change the mode.
5. Mount the file system again.
6. Link the fileset again.

Question: Why are setuid or setgid bits in a single-writer cache reset at home after data is appended?
Answer: The setuid or setgid bits in a single-writer cache are reset at home after data is appended to files on which those bits were previously set and synced. This is because over NFS, a write operation to a setuid file resets the setuid bit.

Question: How can I traverse a directory that is not cached?
Answer: On a fileset whose metadata in all subdirectories is not cached, any application that optimizes by assuming that directories contain two fewer subdirectories than their hard link count does not traverse the last subdirectory. One such example is find; on Linux, a workaround for this is to use find -noleaf to correctly traverse a directory that has not been cached.

Question: What extended attribute size is supported?
Answer: For an operating system in the gateway whose Linux kernel version is below 2.6.32, the NFS max rsize is 32K, so AFM does not support an extended attribute size of more than 32K on that gateway.

Question: What should I do when my file system or fileset is getting full?
Answer: The .ptrash directory is present in cache and home. In some cases, where there is a conflict that AFM cannot resolve automatically, the file is moved to .ptrash at cache or home. In cache, the .ptrash gets cleaned up when eviction is triggered. At home, it is not cleared automatically. When the administrator is looking to clear some space, the .ptrash must be cleaned up first.

Question: How to restore an unmounted AFM fileset that uses GPFS protocol as backend?
Answer: If the NSD mount on the gateway node is unresponsive, AFM does not synchronize data with home. The filesystem might be unmounted at the gateway node. A message AFM: Remote filesystem remotefs is panicked due to unresponsive messages on fileset <fileset_name>, re-mount the filesystem after it becomes responsive. mmcommon preunmount invoked. File system: fs1 Reason: SGPanic is written to mmfs.log. After the home is responsive, you must restore the NSD mount on the gateway node.
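As a minimal sketch of the resumeRequeued procedure that is described in the first row of the previous table, assuming a cache file system fs1 and a single-writer fileset sw1 (both names are illustrative only):

   mmfsadm saferdump afm all            # on the gateway node, list queued and requeued messages
   mmafmctl fs1 resumeRequeued -j sw1   # rerun the requeued messages after fixing the error at home
   mmafmctl fs1 getstate -j sw1         # verify the fileset state and queue length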



Chapter 33. AFM DR issues
This topic lists the answers to the common AFM DR questions.

Table 66. Common questions in AFM DR with their resolution

Issue: How do I flush requeued messages?
Resolution: Sometimes, requests in the AFM messages queue on the gateway node get requeued due to errors at the home cluster. For example, if space is not available at the home cluster to perform a new write, a write message that is queued is not successful and gets requeued. The administrator views the failed message being requeued on the MDS. Add more space to the home cluster and run mmafmctl resumeRequeued so that the requeued messages are executed at home again. If mmafmctl resumeRequeued is not run by an administrator, AFM executes the message in the regular order of message executions from the cache cluster to the home cluster. Running mmfsadm saferdump afm all on the gateway node displays the queued messages. The requeued messages are displayed in the dumps. An example:

   c12c4apv13.gpfs.net: Normal Queue: (listed by execution order) (state:
   Active) c12c4apv13.gpfs.net: Write [612457.552962] requeued file3 (43 @
   293) chunks 0 bytes 0 0

Issue: Why is a fileset in the Unmounted or Disconnected state when parallel I/O is set up?
Resolution: Filesets that are using a mapping target go to the Disconnected mode if the NFS server of the MDS is unreachable, even if NFS servers of all participating gateways are reachable. The NFS server of the MDS must be checked to fix this problem.

Issue: How to clean unmount the secondary filesystem if there are caches using GPFS protocol as backend?
Resolution: To have a clean unmount of the secondary filesystem, the filesystem should first be unmounted on the primary cluster where it has been remotely mounted, and then the secondary filesystem should be unmounted. It might not be possible to unmount the remote file system from all nodes in the cluster until the relevant primary is unlinked or the local file system is unmounted.
Force unmount, shutdown, or crash of the remote cluster results in panic of the remote filesystem at the primary cluster and the queue gets dropped; the next access to the fileset runs recovery. However, this should not affect the primary cluster.

Issue: The df command hangs on the primary cluster.
Resolution: On RHEL 7.0 or later, df does not support hidden NFS mounts. As AFM uses regular NFS mounts on the gateway nodes, this change causes commands like df to hang if the secondary gets disconnected.
The following workaround can be used that allows NFS mounts to continue to be hidden: Remove the /etc/mtab symlink, create a new file /etc/mtab, and copy /proc/mounts to the /etc/mtab file during startup. In this solution, the mtab file might go out of sync with /proc/mounts.

Issue: What does the NeedsResync state imply?
Resolution: The NeedsResync state does not necessarily mean a problem. If this state occurs during a conversion or recovery, the problem gets automatically fixed in the subsequent recovery. You can monitor mmafmctl $fsname getstate to check whether its queue number is changing, and also check the GPFS logs for any errors, such as unmounted.

Issue: Is there a single command to delete all RPO snapshots from a primary fileset?
Resolution: No. All RPOs need to be manually deleted (a command sketch follows this table).

Issue: Suppose there are more than two RPO snapshots on the primary. Where did these snapshots come from?
Resolution: Check the queue. Check if recovery happened in the recent past. The extra snapshots will get deleted during subsequent RPO cycles.

Issue: How to restore an unmounted AFM DR fileset that uses GPFS protocol as backend?
Resolution: If the NSD mount on the gateway node is unresponsive, AFM DR does not synchronize data with the secondary. The filesystem might be unmounted at the gateway node. A message AFM: Remote filesystem remotefs is panicked due to unresponsive messages on fileset <fileset_name>, re-mount the filesystem after it becomes responsive. mmcommon preunmount invoked. File system: fs1 Reason: SGPanic is written to mmfs.log. After the secondary is responsive, you must restore the NSD mount on the gateway node.
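As a hedged illustration of the manual RPO snapshot cleanup that is mentioned in the previous table, assuming a primary file system fs1 and a primary fileset drPrimary (illustrative names; the snapshot name is a placeholder):

   mmlssnapshot fs1 -j drPrimary                         # list the RPO snapshots of the fileset
   mmdelsnapshot fs1 <RPO_snapshot_name> -j drPrimary    # delete one RPO snapshot; repeat for each extra snapshot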



Chapter 34. AFM to cloud object storage issues
The following table lists the common questions in the AFM to cloud object storage.

Table 67. Common questions in AFM to cloud object storage with their resolution

Question: What can be done if the AFM to cloud object storage relationship is in the Unmounted state?
Answer or Resolution: The relationship is in the Unmounted state because the buckets on a cloud object storage are not accessible, or the gateway can connect to the endpoint but cannot see the buckets. Check that the buckets have the correct configuration and keys.
Use the mmafmctl filesystem getstate command to check the fileset state.
The fileset state is transient. When the network issue or bucket issues are resolved, the state becomes Active. You need not use any commands to change the state.

Question: Why is an error message displayed while creating the AFM to cloud object storage relationship?
Answer or Resolution: The No keys for bucket <Bucket_Name> is set for server <Server_name> error message is displayed while creating the AFM to cloud object storage relationship. This error occurs because correct keys are not set for a bucket or no keys are set. Set the keys correctly for the bucket before the relation creation (a command sketch follows this table).

Question: Why are operations requeued on the gateway node?
Answer or Resolution: An IBM Storage Scale cluster supports special characters. But, when a cloud object storage does not support object or bucket names with special characters, they are requeued. Each cloud object storage provider has its own limitations for bucket and object names. For more information about the limitations, see the documentation that is provided by cloud object storage providers.

Question: What must be done when the AFM to cloud object storage relationship is disconnected?
Answer or Resolution: A primary gateway cannot connect to a cloud object storage endpoint. Check the endpoint configuration and the network connection between the cluster and the cloud object storage.

Question: What can be done when the fileset space limit is approaching?
Answer or Resolution: Callbacks such as lowDiskSpace or noDiskSpace can be set to confirm that the space limit of the fileset or file system is approaching. Allocate more storage to the pools that are defined.

Question: What can be done when messages in the queue are requeued because the cloud object storage is full?
Answer or Resolution: When the provisioned space on a cloud object storage is full, messages can be requeued on a gateway node. Provide more space on the cloud object storage and run the mmafmctl resumeRequeued command so that the requeued messages are run again.

Question: What can be done if there are waiters on the mmafmtransfer command but everything looks normal?
Answer or Resolution: When the objects are synchronized to a cloud object storage, the mmafmtransfer command-related waiters can be seen, especially for large objects, or when the application is creating or accessing multiple objects at a faster pace.

Question: Read seems to be stuck or inflight for a long time. What should be done?
Answer or Resolution: Check the status of the fileset by using the mmafmctl getstate command to see whether the fileset is in the Unmounted state. Check for network errors.

Question: How does the ls command claim the inodes after metadata eviction?
Answer or Resolution: When metadata is evicted, any operation that requires metadata, for example, the ls command, reclaims the metadata.

Question: Operations fail with the No buffer space available error message, but the mmdiag --memory command shows that memory is available. What should be done?
Answer or Resolution: Increase the afmHardMemThreshold value.

Question: Why are some directories not replicated?
Answer or Resolution: Empty directories are not replicated. There is no concept of a directory in a cloud object storage; directory names are prefixes to the objects.

Question: Why does the synchronous keys setting fail?
Answer or Resolution: Sometimes, the configuration file is locked for security purposes. Because of this, setting many keys on some directories might fail at a time. These keys can be set again.

Question: What can be done if E_PERM or error 13 occurs when you transfer an object?
Answer or Resolution: AFM to cloud object storage might get E_PERM or error 13 from a cloud object storage on an object if the cloud object storage returns such an error response. For example, for the signature mismatch error, see the AWS troubleshooting guide. An AFM to COS fileset tries to transfer these objects to the cloud object storage again, and the objects might be requeued on gateway nodes.
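The bucket key problem in the second row of the previous table is typically addressed with the mmafmcoskeys command. The following is only a sketch with assumed values (bucket1 and placeholder access and secret keys); verify the mmafmcoskeys syntax for your release before using it:

   mmafmcoskeys bucket1 set <access_key> <secret_key>   # set the access and secret keys for the bucket
   mmafmcoskeys bucket1 get                              # verify that the keys are registered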



Chapter 35. Transparent cloud tiering issues
This topic describes the common issues (along with workarounds) that you might encounter while using
Transparent cloud tiering.

Migration/Recall failures
If a migration or recall fails, retry the policy or CLI command that failed (up to two times) after clearing the
condition that caused the failure. This works because the Transparent cloud tiering service is idempotent.

mmcloudgateway: Internal Cloud services returned an error: MCSTG00098I: Unable to reconcile /ibm/fs1 - probably not an space managed file system.
This typically happens if the administrator has run the mmcloudgateway account delete command
before and has not restarted the service prior to invoking the migrate, reconcile, or other similar
commands. If the migration, reconcile, or any other Cloud services command fails with such a message,
restart the Cloud services once by using the mmcloudgateway service restart {-N node-class}
command and retry the failed command.

Starting or stopping Transparent cloud tiering service fails with the Transparent
cloud tiering seems to be in startup phase message
This is typically caused when the Gateway service is killed manually by using the kill command, without a
graceful shutdown by using the mmcloudgateway service stop command.

Adding a cloud account to configure IBM Cloud Object Storage fails with the
following error: 56: Cloud Account Validation failed. Invalid credential for Cloud
Storage Provider. Details: Endpoint URL Validation Failed, invalid username or
password.
Ensure that the appropriate user role is set through IBM Cloud® Object Storage dsNet Manager GUI.

HTTP Error 401 Unauthorized exception while you configure a cloud account
This issue happens when the clocks on the object storage server and the Gateway node are not
synchronized. Synchronize the time with an NTP server and retry the operation.

Account creation command fails after a long wait and IBM Cloud Object Storage
displays an error message saying that the vault cannot be created; but the vault is
created
When you look at the IBM Cloud Object Storage manager UI, you see that the vault exists. This problem
can occur if Transparent cloud tiering does not receive a successful return code from IBM Cloud Object
Storage for the vault creation request.
The most common reason for this problem is that the threshold setting on the vault template is incorrect.
If you have 6 IBM Cloud Object Storage slicestors and the write threshold is 6, then IBM Cloud Object
Storage expects that all the slicestors are healthy. Check the IBM Cloud Object Storage manager UI. If any
slicestors are in a warning or error state, update the threshold of the vault template.

Account creation command fails with error MCSTG00065E, but the data vault and
the metadata vault exist
The full error message for this error is as follows:



MCSTG00065E: Command Failed with following reason: Error checking existence of, or creating,
cloud container container_name or cloud metadata container container_name.meta.

But the data vault and the metadata vault are visible on the IBM Cloud Object Storage UI.
This error can occur if the metadata vault was created but its name index is disabled. To resolve this
problem, do one of the following actions:
• Enter the command again with a new vault name and vault template.
• Delete the vault on the IBM Cloud Object Storage UI and run the command again with the correct
--metadata-location.
Note: It is a good practice to disable the name index of the data vault. The name index of the metadata
vault must be enabled.

File or metadata transfer fails with koffLimitedRetryHandler:logError - Cannot retry after server error, command has exceeded retry limit, followed by RejectingHandler:exceptionCaught - Caught an exception com.ibm.gpfsconnector.messages.GpfsConnectorException: Unable to migrate
This is most likely caused by a network connectivity and/or bandwidth issue. Make sure that the network
is functioning properly and retry the operation. For policy-initiated migrations, IBM Storage Scale policy
scan might automatically retry the migration of the affected files on a subsequent run.

gpfs.snap: An Error was detected on node XYZ while invoking a request to collect
the snap file for Transparent cloud tiering: (return code: 137).
If the gpfs.snap command fails with this error, increase the value of the timeout parameter by using the
gpfs.snap --timeout Seconds option.
Note: If the Transparent cloud tiering log collection fails after the default timeout period expires, you can
increase the timeout value and collect the TCT logs. The default timeout is 300 seconds (or 5 minutes).
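For example, to triple the default timeout while collecting the snap data (900 seconds is an arbitrary illustrative value):

   gpfs.snap --timeout 900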

Migration fails with error: MCSTG00008E: Unable to get fcntl lock on inode. Another
MCStore request is running against this inode.
This happens because some other application might be having the file open, while Cloud services are
trying to migrate it.

Connect: No route to host Cannot connect to the Transparent Cloud Tiering service.
Please check that the service is running and that it is reachable over the network.
Could not establish a connection to the MCStore server
During any data command, if this error is observed, it is due to an abrupt shutdown of Cloud services
on one of the nodes. This happens when Cloud services are not stopped on a node explicitly by using the
mmcloudgateway service stop command, but the node loses power or the IBM Storage Scale
daemon is taken down. The node IP address is then still considered an active Cloud services
node, and the data commands that are routed to it fail with this error.

"Generic_error" in the mmcloudgateway service status output


This error indicates that the cloud object storage is unreachable. Ensure that there is outbound
connectivity to the cloud object storage. Logs indicate an exception about the failure.



An unexpected exception occurred during directory processing : Input/output error
You might encounter this error while migrating files to the cloud storage tier. To fix this, check the status
of NSDs and ensure that the database/journal files are not corrupted and can be read from the file system.

It is marked for use by Transparent Cloud Tiering


You might encounter this error when you try to remove a Cloud services node from a cluster. To resolve
this, use the --force option with the mmchnode command as follows:

mmchnode --cloud-gateway-disable -N nodename --cloud-gateway-nodeclass nodeclass --force

Container deletion fails


You might encounter this error when you try to remove a container pairset that is marked for Cloud
services. To resolve this, use the --force option with the mmcloudgateway command as follows:

mmcloudgateway containerpairset delete --container-pair-set-name x13


--cloud-nodeclass cloud --force



Chapter 36. File audit logging issues
The following topics discuss issues that you might encounter in file audit logging.
The audit_parser script is helpful for troubleshooting the events in a file system by viewing logs with
reduced or specific information. For more information, see “Monitoring file audit logging using audit log
parser” on page 235.

Failure of mmaudit because of the file system level


The following problem can occur if you upgrade IBM Storage Scale without completing the permanent
cluster upgrade.

# mmaudit TestDevice enable


[E] File system device TestDevice is not at a supported file system level for audit logging.
The File Audit Logging command associated with device: TestDevice cannot be completed.
Choose a device that is at the minimum supported file system level
and is accessible by all nodes associated with File Audit Logging and try the command again.
mmaudit: Command failed. Examine previous error messages to determine cause.

See Completing the upgrade to a new level of IBM Storage Scale in the IBM Storage Scale: Concepts,
Planning, and Installation Guide to get the cluster and file system to enable new functionality.
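As a minimal sketch of the sequence that is described in that topic, the cluster configuration and the file system format are typically brought to the new level as follows (TestDevice is the device from the example above; confirm the exact steps in the referenced topic before running them):

   mmchconfig release=LATEST    # complete the permanent cluster upgrade
   mmchfs TestDevice -V full    # enable the new file system format features
   mmaudit TestDevice enable    # retry enabling file audit logging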

JSON reporting issues in file audit logging


This topic describes limitations that the user might observe in the file audit logging JSON logs.
SMB
• For a created file, there can be many OPEN and CLOSE events.
• For a deleted file, a DESTROY event might never get logged. Instead, the JSON might show an
UNLINK event only.
NFS
• For NFSv3, upon file creation, audit logs only get an OPEN event without a corresponding CLOSE
event. The CLOSE event only occurs when a subsequent operation is done on the file (for example,
RENAME).
• The file path name and the NFS IP address for kNFS might not always be available.
Object
Object is not supported for file audit logging in IBM Storage Scale.
Note: In addition, file activity in the primary object fileset does not generate events.
Unified file
Unified file is not supported for file audit logging in IBM Storage Scale.
Access denied event
When access to a file might be overridden to allow a user to operate on a file, an ACCESS_DENIED
event might be generated even if the operation went through. For example, if a user without write
permission on a file tries to delete the file, but has read, write, and execute permissions on the parent
directory, there might be an ACCESS_DENIED event generated even though the delete operation goes
through.
Access to files through protocols might be denied at the protocol level (NFS and SMB). In these
scenarios, file audit logging does not generate an event for ACCESS_DENIED because IBM Storage
Scale is not made aware that access to the file was denied.
An ACCESS_DENIED event is not generated for failed attempts to write to immutable files or for
attempts to, for example, truncate appendOnly files.



File audit logging issues with remotely mounted file systems
Collect gpfs.snap and other pertinent data from both the owning and accessing clusters to help
diagnose the issue.
• Use mmlsfs <Device> --file-audit-log on the accessing cluster to determine if a remotely
mounted file system is under audit.
• While upgrading the owning cluster to IBM Storage Scale 5.0.2, if I/Os are running from the accessing
cluster to the remotely mounted file systems, the user is likely to see file audit logging errors logged by
the accessing cluster.

Audit fileset creation issues when enabling file audit logging


Before creating the audit fileset, the mmaudit command will check to see if the fileset exists.
If the fileset exists and it is not already in IAM mode noncompliant, the command will fail. Similarly, if the
--compliant flag is provided when enabling file audit logging and the fileset exists and is not already in
IAM mode compliant, the command will fail.
To view the IAM compliance mode of the audit filesets for existing file systems with file audit logging
enabled, use the mmaudit all list command with the -Y option to show machine-readable output.
The compliance mode is given as one of the colon-delimited configuration values.

Failure to append messages to Buffer Pool


File audit logging uses buffers to hold JSON events in memory before flushing them to disk.
If the buffer pool becomes resource constrained, I/O threads wait a short amount of time for space to
become available before dropping events. This is done to prevent throttling the file system while waiting
for available buffer space.
This issue might happen periodically if the I/O load on the node is heavy, but it does not indicate a fatal error.
The message Failed to append to Buffer Pool causes the health state of audit to go to degraded,
and it should clear itself from system health once buffer space becomes available. In the logs, a message
like < Wrote message to audit log successfully! Total messages that could not be
sent: 372. > appears after normal function resumes, with the number of messages that had to be
dropped displayed in the log entry.
It is recommended to limit the scope of the audit if possible. Rather than auditing all events on
the entire file system, remove events or filesets that do not need to be audited from the audit. This can
reduce buffer space contention, but it is not guaranteed to solve the issue.
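A hedged sketch of narrowing the audit scope, assuming a file system fs1 and that only a subset of events is of interest (the --events option and the event names shown here are assumptions; verify them against the mmaudit documentation for your release):

   mmaudit fs1 disable                               # disable the current, broader audit
   mmaudit fs1 enable --events CREATE,RENAME,UNLINK  # re-enable auditing for selected events only
   mmaudit all list -Y                               # confirm the new audit configuration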



Chapter 37. Cloudkit issues
Learn about common issues and workarounds that you might encounter while running cloudkit on
Amazon Web Services (AWS) or Google Cloud Platform (GCP).

Cloudkit issues on AWS


This topic describes the common issues and workarounds that you might encounter while running
cloudkit on Amazon Web Services (AWS).

Create cluster issues


• Cloudkit produces logs that detail any errors or issues encountered during the provisioning and
configuration process. Review the logs to identify any specific errors or issues that may be causing
problems.
• Review the cloudkit inputs to ensure that all values are set correctly.
Quota issues
IBM Storage Scale cluster creation may fail due to insufficient quotas in the AWS account.

creating EC2 EIP: AddressLimitExceeded: The maximum number of addresses has been reached."}
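As a hedged sketch of reviewing and raising the relevant quota with the AWS CLI (the quota code is a placeholder; look up the exact code for the Elastic IP quota in your account):

   aws service-quotas list-service-quotas --service-code ec2                                                            # list the current EC2 service quotas
   aws service-quotas request-service-quota-increase --service-code ec2 --quota-code <quota-code> --desired-value 10    # request an increase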

iamInstanceProfile issue
Cluster creation fails with the error:

Failed: Value (ibm-storage-scale-20230510042321111100000001) for parameter


iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name. Launching EC2 instance
failed.

Fix: Rerun cloudkit cluster creation with same parameters.


Credential issues
• Check your AWS credentials. Ensure that the AWS credentials used to provision resources through
cloudkit are valid and have the appropriate permissions. Verify this by checking the AWS access key and
secret key used by cloudkit and confirming that they have the necessary IAM permissions.

Error listing service quotas: ec2 AccessDeniedException: User: arn:aws:iam::1111111111:user/


testuser is not authorized to perform

Transient network errors


• Cluster creation failure due to transient network errors:
1. In some circumstances, latent network issues might be encountered while deploying new cluster
resources.
2. If this is encountered, re-run the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

4. When cloudkit is executed from an on-premise host behind NAT, the IP address displayed as the
default option may result in connectivity problems. It is recommended to cross-verify the host public
IP address during the inputs section.
• If an IBM Storage Scale related problem is suspected, collect data by running a gpfs.snap. Upload this
gpfs.snap to the IBM Storage Scale support ticket that is opened.



• Plan your network infrastructure to ensure reliable communication between the installer node and the cloud.
With jump host based connectivity, it can take a little longer for SSH to reach the node; if there is a
network drop, it is recommended to re-run the command.
Host unreachable
Cluster creation fails with the error:

Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote
host\", \"unreachable

Fix: Rerun cloudkit cluster creation with same parameters.


Instance boot timeout issues
Cluster creation might fail if an instance takes more than 10 minutes to finish the boot-up.

waiting for EC2 Instance create: timeout while waiting for state to become 'running' (last
state: 'pending', timeout: 10m0s)

Fix: Verify whether the instance has finished booting up. Clean up this instance (irrespective of its state) and
retry the cluster creation.
KMS (Key Management Service) key issue
Cluster creation fails with the error when EBS encryption is selected:

create: unexpected state 'shutting-down', wanted target 'running'. last error:


Client.InternalError: Client error on launch\u001b[0m

Fix: Check if the user executing the cloudkit has read access to the provided KMS key.

Delete cluster issue


The cloudkit delete cluster command might encounter failures for the following reasons:
1. If any AWS resources have been provisioned into the VPC outside of the cloudkit, the cloudkit delete
command will be unable to delete these manually created resources. In this case, all resources
created outside of the cloudkit must be deleted manually before the cloudkit delete operation can be
executed.
2. When IBM Storage Scale clusters have been deployed into a VPC that has been previously created by
another cloudkit create execution, ensure to delete the clusters in the correct order. If the clusters are
not deleted in the proper order then it could fail as resources are still in the VPC. For more information,
see Cleanup in the topic, "Understanding the cloudkit installation options" in the IBM Storage Scale:
Concepts, Planning, and Installation Guide.
3. If a session is open and a user is inside a file system mount point.

Grant filesystem issue


Credential issues
• Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered
during the provisioning and configuration process. Review the logs to identify any specific errors or
issues that may be causing problems.
• If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the
credentials provided for accessing the GUI are correct and rerun if necessary after rectifying the
credentials.
Transient network errors
Cluster creation failure due to transient network errors:
1. In some circumstances, latent network issues might be encountered while deploying new cluster
resources.



2. If this is encountered, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Revoke file system issue


Credential issues
• Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered
during the provisioning and configuration process. Review the logs to identify any specific errors or
issues that may be causing problems.
• If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the
credentials provided for accessing the GUI are correct and rerun if necessary after rectifying the
credentials.
Transient network errors
Cluster creation failure due to transient network errors:
1. In some circumstances it might be possible to encounter latent network issues when deploying new
cluster resources.
2. If this is encountered, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Cluster upgrade issue


Incompatible self-extracting packages
Each IBM Storage Scale edition requires a specific upgrade package. For example, the self-extracting
package for the IBM Storage Scale Developer Edition does not upgrade a cluster where the deployed
edition of IBM Storage Scale is Advanced Edition, Data Management Edition, or Data Access Edition.
If you attempt to upgrade a cluster by using a self-extracting package that does not match the IBM
Storage Scale edition deployed on that cluster, the upgrade fails with the following error:

"Cannot upgrade node 10.0.1.199 due to packages dependent on GPFS. If these are known
external dependencies, you can choose to override by setting the environment variable
\"SSFEATURE_OVERRIDE_EXT_PKG_DEPS=true\" environment variable. Instead if you would like to
continue an upgrade on all other nodes using the install toolkit, please remove this node from
the cluster definition via: spectrumscale node delete 10.0.1.199 and then re-run spectrumscale
upgrade. Otherwise, either remove the dependent packages manually or manually upgrade GPFS on
this node."}

Fix: Make sure to use a self-extracting package that matches the IBM Storage Scale edition deployed
on that cluster, and rerun the cloudkit create cluster command.

Cloudkit issues on GCP


This topic describes the common issues and workarounds that you might encounter while running
cloudkit on Google Cloud Platform (GCP).

Create cluster issues


• Cloudkit produces logs that detail any errors or issues encountered during the provisioning and
configuration process. Review the logs to identify any specific errors or issues that may be causing
problems.
• Review the cloudkit inputs to ensure that all values are set correctly.



Quota issues
IBM Storage Scale cluster creation may fail due to insufficient quotas in the GCP account.
For example, if the routers quota is exceeded, the following error message is logged:

Error waiting to create Router: Error waiting for Creating Router: Quota 'ROUTERS' exceeded.
Limit: 20.0 globally.

Fix: Increase the required quota or free some required resource.


Note: You can check if all the required quotas are met by running the ./cloudkit validate quota
command.
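The project-level quotas (including ROUTERS) can also be reviewed directly with the gcloud CLI, shown here only as a hedged sketch with a placeholder project ID:

   gcloud compute project-info describe --project <PROJECT_ID>   # the quotas section lists metric, limit, and usage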
Credential issues
• Check your GCP credentials. Ensure that the GCP credentials used to provision resources through
cloudkit are valid and have the appropriate permissions. Verify this by checking the GCP credential
JSON file used by cloudkit and confirming that they have the necessary permissions.

google: could not find default credentials. See https://cloud.google.com/docs/authentication/


external/set-up-adc for more information

Transient network errors


• Cluster creation failure due to transient network errors:
1. In some circumstances, latent network issues might be encountered while deploying new cluster
resources.
2. If this is encountered, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:

Error waiting for instance to create: error while retrieving operation:


Get \"https://compute.googleapis.com/compute/v1/projects/<PROJECT_ID>/zones/us-east1-b/
operations/operation-1696304597507-1111111-e49f6fc6-a5b7ecc4?alt=json&prettyPrint=false\":
http2: client connection lost

4. If the problem still persists after rerunning the command, delete the existing cluster and create a
new cluster.
• If an IBM Storage Scale related problem is suspected, collect data by running a gpfs.snap. Upload this
gpfs.snap to the IBM Storage Scale support ticket that is opened.
• Plan your network infrastructure to ensure reliable communication between the installer node and the cloud.
With jump host based connectivity, it can take a little longer for SSH to reach the node; if there is a
network drop, it is recommended to re-run the command.
Plug-in load errors
Cluster creation fails while the terraform plug-in is loading and the following error message appears:

Error: Failed to load plugin schemas


Error: while loading schemas for plugin components: Failed to obtain schema: Could not load the
schema for provider

Fix: Rerun the ./cloudkit command with the same parameters.

Delete cluster issue


The cloudkit delete cluster command might encounter failures for the following reasons:
• If any GCP resources have been provisioned into the VPC outside of the cloudkit, the cloudkit delete
command will be unable to delete these manually created resources. In this case, all resources created
outside of the cloudkit must be deleted manually before the cloudkit delete operation can be executed.



When this issue occurs, the following message may be displayed:

I: Destroy all remote objects managed by terraform (instance_template) configuration.


I: Destroy all remote objects managed by terraform (bastion_template) configuration.
I: Destroy all remote objects managed by terraform (vpc_template) configuration.
E: Delete cluster received error: exit status 1

Fix: Manually clean the cloud resources and contact IBM Support.

Grant filesystem issue


Credential issues
• Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered
during the provisioning and configuration process. Review the logs to identify any specific errors or
issues that may be causing problems.
• If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the
credentials provided for accessing the GUI are correct and rerun if necessary after rectifying the
credentials.
Transient network errors
Cluster creation failure due to transient network errors:
1. In some circumstances, latent network issues might be encountered while deploying new cluster
resources.
2. If this is encountered, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Revoke file system issue


Credential issues
• Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered
during the provisioning and configuration process. Review the logs to identify any specific errors or
issues that may be causing problems.
• If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the
credentials provided for accessing the GUI are correct and rerun if necessary after rectifying the
credentials.
Transient Network errors
Cluster creation failure due to transient network errors:
1. In some circumstances it might be possible to encounter latent network issues when deploying new
cluster resources.
2. If this is encountered, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Edit cluster issue


Quorum node issues
After running edit on any cluster, the following GPFS TIPS is displayed if you run the mmhealth node
show command:

GPFS TIPS 1 hour ago callhome_not_enabled, quorum_too_little_nodes



This means that the edit operation does not assign any new quorum nodes.
Fix: Manually assign additional quorum nodes from the newly added nodes.



Chapter 38. Troubleshooting mmwatch
This topic contains information for troubleshooting clustered watch folder.
Use the following log names and locations to troubleshoot mmwatch:
• /var/adm/ras/mmwatch.log contains information about the setup and configuration operations that
affect the clustered watch while in process. Valid on any node that is running the clustered watch
command or location where the conduit might be running.
• /var/adm/ras/mmwfclient.log contains information about clustered watch folder connections.
• /var/adm/ras/mmfs.log.latest is the daemon log. It contains entries when major cluster watch
activity occurs.
• /var/log/messages (Red Hat) or /var/log/syslog (Ubuntu) also might contain messages from the
producer and consumers that are running on a node.
• It can take up to 10 minutes for system health to refresh the clustered watch folder components. This
means that the accurate state of a watch might not be reflected until 10 minutes after an event is sent.
If the state of a watch is FAILED, the mmhealth node show --refresh -N all command
can be used to force an update of the states.
• When you view the configuration with mmwatch all list, the output might show the incomplete
state in one of the fields. The incomplete state means that a disable operation did not complete
successfully. To remedy this, attempt to disable the watch again, as shown in the sketch after this list.
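A minimal sketch of retrying the disable, assuming the watch was configured on a file system named fs1 (an illustrative name; mmwatch all list shows the actual device and watch details):

   mmwatch all list     # identify the watch that shows the incomplete state
   mmwatch fs1 disable  # attempt the disable operation again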



Chapter 39. Maintenance procedures
The directed maintenance procedures (DMPs) assist you to fix certain issues reported in the system.

Directed maintenance procedures available in the GUI


The directed maintenance procedures (DMPs) assist you to repair a problem when you select the action
Run fix procedure on a selected event from the Monitoring > Events page. DMPs are present for only a
few events reported in the system.
The following table provides details of the available DMPs and the corresponding events.

Table 68. DMPs


DMP Event ID
Replace disks gnr_pdisk_replaceable
Update enclosure firmware enclosure_firmware_wrong
Update drive firmware drive_firmware_wrong
Update host-adapter firmware adapter_firmware_wrong
Start NSD disk_down
Start GPFS daemon gpfs_down
Increase fileset space inode_error_high and inode_warn_high
Start performance monitoring collector service pmcollector_down
Start performance monitoring sensor service pmsensors_down
Activate AFM performance monitoring sensors afm_sensors_inactive
Activate NFS performance monitoring sensors nfs_sensors_inactive
Activate SMB performance monitoring sensors smb_sensors_inactive
Configure NFS sensor nfs_sensors_not_configured
Configure SMB sensor smb_sensors_not_configured
Mount file systems unmounted_fs_check
Start GUI service on remote node gui_down
Repair a failed GUI refresh task gui_refresh_task_failed

Start NSD
The Start NSD DMP assists to start NSDs that are not working.
The following are the corresponding event details and the proposed solution:
• Event ID: disk_down
• Problem: The availability of an NSD is changed to “down”.
• Solution: Recover the NSD.
The DMP provides the option to start the NSDs that are not functioning. If multiple NSDs are down, you
can select whether to recover only one NSD or all of them.



The system issues the mmchdisk command to recover NSDs as given in the following format:

/usr/lpp/mmfs/bin/mmchdisk <device> start -d <disk description>

For example: /usr/lpp/mmfs/bin/mmchdisk r1_FS start -d G1_r1_FS_data_0

Start GPFS daemon


When the GPFS daemon is down, GPFS functions do not work properly on the node.
The following are the corresponding event details and the proposed solution:
• Event ID: gpfs_down
• Problem: The GPFS daemon is down. GPFS is not operational on node.
• Solution: Start GPFS daemon.
The system issues the mmstartup -N command to restart GPFS daemon as given in the following
format:

/usr/lpp/mmfs/bin/mmstartup -N <Node>

For example: usr/lpp/mmfs/bin/mmstartup -N gss-05.localnet.com

Increase fileset space


The system needs inodes to allow I/O on a fileset. If the inodes allocated to the fileset are exhausted, you
need to either increase the number of maximum inodes or delete the existing data to free up space.
The procedure helps to increase the maximum number of inodes by a percentage of the already allocated
inodes. The following are the corresponding event details and the proposed solution:
• Event ID: inode_error_high and inode_warn_high
• Problem: The inode usage in the fileset reached an exhausted level.
• Solution: Increase the maximum number of inodes.
The system issues the mmchfileset command to increase the inode limit as given in the following format:

/usr/lpp/mmfs/bin/mmchfileset <Device> <Fileset> --inode-limit <inodesMaxNumber>

For example: /usr/lpp/mmfs/bin/mmchfileset r1_FS testFileset --inode-limit 2048

Synchronize node clocks


The time must be in sync with the time set on the GUI node. If the time is not in sync, the data that is
displayed in the GUI might be wrong or might not be displayed at all. For example, the GUI does not
display the performance data if the time is not in sync.
The procedure assists to fix timing issue on a single node or on all nodes that are out of sync. The
following are the corresponding event details and the proposed solution:
• Event ID: time_not_in_sync
• Limitation: This DMP is not available in sudo wrapper clusters. In a sudo wrapper cluster,
the user name is different from 'root'. The system detects the user name by finding the
parameter GPFS_USER=<user name>, which is available in the file /usr/lpp/mmfs/gui/conf/
gpfsgui.properties.
• Problem: The time on the node is not synchronized with the time on the GUI node. It differs by more than 1
minute.
• Solution: Synchronize the time with the time on the GUI node.



The system issues the sync_node_time command as given in the following format to synchronize the
time on the nodes:

/usr/lpp/mmfs/gui/bin-sudo/sync_node_time <nodeName>

For example: /usr/lpp/mmfs/gui/bin-sudo/sync_node_time c55f06n04.gpfs.net
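To get a quick overview of how far the clocks have drifted, you can compare the time that is reported by all nodes. This is only a sketch that uses the mmdsh utility:

/usr/lpp/mmfs/bin/mmdsh -N all date

Nodes whose output differs by more than one minute from the GUI node are candidates for this DMP.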

Start performance monitoring collector service


The collector services on the GUI node must be functioning properly to display the performance data in
the IBM Storage Scale management GUI.
The following are the corresponding event details and the proposed solution:
• Event ID: pmcollector_down
• Limitation: This DMP is not available in sudo wrapper clusters when a remote pmcollector service
is used by the GUI. A remote pmcollector service is detected when a value other than
localhost is specified for the ZIMonAddress parameter in the file /usr/lpp/mmfs/gui/conf/
gpfsgui.properties. In a sudo wrapper cluster, the user name is different from 'root'. The system
detects the user name by finding the parameter GPFS_USER=<user name>, which is available in the
file /usr/lpp/mmfs/gui/conf/gpfsgui.properties.
• Problem: The performance monitoring collector service pmcollector is in inactive state.
• Solution: Issue the systemctl status pmcollector command to check the status of the collector. If
the pmcollector service is inactive, issue the systemctl start pmcollector command.
The system restarts the performance monitoring services by issuing the systemctl restart
pmcollector command.
The performance monitoring collector service might be on some other node of the current cluster. In this
case, the DMP first connects to that node, then restarts the performance monitoring collector service.

ssh <nodeAddress> systemctl restart pmcollector

For example: ssh 10.0.100.21 systemctl restart pmcollector


In a sudo wrapper cluster, when the collector on a remote node is down, the DMP does not restart the collector
service by itself. You need to restart it manually.
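The following is only a sketch of such a manual restart; the user name gpfsadmin and the node name node-22 are placeholders and must be replaced with the GPFS_USER value and the collector node of your cluster:

ssh gpfsadmin@node-22 sudo systemctl restart pmcollector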

Start performance monitoring sensor service


You need to start the sensor service so that the collectors receive performance data. If the sensors and
collectors are not started, the performance data is not displayed in the IBM Storage Scale management
GUI or CLI.
The following are the corresponding event details and the proposed solution:
• Event ID: pmsensors_down
• Limitation: This DMP is not available in sudo wrapper clusters. In a sudo wrapper cluster,
the user name is different from 'root'. The system detects the user name by finding the
parameter GPFS_USER=<user name>, which is available in the file /usr/lpp/mmfs/gui/conf/
gpfsgui.properties.
• Problem: The performance monitoring sensor service pmsensor is not sending any data. The service
might be down or the difference between the time of the node and the node hosting the performance
monitoring collector service pmcollector is more than 15 minutes.
• Solution: Issue the systemctl status pmsensors command to verify the status of the sensor service. If
the pmsensors service is inactive, issue the systemctl start pmsensors command.
The system restarts the sensors by issuing the systemctl restart pmsensors command.
For example: ssh gss-15.localnet.com systemctl restart pmsensors
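To confirm that the sensor data reaches the collector again, you can check the performance monitoring state of the node. This is only a sketch and assumes that the performance monitoring component is configured on the node:

/usr/lpp/mmfs/bin/mmhealth node show PERFMON

The PERFMON component should return to the HEALTHY state after the sensors start sending data.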



Activate AFM performance monitoring sensors
The Activate AFM performance monitoring sensors DMP assists you in activating the inactive AFM sensors.
The following are the corresponding event details and the proposed solution:
• Event ID: afm_sensors_inactive
• Problem: The AFM performance cannot be monitored because one or more of the performance sensors
like GPFSAFMFS, GPFSAFMFSET, and GPFSAFM are offline.
• Solution: Activate the AFM sensors.
The DMP provides the option to activate the AFM monitoring sensor and select a data collection interval
that defines how frequently the sensors must collect data. It is recommended to select a value that
is greater than or equal to 10 as the data collection frequency to reduce the impact on the system
performance.
The system issues the mmperfmon command to activate AFM sensors as given in the following format:

/usr/lpp/mmfs/bin/mmperfmon config update <<sensor_name>>.restrict=<<afm_gateway_nodes>>


/usr/lpp/mmfs/bin/mmperfmon config update <<sensor_name>>.period=<<seconds>>

For example,

/usr/lpp/mmfs/bin/mmperfmon config update GPFSAFM.restrict=gss-41


/usr/lpp/mmfs/bin/mmperfmon config update GPFSAFM.period=30
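To confirm the resulting sensor settings, you can display the active performance monitoring configuration. This is only a sketch:

/usr/lpp/mmfs/bin/mmperfmon config show | grep -A 3 GPFSAFM

The output should show the restrict and period values that were set by the DMP.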

Activate NFS performance monitoring sensors


The Activate NFS performance monitoring sensors DMP assists you in activating the inactive NFS sensors.
The following are the corresponding event details and the proposed solution:
• Event ID: nfs_sensors_inactive
• Problem: The NFS performance cannot be monitored because the performance monitoring sensor
NFSIO is inactive.
• Solution: Activate the NFS sensors.
The DMP provides the option to activate the NFS monitoring sensor and select a data collection interval
that defines how frequently the sensors must collect data. It is recommended to select a value that
is greater than or equal to 10 as the data collection frequency to reduce the impact on the system
performance.
The system issues the mmperfmon command to activate the sensors as given in the following format:

/usr/lpp/mmfs/bin/mmperfmon config update NFSIO.restrict=cesNodes NFSIO.period=<<seconds>>

For example: /usr/lpp/mmfs/bin/mmperfmon config update NFSIO.restrict=cesNodes NFSIO.period=10

Activate SMB performance monitoring sensors


The Activate SMB performance monitoring sensors DMP assists you in activating the inactive SMB sensors.
The following are the corresponding event details and the proposed solution:
• Event ID: smb_sensors_inactive
• Problem: The SMB performance cannot be monitored because either one or both the SMBStats and
SMBGlobalStats sensors are inactive.
• Solution: Activate the SMB sensors.
The DMP provides the option to activate the SMB monitoring sensor and select a data collection interval
that defines how frequently the sensors must collect data. It is recommended to select a value that
is greater than or equal to 10 as the data collection frequency to reduce the impact on the system
performance.
The system issues the mmperfmon command to activate the sensors as given in the following format:

/usr/lpp/mmfs/bin/mmperfmon config update SMBStats.restrict=cesNodes SMBStats.period=<<seconds>>

For example: /usr/lpp/mmfs/bin/mmperfmon config update SMBStats.restrict=cesNodes SMBStats.period=10

Configure NFS sensors


The Configure NFS sensors DMP assists you in configuring the NFS sensor.
The following are the details of the corresponding event:
• Event ID: nfs_sensors_not_configured
• Problem: The configuration details of the NFS sensor are not available in the sensor configuration.
• Solution: The sensor configuration is stored in a temporary file that is located at: /var/lib/
mmfs/gui/tmp/sensorDMP.txt. The DMP provides options to enter the following details in the
sensorDMP.txt file and later add them to the configuration by using the mmperfmon config add
command.

Table 69. NFS sensor configuration example

Sensor: NFSIO
Restrict to nodes: Node class cesNodes
Intervals: 1, 5, 10, 15, 30 (the default value is 10)
Contents of the sensorDMP.txt file:

sensors={
name = "sensorName"
period = period
proxyCmd = "/opt/IBM/zimon/GaneshaProxy"
restrict = "cesNodes"
type = "Generic"
}

Only users with ProtocolAdministrator, SystemAdministrator, SecurityAdministrator, and Administrator
roles can use this DMP to configure the NFS sensor.
After you complete the steps in the DMP, refresh the configuration by issuing the following command:

/usr/lpp/mmfs/bin/mmhealth node show nfs --refresh -N cesNodes

Issue the mmperfmon config show command to verify whether the NFS sensor is configured properly.
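If you perform the equivalent steps manually instead of through the DMP, the sensor definition can be added from the temporary file in one command. This is only a sketch and assumes that the file was written to the path mentioned above:

/usr/lpp/mmfs/bin/mmperfmon config add --sensors /var/lib/mmfs/gui/tmp/sensorDMP.txt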

Configure SMB sensors


The Configure SMB sensors DMP assists you in configuring the SMB sensors.
The following are the details of the corresponding event:
• Event ID: smb_sensors_not_configured
• Problem: The configuration details of the SMB sensor are not available in the sensor configuration.
• Solution: The sensor configuration is stored in a temporary file that is located at: /var/lib/
mmfs/gui/tmp/sensorDMP.txt. The DMP provides options to enter the following details in the
sensorDMP.txt file and later add them to the configuration by using the mmperfmon config add
command.



Table 70. SMB sensor configuration example

Sensors: SMBStats, SMBGlobalStats
Restrict to nodes: Node class cesNodes
Intervals: 1, 5, 10, 15, 30 (the default value is 10)
Contents of the sensorDMP.txt file:

sensors={
name = "sensorName"
period = period
restrict = "cesNodes"
type = "Generic"
}

Only users with ProtocolAdministrator, SystemAdministrator, SecurityAdministrator, and Administrator
roles can use this DMP to configure the SMB sensor.
After you complete the steps in the DMP, refresh the configuration by issuing the following command:

/usr/lpp/mmfs/bin/mmhealth node show SMB --refresh -N cesNodes

Issue the mmperfmon config show command to verify whether the SMB sensor is configured properly.
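The same approach applies when the SMB sensor definition is added manually. This is only a sketch and assumes that the temporary file described above contains the SMB sensor definition:

/usr/lpp/mmfs/bin/mmperfmon config add --sensors /var/lib/mmfs/gui/tmp/sensorDMP.txt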

Mount file system if it must be mounted


The mount file system DMP assists you in mounting the file systems that must be mounted.
The following are the details of the corresponding event:
• Event ID: unmounted_fs_check
• Problem: A file system is assumed to be mounted all the time because it is configured to mount
automatically, but the file system is currently not mounted on all nodes.
• Solution: Mount the file system on the node where it is not mounted.
Only users with ProtocolAdministrator, SystemAdministrator, SecurityAdministrator, and Administrator
roles can use this DMP to mount the file systems on the required nodes.
If there is more than one instance of the unmounted_fs_check event for the file system, you can choose
whether to mount the file system on all nodes where it is supposed to be mounted but is not.
The DMP issues the following command for mounting the file system on one node:

mmmount Filesystem -N Node

The DMP issues the following command for mounting the file system on several nodes if automatic mount
is not included:

mmmount Filesystem -N all

The DMP issues the following command for mounting the file system on certain nodes if automatic mount
is not included in those nodes:

mmmount Filesystem -N Nodes (comma-separated list)
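The following is only a sketch of the last variant; the file system name gpfs0 and the node names are placeholders for the names that are reported by the unmounted_fs_check events:

mmmount gpfs0 -N node-24,node-25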

Note: Nodes where the file /var/mmfs/etc/ignoreStartupMount.filesystem or /var/mmfs/etc/ignoreStartupMount exists are excluded from automatic mount of this file system.
After running the mmmount command, the DMP waits until the unmounted_fs_check event disappears
from the event list. If the unmounted_fs_check event is not removed from the event list within 120
seconds, a warning message is displayed.

Start the GUI service on the remote nodes


You can start the GUI service on the remote nodes by using this DMP.
The following are the details of the corresponding event:



• Event ID: gui_down
• Problem: A GUI service is supposed to be running but it is down.
• Solution: Start the GUI service.
• Limitation: This DMP can only be used if GUI service is down on the remote nodes.
Only users with ProtocolAdministrator, SystemAdministrator, SecurityAdministrator, and Administrator
roles can use this DMP to start the GUI service on the remote nodes.
The DMP issues the systemctl restart gpfsgui command to start the GUI service on the remote
node.
After running the systemctl restart gpfsgui command, the DMP waits until the gui_down event
disappears from the event list. If the gui_down event is not removed from the event list within 120
seconds, a warning message is displayed.
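To check the service state on a remote node before or after the DMP runs, you can query it directly. This is only a sketch; the node name node-22 is a placeholder for the remote GUI node:

ssh node-22 systemctl status gpfsgui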

Repair a failed GUI refresh task


The refresh tasks help to display the latest information in the GUI. This directed maintenance procedure
(DMP) assists you in repairing any failed GUI refresh tasks.
The following are the corresponding event details and the proposed solution:
• Event ID: gui_refresh_task_failed
• Problem: One or more GUI refresh tasks have failed to complete the process.
• Solution: Show the last execution error to the user. If this does not indicate the problem, the user can run
the task with the debug option to see more details.
• Limitations: This DMP works only if the tasks failed on the local GUI. The user must switch to the remote
GUI node if the tasks failed on a remote GUI.
You can invoke this DMP from the Monitoring > Events page. If there are multiple failed refresh tasks, you
need to address the issues one by one in the DMP wizard.
The system issues the following command to debug the issue:

/usr/lpp/mmfs/gui/cli/runtask <taskname> --debug

Note: Only the users with Storage Administrator, System Administrator, Security Administrator, and
Administrator user roles can launch this DMP.
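The following is only a sketch of running a failed task again with debugging enabled; the task name FILESETS is a hypothetical placeholder for the refresh task that is reported as failed:

/usr/lpp/mmfs/gui/cli/runtask FILESETS --debug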

Directed maintenance procedures for tip events


The directed maintenance procedures (DMPs) assist you in repairing a problem when you select the action
Run fix procedure on a selected event from the GUI > Monitoring > Tips page. DMPs are present for the
following tip events reported in the system.
Attention:
If you run these DMPs manually on the command line, the tip event will not reset immediately.



Table 71. Tip events list

Reporting component: gpfs
Event names: gpfs_pagepool_small, gpfs_pagepool_ok
Conditions: The actively used GPFS pagepool setting (mmdiag --config | grep pagepool) is lower than or equal to 1 GB.
Fix procedure:
• To change the value and make it effective immediately, use the following command:

mmchconfig pagepool=<value> -i

where <value> is a value higher than 1GB.
• To change the value and make it effective after the next GPFS recycle, use the following command:

mmchconfig pagepool=<value>

where <value> is a value higher than 1GB.
• To ignore the event, use the following command:

mmhealth event hide gpfs_pagepool_small

Reporting component: AFM component
Event names: afm_sensors_inactive, afm_sensors_active
Prerequisites: Verify that the node has a gateway designation and a perfmon designation by using the mmlscluster command.
Conditions: The period for at least one of the following AFM sensors is set to 0: GPFSAFM, GPFSAFMFS, GPFSAFMFSET.
Fix procedure:
• To change the period when the sensors are defined in the perfmon configuration file, use the following command:

mmperfmon config update <sensor_name>.period=<interval>

where <sensor_name> is one of the AFM sensors GPFSAFM, GPFSAFMFS, or GPFSAFMFSET, and <interval> is the time in seconds that the sensor waits to gather the different sensors' metrics again.
• To change the period when the sensors are not defined in the perfmon configuration file, create a sensors file with the following content and add it by using the mmperfmon config add command:

sensors = {
name = <sensor_name>
period = <interval>
type = "Generic"
}

mmperfmon config add --sensors <path_to_tmp_cfg_file>

• To ignore the event, use the following command:

mmhealth event hide afm_sensors_inactive

Reporting component: NFS component
Event names: nfs_sensors_inactive, nfs_sensors_active
Prerequisites: Verify that the node is NFS enabled and has a perfmon designation by using the mmlscluster command.
Conditions: The NFS sensor NFSIO has a period of 0.
Fix procedure:
• To change the period when the sensors are defined in the perfmon configuration file, use the following command:

mmperfmon config update <sensor_name>.period=<interval>

where <sensor_name> is the NFS sensor NFSIO, and <interval> is the time in seconds that the sensor waits to gather the different sensors' metrics again.
• To change the period when the sensors are not defined in the perfmon configuration file, use the following command:

mmperfmon config add --sensors /opt/IBM/zimon/defaults/GaneshaProxy.conf

• To ignore the event, use the following command:

mmhealth event hide nfs_sensors_inactive

Reporting component: SMB component
Event names: smb_sensors_inactive, smb_sensors_active
Prerequisites: Verify that the node is SMB enabled and has a perfmon designation by using the mmlscluster command.
Conditions: The period of at least one of the following SMB sensors is set to 0: SMBStats, SMBGlobalStats.
Fix procedure:
• To change the period when the sensors are defined in the perfmon configuration file, use the following command:

mmperfmon config update <sensor_name>.period=<interval>

where <sensor_name> is one of the SMB sensors SMBStats or SMBGlobalStats, and <interval> is the time in seconds that the sensor waits to gather the different sensors' metrics again.
• To change the period when the sensors are not defined in the perfmon configuration file, use the following command:

mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_smb.cfg

• To ignore the event, use the following command:

mmhealth event hide smb_sensors_inactive

Reporting component: gpfs
Event names: gpfs_maxfilestocache_small, gpfs_maxfilestocache_ok
Prerequisites: Verify that the node is in the cesNodes node class by using the mmlsnodeclass --all command.
Conditions: The actively used GPFS maxFilesToCache setting (mmdiag --config | grep maxFilesToCache) has a value smaller than or equal to 100,000.
Fix procedure:
• To change the value, use the following command:

mmchconfig maxFilesToCache=<value>; mmshutdown; mmstartup

where <value> is a value higher than 100,000.
• To ignore the event, use the following command:

mmhealth event hide gpfs_maxfilestocache_small

Reporting component: gpfs
Event names: gpfs_maxstatcache_high, gpfs_maxstatcache_ok
Prerequisites: Verify that the node is a Linux node.
Conditions: The actively used GPFS maxStatCache value (mmdiag --config | grep maxStatCache) is higher than 0.
Fix procedure:
• To change the value, use the following command:

mmchconfig maxStatCache=0; mmshutdown; mmstartup

• To ignore the event, use the following command:

mmhealth event hide gpfs_maxstatcache_high

Reporting component: gpfs
Event names: callhome_not_enabled, callhome_enabled
Prerequisites: Verify that the node is the Cluster Manager by using the mmlsmgr -c command.
Conditions: Call home is not enabled on the cluster.
Fix procedure:
• To install call home, install the gpfs.callhome-ecc-client-{version-number}.noarch.rpm package for the ECCClient on the potential call home nodes.
• To configure the call home package that is installed but not configured:
1. Issue the mmcallhome capability enable command to initialize the configuration.
2. Issue the mmcallhome info change command to add personal information.
3. Issue the mmcallhome proxy command to include a proxy if needed.
4. Issue the mmcallhome group add or mmcallhome group auto command to create call home groups.
• To enable call home once the call home package is installed and the groups are configured, issue the mmcallhome capability enable command.
For information on tip events, see “Event type and monitoring status for system health” on page 18.
Note: Because the TIP state is checked only once every hour, it might take up to an hour for the change to
be reflected in the output of the mmhealth command.

Chapter 40. Recovery procedures
You need to perform certain procedures to recover the system, minimize the impact of the reported issue,
and bring the system back to the normal operating state. These procedures re-create the system by using
saved configuration data or by restarting the affected services.

Restoring data and system configuration


You can back up and restore the configuration data for the system after preliminary recovery tasks are
completed.
You can maintain your configuration data for the system by completing the following tasks:
1. Backing up the configuration data
2. Restoring the configuration data
3. Deleting unwanted backup configuration data files
The following topics describe how to back up and restore data and configuration in the IBM
Storage Scale system:
• Restore procedure with SOBAR in IBM Storage Scale: Administration Guide
• Encryption and backup/restore in IBM Storage Scale: Administration Guide
• Backup and restore with storage pools in IBM Storage Scale: Administration Guide
• Restoring quota files in IBM Storage Scale: Administration Guide
• Failback or restore steps for object configuration in IBM Storage Scale: Administration Guide
• Recovery of cluster configuration information when no CCR backup is available in IBM Storage Scale:
Administration Guide

Automatic recovery
IBM Storage Scale recovers itself from certain issues without manual intervention.
The following automatic recovery options are available in the system:
• Failover of CES IP addresses to recover from node failures. That is, if any important service or protocol
service is broken on a node, the system changes the status of that node to Failed and moves the public
IPs to healthy nodes in the cluster.
A failover gets triggered due to the following conditions:
1. If the IBM Storage Scale monitoring service detects a critical problem in any of the CES components
such as NFS, SMB, or OBJ, then the CES state is set to FAILED and it triggers a failover.
2. If the IBM Storage Scale daemon detects a problem with the node or cluster such as expel node, or
quorum loss, then it runs callbacks and a failover is triggered.
3. The CES framework also triggers a failover during the distribution of IP addresses as specified in the
distribution policy.
• If there are any errors with the SMB and Object protocol services, the system restarts the corresponding
daemons. If restarting the protocol service daemons does not resolve the issue and the maximum retry
count is reached, the system changes the status of the node to Failed. The protocol service restarts are
logged in the event log. Issue the mmhealth node eventlog command to view the details of such
events.
If the system detects multiple problems simultaneously, it starts the recovery procedure, such as an
automatic restart, and addresses the issue of the highest priority event first. After the recovery actions
are completed for the highest priority event, the system health is monitored again and then the recovery
actions for the next priority event are started. Similarly, issues for all the events are handled based on
their priority state until all failure events are resolved or the retry count is reached. For example, if the
system has two failure events such as smb_down and ctdb_down, the ctdb service is restarted first
because the ctdb_down event has a higher priority. After the recovery actions for the ctdb_down event
are completed, the system health is monitored again. If the ctdb_down issue is resolved, then the
recovery actions for the smb_down event are started.
• For CES HDFS, there is an extra active to passive switch on top of the basic CES failover. This switch
moves the HDFS-dedicated IP addresses without affecting other protocols. For example, consider two
protocol nodes, node 1 and node 2, that have active HDFS, NFS, and SMB. If the HDFS NameNode
changes from active to standby or passive on protocol node 1, then the HDFS NameNode changes from
standby to active on protocol node 2. However, the SMB and NFS on protocol node 1 and node 2 are not
affected.

Upgrade recovery
Use this information to recover from a failed upgrade.
A failed upgrade might leave a cluster with multiple code levels. It is important to
analyze the console output to determine which nodes or components were upgraded prior to the failure and
which node or component was in the process of being upgraded when the failure occurred.
Once the problem has been isolated, a healthy cluster state must be achieved prior to continuing the
upgrade. Use the mmhealth command in addition to the mmces state show -a command to verify
that all services are up. It might be necessary to manually start services that were down when the
upgrade failed. Starting the services manually helps achieve a state in which all components are healthy
prior to continuing the upgrade.
For more information about verifying service status, see mmhealth command and mmces state show
command in IBM Storage Scale: Command and Programming Reference Guide.
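The following is only a sketch of the two checks, run from any node in the cluster:

mmhealth cluster show
mmces state show -a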

Upgrade failure recovery when using the installation toolkit


If a failure occurs during an upgrade that is being done with the installation toolkit, examine the log
at the location that is provided in the output from the installation toolkit to determine the cause. More
information on each error is provided in the output from the installation toolkit.
Certain failures cause the upgrade process to stop. In this case, the cause of the failure must be
addressed on the node on which it occurred. After the problem is addressed, the upgrade command
can be run again. Running the upgrade command again does not affect any nodes that were upgraded
successfully and the upgrade process continues from the point of failure in most cases.
Examples of failure during upgrade are:
A protocol is enabled in the configuration file, but is not running on a node
Using the mmces service list command on the node that failed highlights which process is
not running. The output from the installation toolkit also reports which component failed. Ensure
that the component is started on all nodes with mmces service start nfs | smb | obj, or
alternatively disable the protocol in the configuration by using ./spectrumscale disable nfs|
smb|object if the component was intentionally stopped.
CAUTION:
When a protocol is disabled, the protocol is stopped on all protocol nodes in the cluster and all
protocol-specific configuration data is removed.
CES cannot be resumed on a node due to CTDB version mismatch
During an upgrade run, the CES-resume operation might fail on a node because the CTDB service
startup fails. The CTDB service startup fails because the SMB version on one or more nodes is
different from the SMB version on other nodes that are forming an active CTDB cluster. In this case, do
the following steps:
1. From the upgrade logs, determine the nodes on which the SMB version is different and designate
those nodes as offline in the upgrade configuration.



2. Do an upgrade rerun to complete the upgrade.
3. Remove the offline designation of the nodes and manually resume CES on those nodes.
For more information, see Performing offline upgrade or excluding nodes from upgrade using
installation toolkit and Upgrade rerun after an upgrade failure in IBM Storage Scale: Concepts,
Planning, and Installation Guide.
Upgrade recovery for HDFS not supported
CES also supports HDFS protocols. However, upgrade recovery is not supported for HDFS protocols.
Related concepts
“Resolving most frequent problems related to installation, deployment, and upgrade” on page 339
Use the following information to resolve the most frequent problems related to installation, deployment,
and upgrade.

Recovering cluster configuration by using CCR


Different procedures can be followed for recovering from a broken CCR. The recovery actions to be
applied vary based on the use cases.
Refer to the following documentation for details about recovering cluster configuration:

Recovering from a single quorum or non-quorum node failure


A quorum node failure can happen because of various reasons. For example, a node failure might occur
when the local hard disk on the node fails and must be replaced. The old content of the /var/mmfs
directory is lost after you replace the disk and reinstall the operating system and other software, including
the IBM Storage Scale software stack.
Note: The information given in this topic can also be used for recovering from a non-quorum node failure.
The recovery procedure for this case works only if the cluster still has enough quorum nodes available,
which can be checked by using the mmgetstate -a command on one of the remaining quorum nodes.
It is assumed that the node to be recovered is configured with the same IP address as before and the
contents of the /etc/hosts file are consistent with the other remaining quorum nodes as shown in the
following example:

# mmlscluster

GPFS cluster information


========================
GPFS cluster name: gpfs-cluster-2.localnet.com
GPFS cluster id: 13445038716777501310
GPFS UID domain: gpfs-cluster-2.localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

Node Daemon node name IP address Admin node name Designation


----------------------------------------------------------------------------
1 node-21.localnet.com 10.0.100.21 node-21.localnet.com quorum-manager
2 node-22.localnet.com 10.0.100.22 node-22.localnet.com quorum-manager
3 node-23.localnet.com 10.0.100.23 node-23.localnet.com quorum-manager
4 node-24.localnet.com 10.0.100.24 node-24.localnet.com
5 node-25.localnet.com 10.0.100.25 node-25.localnet.com

To simulate this case, the entire content of /var/mmfs/ is deleted on the node node-23. The
mmgetstate command on the node to be recovered then returns the following output:

# mmgetstate
mmgetstate: This node does not belong to a GPFS cluster.
mmgetstate: Command failed. Examine previous error messages to determine cause.

The cluster has still quorum:

# mmgetstate -a

Node number Node name GPFS state
-------------------------------------
1 node-21 active
2 node-22 active
3 node-23 unknown
4 node-24 active
5 node-25 active

Run the mmccr check command on the node to be recovered as shown in the following example:

# mmccr check -Ye


mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEnti
ties:ListOfSucceedEntities:Severity:
mmccr::0:1:::-1:CCR_CLIENT_INIT:811:CCR directory or subdirectory missing:/var/mmfs/
ccr:Security:FATAL:
mmccr::0:1:::-1:FC_CCR_AUTH_KEYS:813:File does not exist:/var/mmfs/ssl/
authorized_ccr_keys::FATAL:
mmccr::0:1:::-1:FC_CCR_PAXOS_CACHED:811:CCR directory or subdirectory missing:/var/mmfs/ccr/
cached::WARNING:
mmccr::0:1:::-1:PC_QUORUM_NODES:812:ccr.nodes file missing or empty:/var/mmfs/ccr/
ccr.nodes::FATAL:
mmccr::0:1:::-1:FC_COMMITTED_DIR:812:ccr.nodes file missing or empty:/var/mmfs/ccr/
ccr.nodes::FATAL:

In this case, you can recover this node by using the mmsdrrestore command with the -p option. The
-p option must specify a healthy quorum node from which the necessary files can be transferred. The
mmsdrrestore command must run on the node to be recovered as shown in the following example:

# mmsdrrestore -p node-21
genkeyData1

100% 3529 1.8MB/s 00:00


genkeyData2

100% 3529 2.8MB/s 00:00


Wed Jul 7 14:42:16 CEST 2021: mmsdrrestore: Processing node node-23.localnet.com
mmsdrrestore: Node node-23.localnet.com successfully restored.

Immediately after the mmsdrrestore command completes, the mmgetstate command still reports
that GPFS is down. However, you can now start GPFS on the recovered node. The mmgetstate
command then shows GPFS as active as shown in the following example:

# mmgetstate

Node number Node name GPFS state


-------------------------------------
3 node-23 active

The output of the mmccr check command on the recovered node shows a healthy status as shown in the
following example:

# mmccr check -Ye


mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEnti
ties:ListOfSucceedEntities:Severity:
mmccr::0:1:::3:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,Security:OK:
mmccr::0:1:::3:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::3:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::3:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::3:PC_LOCAL_SERVER:0:::node-23.localnet.com:OK:
mmccr::0:1:::3:PC_IP_ADDR_LOOKUP:0:::node-23.localnet.com,0.000:OK:
mmccr::0:1:::3:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22,10.0.100.23:OK:
mmccr::0:1:::3:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::3:TC_TIEBREAKER_DISKS:0::::OK:

Recovering from the loss of a majority of quorum nodes


Quorum loss might happen because of hardware issues on the quorum nodes. Due to the loss of a
majority of quorum nodes, the cluster becomes inoperable.
Perform the following steps to investigate the issue with the cluster and recover CCR:



1. Issue the mmgetstate command to understand the status of the cluster. When a majority of quorum
nodes are not available, the mmgetstate command gives an output similar to the following example:

# mmgetstate -a
mmgetstate: [E] The command was unable to reach the CCR service on the majority of quorum
nodes to form CCR quorum. Ensure the CCR service (mmfsd or mmsdrserv daemon) is running on
all quorum nodes and the communication port is not blocked by a firewall.
mmgetstate: Command failed. Examine previous error messages to determine cause.

2. Issue the mmlscluster --noinit command to identify the quorum nodes in the cluster as shown in
the following example:

# mmlscluster --noinit

GPFS cluster information


========================
GPFS cluster name: gpfs-cluster-2.localnet.com
GPFS cluster id: 13445038716777501310
GPFS UID domain: gpfs-cluster-2.localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

Node Daemon node name IP address Admin node name Designation


----------------------------------------------------------------------------
1 node-21.localnet.com 10.0.100.21 node-21.localnet.com quorum-manager
2 node-22.localnet.com 10.0.100.22 node-22.localnet.com quorum-manager
3 node-23.localnet.com 10.0.100.23 node-23.localnet.com quorum-manager
4 node-24.localnet.com 10.0.100.24 node-24.localnet.com
5 node-25.localnet.com 10.0.100.25 node-25.localnet.com

3. Issue the ping command to verify whether the lost quorum nodes are reachable:

# ping -c 1 node-22.localnet.com
PING node-22.localnet.com (10.0.100.22) 56(84) bytes of data.
From node-21.localnet.com (10.0.100.21) icmp_seq=1 Destination Host Unreachable

--- node-22.localnet.com ping statistics ---


1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping -c 1 node-23.localnet.com
PING node-23.localnet.com (10.0.100.23) 56(84) bytes of data.
From node-21.localnet.com (10.0.100.21) icmp_seq=1 Destination Host Unreachable

--- node-23.localnet.com ping statistics ---


1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

4. Issue the mmccr check command on the remaining quorum node to get the details of the missing quorum
nodes and a quorum loss (809) of the CCR server, which is running on the local node:

# mmccr check -Ye


mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedE
ntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,Security:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/
ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::node-21.localnet.com:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::node-21.localnet.com,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:1143:Ping CCR quorum nodes
failed:10.0.100.22,10.0.100.23:10.0.100.21:FATAL:
mmccr::0:1:::1:FC_COMMITTED_DIR:809:Connect local CCR server failed:10.0.100.21::WARNING:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0::::OK:

5. Issue the mmchnode command with the --force option to force the system to reduce the number
of quorum nodes to the still available quorum nodes. This command takes a while and expects a
confirmation to proceed.
The --force option forces GPFS to continue running normally by using the copy of the CCR state
found on the only remaining quorum node. As CCR no longer has quorum, GPFS cannot verify whether
it is the most recent version of the CCR state. If the other two quorum nodes failed while a GPFS
command was running and some configuration data was changed during this time, then the CCR state
on the surviving quorum node might become stale or inconsistent with the state of one of the GPFS file
systems. Therefore, use this procedure only if no recent configuration change was made or if none of
the failed quorum nodes can be brought back online.

# mmchnode --noquorum -N node-22,node-23 --force


mmchnode: Unable to obtain the GPFS configuration file lock.
mmchnode: Processing continues without lock protection.
mmchnode: Entering mmchnode restricted mode of operations.
Wed Jul 7 16:44:21 CEST 2021: mmchnode: Processing node node-23.localnet.com
Wed Jul 7 16:44:21 CEST 2021: mmchnode: Processing node node-22.localnet.com
mmchnode: You are attempting to override normal GPFS quorum semantics.
This may endanger the integrity of the configuration data and prevent normal operations.
Proceed only if this cluster is part of a disaster recovery environment that is set up
according to the instructions in "Establishing disaster recovery for your GPFS cluster"
in the GPFS Advanced Administration guide and you are strictly following the failover
procedures described in that document.
Do you want to continue? (yes/no) yes
mmchnode: mmsdrfs propagation completed.
mmchnode: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
[root@node-21 ~]# Wed Jul 7 16:45:56 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation
started
Wed Jul 7 16:46:07 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh
rc=0

After the command returns successfully, the cluster is back to a working state because CCR is able to
reach quorum without the quorum nodes that are no longer available. The failed nodes are still in the
list of cluster nodes.
6. Issue the mmdelnode command as shown in the following example to remove the failed nodes:

# mmlscluster

GPFS cluster information


========================
GPFS cluster name: gpfs-cluster-2.localnet.com
GPFS cluster id: 13445038716777501310
GPFS UID domain: gpfs-cluster-2.localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

Node Daemon node name IP address Admin node name Designation


----------------------------------------------------------------------------
1 node-21.localnet.com 10.0.100.21 node-21.localnet.com quorum-manager
2 node-22.localnet.com 10.0.100.22 node-22.localnet.com manager
3 node-23.localnet.com 10.0.100.23 node-23.localnet.com manager
4 node-24.localnet.com 10.0.100.24 node-24.localnet.com
5 node-25.localnet.com 10.0.100.25 node-25.localnet.com

# mmgetstate -a

Node number Node name GPFS state


-------------------------------------
1 node-21 active
2 node-22 unknown
3 node-23 unknown
4 node-24 active
5 node-25 active

# mmdelnode -N node-22,node-23 --force


Verifying GPFS is stopped on all affected nodes ...
mmdsh: There are no available nodes on which to run the command.
mmdelnode: Unable to confirm that GPFS is stopped on all of the affected nodes.
Nodes should not be removed from the cluster if GPFS is still running.
Make sure GPFS is down on all affected nodes before continuing. If not,
this may cause a cluster outage.
Do you want to continue? (yes/no) yes
mmdelnode: Removing GPFS system files on all deleted nodes ...
mmdelnode: [W] Could not cleanup the following unreached nodes:
node-23.localnet.com
node-22.localnet.com
mmdelnode: Command successfully completed
QOS configuration has been installed and broadcast to all nodes.
mmdelnode: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
Wed Jul 7 18:08:55 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation started

# mmlscluster

GPFS cluster information


========================
GPFS cluster name: gpfs-cluster-2.localnet.com
GPFS cluster id: 13445038716777501310
GPFS UID domain: gpfs-cluster-2.localnet.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

Node Daemon node name IP address Admin node name Designation


----------------------------------------------------------------------------
1 node-21.localnet.com 10.0.100.21 node-21.localnet.com quorum-manager
4 node-24.localnet.com 10.0.100.24 node-24.localnet.com
5 node-25.localnet.com 10.0.100.25 node-25.localnet.com

You can also use the mmhealth node show command instead of the mmlscluster --noinit
command to get the list of quorum nodes. The mmhealth node show command provides the status
of the IBM Storage Scale components as shown in the following example:

# mmhealth node show

Node name: node-21.localnet.com


Node status: DEGRADED
Status Change: 7 min. ago

Component Status Status Change Reasons


---------------------------------------------------------------------------------------------
-
FILESYSMGR HEALTHY 22 hours ago -
GPFS DEGRADED 7 min. ago ccr_local_server_warn, quorum_down,
ccr_quorum_nodes_fail
NETWORK HEALTHY 14 days ago -
FILESYSTEM FAILED 8 min. ago stale_mount(gpfs0)
DISK HEALTHY 14 days ago -

In addition, the mmhealth node show <COMPONENT> -v --unhealthy command lists more details about
the specified component. You can find the IP addresses of the unavailable quorum nodes from the
command output:

# mmhealth node show GPFS -v --unhealthy

Node name: node-21.localnet.com

Component Status Status Change Reasons


---------------------------------------------------------------------------------------------
-
GPFS DEGRADED 2021-07-23 13:53:11 ccr_local_server_warn, quorum_down,
ccr_quorum_nodes_fail

Event Parameter Severity Active Since Event Message


---------------------------------------------------------------------------------------------
-
ccr_local_server_warn GPFS WARNING 2021-07-23 14:14:04 The local GPFS
CCR server is not reachable Item=PC_LOCAL_SERVER,ErrMsg='Ping local CCR server
failed',Failed='node-21.localnet.com'
quorum_down GPFS ERROR 2021-07-23 13:53:11 The node is not
able to reach enough quorum nodes/disks to work properly.
ccr_quorum_nodes_fail GPFS ERROR 2021-07-23 13:58:54 A majority of
the quorum nodes are not reachable over the management network
Item=PC_QUORUM_NODES,ErrMsg='Ping CCR quorum nodes failed',Failed='10.0.100.22;10.0.100.23'

Recovering from damage or loss of the CCR on all quorum nodes


This scenario occurs when files get corrupted in the CCR directory on all quorum nodes in the cluster.
Perform the following steps to investigate the issue and recover CCR on all quorum nodes:
1. Check the CCR status as shown in the following example:



# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedE
ntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,Security:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/
ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::node-21.localnet.com:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::node-21.localnet.com,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22,10.0.100.23:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or
corrupted:1:6:WARNING:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0::::OK:

2. Issue the mmsdrrestore command with the --ccr-repair option to repair CCR. A sample output is
as follows:

# mmsdrrestore --ccr-repair
mmsdrrestore: Checking CCR on all quorum nodes ...
mmsdrrestore: Invoking CCR restore in dry run mode ...

ccrrestore: +++ DRY RUN: CCR state on quorum nodes will not be restored +++
ccrrestore: 1/8: Test tool chain successful
ccrrestore: 2/8: Setup local working directories successful
ccrrestore: 3/8: Copy Paxos state files from quorum nodes successful
ccrrestore: 4/8: Getting most recent Paxos state file successful
ccrrestore: 5/8: Get cksum of files in committed directory successful
ccrrestore: 6/8: WARNING: Intact ccr.nodes file with version 5 missing in committed
directory
ccrrestore: 6/8: INFORMATION: Intact ccr.disks found (file id: 2 version: 1)
ccrrestore: 6/8: INFORMATION: Intact mmLockFileDB found (file id: 3 version: 1)
ccrrestore: 6/8: INFORMATION: Intact genKeyData found (file id: 4 version: 1)
ccrrestore: 6/8: INFORMATION: Intact genKeyDataNew found (file id: 5 version: 2)
ccrrestore: 6/8: INFORMATION: Intact mmsdrfs found (file id: 6 version: 23)
ccrrestore: 6/8: INFORMATION: Intact mmsysmon.json found (file id: 7 version: 1)
ccrrestore: 6/8: Parsing committed file list successful
ccrrestore: 7/8: Pulling committed files from quorum nodes successful
ccrrestore: 8/8: File name: 'ccr.nodes' file state: UPDATED remark: 'OLD (v5,
((n1,e6),103), f20ea9e3)'
ccrrestore: 8/8: File name: 'ccr.disks' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'mmLockFileDB' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'genKeyData' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'genKeyDataNew' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'mmsdrfs' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'mmsysmon.json' file state: MATCHING remark: 'none'
ccrrestore: 8/8: Patching Paxos state successful

mmsdrrestore: Review the dry run report above to see what will be changed and decide if you
want to continue the restore or not. Do you want to continue? (yes/no) yes
ccrrestore: 1/14: Test tool chain successful
ccrrestore: 2/14: Test GPFS shutdown successful
ccrrestore: 3/14: Setup local working directories successful
ccrrestore: 4/14: Archiving CCR directories on quorum nodes successful
ccrrestore: 5/14: Kill GPFS mmsdrserv daemon successful
ccrrestore: 6/14: Copy Paxos state files from quorum nodes successful
ccrrestore: 7/14: Getting most recent Paxos state file successful
ccrrestore: 8/14: Get cksum of files in committed directory successful
ccrrestore: 9/14: WARNING: Intact ccr.nodes file with version 5 missing in committed
directory
ccrrestore: 9/14: INFORMATION: Intact ccr.disks found (file id: 2 version: 1)
ccrrestore: 9/14: INFORMATION: Intact mmLockFileDB found (file id: 3 version: 1)
ccrrestore: 9/14: INFORMATION: Intact genKeyData found (file id: 4 version: 1)
ccrrestore: 9/14: INFORMATION: Intact genKeyDataNew found (file id: 5 version: 2)
ccrrestore: 9/14: INFORMATION: Intact mmsdrfs found (file id: 6 version: 23)
ccrrestore: 9/14: INFORMATION: Intact mmsysmon.json found (file id: 7 version: 1)
ccrrestore: 9/14: Parsing committed file list successful
ccrrestore: 10/14: Pulling committed files from quorum nodes successful
ccrrestore: 11/14: File name: 'ccr.nodes' file state: UPDATED remark: 'OLD (v5,
((n1,e6),103), f20ea9e3)'
ccrrestore: 11/14: File name: 'ccr.disks' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'mmLockFileDB' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'genKeyData' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'genKeyDataNew' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'mmsdrfs' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'mmsysmon.json' file state: MATCHING remark: 'none'
ccrrestore: 11/14: Patching Paxos state successful
ccrrestore: 12/14: Pushing CCR files successful
ccrrestore: 13/14: Started GPFS mmsdrserv daemon successful
ccrrestore: 14/14: Ping GPFS mmsdrserv daemon successful

3. Issue the mmccr check command as shown in the following example to check the status of the CCR:

# mmccr check -Ye


mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedE
ntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,Security:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/
ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::node-21.localnet.com:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::node-21.localnet.com,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22,10.0.100.23:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0::::OK:

Important: The CCR restore script recovers the CCR from the fragments of CCR configuration files
that are available on the cluster nodes. The recovered CCR might contain the details of an old
cluster configuration. If a recent backup is available, it might be better to use that backup, even if
mmsdrrestore --ccr-repair is able to restore from available fragments.

Recovering from an existing CCR backup


When you have a CCR backup file, you can recover a cluster by using that backup file. The CCR backup file
contains all files that are committed to the CCR at the time the mmccr backup command is launched.
When you restore the cluster from a CCR backup file, the cluster runs based on the configuration in this
backup file. Therefore, it is recommended to create a CCR backup file at regular intervals or at least after
important cluster configuration changes.
Note: The mmsdrbackup user exit can be used to create CCR backups automatically every time the
mmsdrfs file changes. The mmsdrfs file contains basic cluster and file system configuration information.
For more information, see mmsdrbackup command in IBM Storage Scale: Command and Programming
Reference Guide.
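The following is only a sketch of enabling the user exit; it assumes that the sample script is shipped under /usr/lpp/mmfs/samples, which might differ in your installation:

cp /usr/lpp/mmfs/samples/mmsdrbackup.sample /var/mmfs/etc/mmsdrbackup
chmod +x /var/mmfs/etc/mmsdrbackup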
Perform the following steps to create a CCR backup and to restore the cluster from it:
1. Issue the mmccr backup to create the CCR backup as shown in the following example:

# mmccr backup -A /root/CCRbackup_20210708115924.tar.gz


CCR archive stored under: '/root/CCRbackup_20210708115924.tar.gz

Note: In this example, the entire /var/mmfs directory is deleted on all nodes in the cluster to
simulate this case.
2. Issue the mmsdrrestore command with -F and -a options as shown in the following example to
restore backup:

# mmsdrrestore -F /root/CCRbackup_20210708115924.tar.gz -a

The system displays output similar to this:

Restoring CCR backup


Verifying that GPFS is inactive on quorum nodes
Node node-25.localnet.com was not recovered because it is not available
Node node-24.localnet.com was not recovered because it is not available
When the unreached nodes are available, they can be recovered with:
mmsdrrestore -p node-21.localnet.com
mmsdrrestore: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
CCR backup has been restored
[root@node-21 ~]# Thu Jul 8 16:00:55 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation
started
Thu Jul 8 16:01:00 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh
rc=0



3. Issue the mmsdrrestore command with the -p option to retrieve the backup on the non-quorum
nodes. The -p option must specify a healthy quorum node from which the necessary files can be
transferred. The mmsdrrestore command must run on the node to be recovered as shown in the
following example:

# mmsdrrestore -p node-21
genkeyData1

100% 3529 2.0MB/s 00:00


genkeyData2

100% 3529 3.3MB/s 00:00


Thu Jul 8 16:03:46 CEST 2021: mmsdrrestore: Processing node node-24.localnet.com
mmsdrrestore: Node node-24.localnet.com successfully restored.

# mmsdrrestore -p node-21
genkeyData1

100% 3529 4.0MB/s 00:00


genkeyData2

100% 3529 4.1MB/s 00:00


Thu Jul 8 16:05:04 CEST 2021: mmsdrrestore: Processing node node-25.localnet.com
mmsdrrestore: Node node-25.localnet.com successfully restored.

4. Verify the GPFS state at the cluster level by using the mmgetstate command as shown in the
following example:

# mmgetstate -a

Node number Node name GPFS state


-------------------------------------
1 node-21 active
4 node-24 active
5 node-25 active
6 node-22 active
7 node-23 active

Note: Depending on the age of the CCR backup file that is used, the cluster might be recovered to an old
cluster configuration. It is recommended to take regular backups of the CCR.

Repair of cluster configuration information when no CCR backup is available
The following procedures describe how to repair missing or corrupted cluster configuration information,
when no intact CCR can be found on the quorum nodes and no CCR backup is available from which the
broken cluster can be recovered.
These procedures do not guarantee to recover the most recent state of all the configuration files in the
CCR. Instead, they bring the CCR back into a consistent state with the most recent available version of
each configuration file.
All of the procedures include the following major steps:
1. Diagnose if the CCR is broken on a majority of the quorum nodes.
2. Evaluate the CCR's most recent Paxos state file.
3. Patch the CCR's Paxos state file.
4. Verify that the CCR state is intact after patching the Paxos state file and copying back to the CCR
directory.
In most cases the mmsdrrestore --ccr-repair command restores missing or corrupted CCR files as
well as or better than the manual procedures.



Repair of cluster configuration information when no CCR backup information
is available: mmsdrrestore command
When no CCR backup information is available, you can repair missing or corrupted CCR files with the
mmsdrrestore command and the --ccr-repair parameter.
The mmsdrrestore command with the --ccr-repair parameter can repair cluster configuration
information in a cluster in which no intact CCR state can be found on the quorum nodes and no CCR
backup is available from which the cluster configuration information can be recovered. In this state the
CCR committed directory of all of the quorum nodes has corrupted or lost files. For more information,
see the topic mmsdrrestore command in the IBM Storage Scale: Command and Programming Reference
Guide.
This procedure does not guarantee to recover the most recent state of all the configuration files in the
CCR. Instead, it brings the CCR back into a consistent state with the most recent available version of each
configuration file.
For an example of running the mmsdrrestore command with the --ccr-repair parameter, see the
topic mmsdrrestore command in the IBM Storage Scale: Command and Programming Reference Guide.
The following events can cause missing or corrupted files in the CCR committed directory of all of the
quorum nodes:
• All the quorum nodes crash or lose power at the same time, and on each quorum node one or more files
are corrupted or lost in the CCR committed directory after the quorum nodes started up.
• The local disks of all the quorum nodes have a power loss, and on each quorum node a file is corrupted
or lost in the CCR committed directory after the local disks came back.
The following error messages are possible indicators of corrupted or lost files in the CCR committed
directory. If a command displays one of these error messages, follow the instructions in the "User
response" section of the error message before you try to run the mmsdrrestore --ccr-repair
command:
• “6027-4200 [E]” on page 917
• “6027-4205 [E]” on page 918
The following error messages are indicators of other CCR problems. Consider reading and following
the instructions in the "User response" sections of these error messages before you try to run the
mmsdrrestore --ccr-repair command:
• “6027-4204 [E]” on page 918
• “6027-4206 [E]” on page 918
• “6027-4207 [E]” on page 919

Checking for corrupted or lost files in the CCR committed directory


To determine whether the CCR committed directory of a quorum node has corrupted or lost files, issue
the following command from the node:

mmccr check -Y -e

In the following example, the next-to-last line of the output indicates that one or more files are corrupted
or lost in the CCR committed directory of the current node:

# mmccr check -Y -e
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:
ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,
Security,/var/mmfs/ccr/ccr.disks:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::c80f5m5n01.gpfs.net:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::c80f5m5n01.gpfs.net,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::192.168.80.181,192.168.80.182:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:6:WARNING:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0:::1:OK:

In the following example, the next-to-last line indicates that none of the files in the CCR committed
directory of the current node are corrupted or lost:

# mmccr check -Y -e
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:
ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,
Security,/var/mmfs/ccr/ccr.disks:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::c80f5m5n01.gpfs.net:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::c80f5m5n01.gpfs.net,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::192.168.80.181,192.168.80.182:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0:::1:OK:
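Because every quorum node must be checked, it can be convenient to run the check on all quorum nodes in one step. A possible approach, assuming the mmdsh utility and the standard quorumnodes node class are available in your environment, is:

mmdsh -v -N quorumnodes "mmccr check -Y -e"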



Chapter 41. Support for troubleshooting
This topic describes the support that is available for troubleshooting any issues that you might encounter
while using IBM Storage Scale.

Contacting IBM support center


Specific information about a problem, such as symptoms, traces, error logs, GPFS logs, and file system
status, is vital to IBM in order to resolve a GPFS problem.
Obtain this information as quickly as you can after a problem is detected, so that error logs do not
wrap and system parameters that are constantly changing are captured as close to the point of failure
as possible. When a serious problem is detected, collect this information and then call IBM. For more
information, see:
• “Information to be collected before contacting the IBM Support Center” on page 555
• “How to contact the IBM Support Center” on page 557

Information to be collected before contacting the IBM Support Center


For effective communication with the IBM Support Center to help with problem diagnosis, you need to
collect certain information.

Information to be collected for all problems related to GPFS


Regardless of the problem encountered with GPFS, the following data should be available when you
contact the IBM Support Center:
1. A description of the problem.
2. Output of the failing application, command, and so forth.
3. A tar file generated by the gpfs.snap command that contains data from the nodes in the cluster.
In large clusters, the gpfs.snap command can collect data from certain nodes (for example, the
affected nodes, NSD servers, or manager nodes) using the -N option.
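For example, assuming placeholder node names, collection can be limited to the affected nodes and their NSD servers with a command of the following form:

gpfs.snap -N affectednode1,nsdserver1,nsdserver2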
If the gpfs.snap command cannot be run, then collect these items:
a. Any error log entries relating to the event:
• On an AIX node, issue this command:

errpt -a

• On a Linux node, create a tar file of all the entries in the /var/log/messages file from all
nodes in the cluster or the nodes that experienced the failure. For example, issue the following
command to create a tar file that includes all nodes in the cluster:

mmdsh -v -N all "cat /var/log/messages" > all.messages

• On a Windows node, use the Export List... dialog in the Event Viewer to save the event log to a
file.
b. A master GPFS log file that is merged and chronologically sorted for the date of the failure (see
“Creating a master GPFS log file” on page 258).
c. If the cluster was configured to store dumps, then collect any internal GPFS dumps written to that
directory relating to the time of the failure. The default directory is /tmp/mmfs.



d. On a failing Linux node, gather the installed software packages and the versions of each package by
issuing this command:

rpm -qa

e. On a failing AIX node, gather the name, most recent level, state, and description of all installed
software packages by issuing this command:

lslpp -l

f. File system attributes for all of the failing file systems. To collect them, issue:

mmlsfs Device

g. The current configuration and state of the disks for all of the failing file systems. To collect them, issue:

mmlsdisk Device

h. A copy of file /var/mmfs/gen/mmsdrfs from the primary cluster configuration server.


4. For Linux on Z, collect the data of the operating system as described in the Linux on z Systems®
Troubleshooting Guide.
5. If you are experiencing one of the following problems, then see the appropriate section before
contacting the IBM Support Center:
• For delay and deadlock issues, see “Additional information to collect for delays and deadlocks” on
page 556.
• For file system corruption or MMFS_FSSTRUCT errors, see “Additional information to collect for file
system corruption or MMFS_FSSTRUCT errors” on page 556.
• For GPFS daemon crashes, see “Additional information to collect for GPFS daemon crashes” on page
557.

Additional information to collect for delays and deadlocks


When a delay or deadlock situation is suspected, the IBM Support Center needs additional information
to assist with problem diagnosis. If you have not done so already, ensure that you have the following
information available before contacting the IBM Support Center:
1. Everything that is listed in “Information to be collected for all problems related to GPFS” on page 555.
2. The deadlock debug data collected automatically.
3. If the cluster size is relatively small and the maxFilesToCache setting is not high (less than 10,000),
then issue the following command:

gpfs.snap --deadlock

If the cluster size is large or the maxFilesToCache setting is high (greater than 1M), then issue the
following command:

gpfs.snap --deadlock --quick

Additional information to collect for file system corruption or MMFS_FSSTRUCT errors


When file system corruption or MMFS_FSSTRUCT errors are encountered, the IBM Support Center needs
additional information to assist with problem diagnosis. If you have not done so already, then ensure you
have the following information available before contacting the IBM Support Center:
1. Everything that is listed in “Information to be collected for all problems related to GPFS” on page 555.
2. Unmount the file system everywhere, then run mmfsck -n in offline mode and redirect it to an output
file.
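For example, where fs1 is a placeholder for the failing file system device and the output path is arbitrary, after the file system is unmounted on all nodes:

mmfsck fs1 -n > /tmp/mmfsck.fs1.out 2>&1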



The IBM Support Center determines when you should run the mmfsck -y command.

Additional information to collect for GPFS daemon crashes


When the GPFS daemon is repeatedly crashing, the IBM Support Center needs additional information
to assist with problem diagnosis. If you have not done so already, ensure that you have the following
information available before contacting the IBM Support Center:
1. Everything that is listed in “Information to be collected for all problems related to GPFS” on page 555.
2. Ensure the /tmp/mmfs directory exists on all nodes. If this directory does not exist, then the GPFS
daemon does not generate internal dumps.
3. Set the traces on this cluster and all clusters that mount any file system from this cluster:

mmtracectl --set --trace=def --trace-recycle=global

4. Start the trace facility by issuing:

mmtracectl --start

5. Recreate the problem when possible or wait for the assert to be triggered again.
6. Once the assert is encountered on the node, turn off the trace facility by issuing:

mmtracectl --off

If traces were started on multiple clusters, then issue the mmtracectl --off command immediately
on all clusters.
7. Collect gpfs.snap output:

gpfs.snap

How to contact the IBM Support Center


The IBM Support Center is available for various types of IBM hardware and software problems that GPFS
customers may encounter.
These problems include the following:
• IBM hardware failure
• Node halt or crash not related to a hardware failure
• Node hang or response problems
• Failure in other software supplied by IBM
If you have an IBM Software Maintenance service contract
If you have an IBM Software Maintenance service contract, contact the IBM Support Center, as
follows:

Your location | Method of contacting the IBM Support Center
In the United States | Call 1-800-IBM-SERV for support.
Outside the United States | Contact your local IBM Support Center or see the Directory of worldwide contacts (www.ibm.com/planetwide).

When you contact the IBM Support Center, the following will occur:
1. You will be asked for the information you collected in “Information to be collected before
contacting the IBM Support Center” on page 555.



2. You will be given a time period during which an IBM representative will return your call. Be sure
that the person you identified as your contact can be reached at the phone number you provided in
the PMR.
3. An online Problem Management Record (PMR) will be created to track the problem you are
reporting, and you will be advised to record the PMR number for future reference.
4. You may be requested to send data related to the problem you are reporting, using the PMR
number to identify it.
5. Should you need to make subsequent calls to discuss the problem, you will also use the PMR
number to identify the problem.
If you do not have an IBM Software Maintenance service contract
If you do not have an IBM Software Maintenance service contract, contact your IBM sales
representative to find out how to proceed. Be prepared to provide the information you collected
in “Information to be collected before contacting the IBM Support Center” on page 555.
For failures in non-IBM software, follow the problem-reporting procedures provided with that product.

Call home notifications to IBM Support


The call home feature automatically notifies IBM Support if certain types of events occur in the
system. Using this information, IBM Support can contact the system administrator in case of any issues.
Configuring call home reduces the response time for IBM Support to address the issues.
The details are collected from individual nodes that are marked as call home child nodes in the cluster.
The details from each child node are collected by the call home node. You need to create a call home
group by grouping call home child nodes. One of the nodes in the group is configured as the call home
node, and it performs data collection and upload.
The data gathering and upload can be configured individually for each group. Use the groups to reflect
logical units in the cluster. For example, management is easier if you create one group for all CES nodes
and another group for all non-CES nodes.
You can also use the Settings > Call Home page in the IBM Storage Scale management GUI to configure
the call home feature. For more information on configuring call home using GUI, see the Configuring call
home using GUI section in the IBM Storage Scale: Administration Guide.
For more information on how to configure and manage the call home feature, see Chapter 11, “Monitoring
the IBM Storage Scale system by using call home,” on page 227.
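As a hedged command-line sketch only (the customer values and the order of the steps are illustrative; the referenced chapter describes the full procedure and any prompts), a basic configuration typically looks similar to the following:

mmcallhome info change --customer-name "Example Corp" --customer-id 1234567 --email admin@example.com --country-code US
mmcallhome group auto
mmcallhome capability enable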



Chapter 42. References
The IBM Storage Scale system displays messages if it encounters any issues when you configure the
system. The message severity tags help to assess the severity of the issue.

Events
The recorded events are stored in the local database on each node. The user can get a list of recorded
events by using the mmhealth node eventlog command. Users can use the mmhealth node
show or mmhealth cluster show commands to display the active events in the node and cluster
respectively.
The recorded events can also be displayed through the GUI.
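For example, the following commands list the recorded events on the local node and then the currently active events for the node and for the cluster; mmhealth event show with an event name (here, one of the events listed in the tables below) displays the details of a single event:

mmhealth node eventlog
mmhealth node show
mmhealth cluster show
mmhealth event show afm_cache_dropped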
When you upgrade to IBM Storage Scale 5.0.5.3 or a later version, the nodes where no sqlite3 package
is installed have their RAS event logs converted to a new database format to prevent known issues. The
old RAS event log is emptied automatically. You can verify that the event log is emptied either by using the
mmhealth node eventlog command or in the IBM Storage Scale GUI.
Note: The event logs are updated only the first time IBM Storage Scale is upgraded to version 5.0.5.3 or
higher.
The following sections list the RAS events that are applicable to various components of the IBM Storage
Scale system:

AFM events
The following table lists the events that are created for the AFM component.
Table 72. Events for the AFM component

Event | Event Type | Severity | Call Home | Details

afm_cache_disconnected STATE_CHANGE WARNING no Message: Fileset {0} is disconnected.

Description: The AFM cache fileset is not connected to its home server.

Cause: Shows that the connectivity between the AFM Gateway and the
mapped home server is lost.

User Action: The user action is based on the source of the disconnectivity.
Check the settings on both home and cache sites and correct the
connectivity issues. The state automatically changes to ACTIVE state after
solving the issues.

afm_cache_dropped STATE_CHANGE ERROR no Message: Fileset {0} is in the DROPPED state.

Description: The AFM cache fileset state moves to the DROPPED state.

Cause: An AFM cache fileset state moves to DROPPED due to different
reasons, such as recovery failures or failback failures.

User Action: There are many reasons that can cause the cache to go to
DROPPED state. For more information, see the Monitoring fileset states for
AFM (DR) section in the IBM Storage Scale: Problem Determination Guide.

afm_cache_expired INFO ERROR no Message: Fileset {0} in {1} mode is now in the EXPIRED state.

Description: Cache contents are no longer accessible due to expiration of


time.

Cause: Cache contents are no longer accessible due to expiration of time.

User Action: Check the network connectivity to the home server as well as
the home server availability.


afm_cache_inactive STATE_CHANGE INFO no Message: The AFM cache fileset {0} is in the INACTIVE state.

Description: The AFM fileset is in the INACTIVE state until initial


operations on the fileset are triggered by the user.

Cause: N/A

User Action: N/A

afm_cache_recovery STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the RECOVERY state.

Description: In this state, the AFM cache fileset recovers from a previous
failure and identifies changes that need to be synchronized to its home
server.

Cause: A previous failure triggered a cache recovery.

User Action: This state automatically changes back to ACTIVE when the
recovery is finished.

afm_cache_stopped STATE_CHANGE WARNING no Message: The AFM fileset {0} is stopped.

Description: The AFM cache fileset is stopped.

Cause: The AFM cache fileset is in the Stopped state.

User Action: Run the mmafmctl restart command to continue


operations on the fileset.

afm_cache_suspended STATE_CHANGE WARNING no Message: AFM fileset {0} is suspended.

Description: The AFM cache fileset is suspended.

Cause: The AFM cache fileset is in the Suspended state.

User Action: Run the mmafmctl resume command to resume operations


on the fileset.

afm_cache_unmounted STATE_CHANGE ERROR no Message: The AFM cache fileset {0} is in unmounted state.

Description: The AFM cache fileset is in an Unmounted state because of


issues on the home site.

Cause: The AFM cache fileset is in this state when either the home server's
NFS-mount is not accessible, home server's exports are not exported
properly, or home server's export does not exist.

User Action: Resolve issues on the home server's site. Afterwards, this
state changes automatically.

afm_cache_up STATE_CHANGE INFO no Message: An 'Active' or 'Dirty' status is expected in the mmdiag --afm
command output, and the output shows that the cache is in a HEALTHY
state.

Description: The AFM cache is up and ready for operations.

Cause: N/A

User Action: N/A

afm_cmd_requeued STATE_CHANGE WARNING no Message: Messages are requeued on the AFM fileset {0}. Details: {1}.

Description: Triggered during replication when messages are queued up


again because of errors. These messages are retried after 15 minutes.

Cause: Callback afmCmdRequeued is being processed.

User Action: It is usually a transient state. Track this event. If the problem
remains, then contact IBM Support.

afm_event_connected STATE_CHANGE INFO no Message: The AFM node {0} has regained connection to the home site.
Details: {1}.

Description: Triggered when a gateway node connects to the afmTarget of


the fileset that it is serving.

Cause: N/A

User Action: N/A


afm_event_disconnected STATE_CHANGE ERROR no Message: The AFM node {0} has lost connection to the home site. Fileset
{1}. Details: {2}.

Description: Triggered when a gateway node gets disconnected from the


afmTarget of the fileset that it is serving.

Cause: Callback afmHomeDisconnected is being processed.

User Action: Check the network connectivity to the home server as well as
the home server availability.

afm_failback_complete STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the
FailbackCompleted state.

Description: The independent writer failback is finished.

Cause: The independent writer failback is finished and needs further user
actions.

User Action: The administrator must run the mmafmctl failback --


stop command to move the IW cache to the ACTIVE state.

afm_failback_needed STATE_CHANGE ERROR no Message: The AFM cache fileset {0} in {1} mode is in the NeedFailback
state.

Description: A previous failback operation could not be completed and


needs to be re-run.

Cause: This state is reached when a previously initialized failback was


interrupted and not completed.

User Action: Failback automatically gets triggered on the fileset. The


administrator can manually re-run a failback by using the mmafmctl
failback command.

afm_failback_running STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the
FailbackInProgress state.

Description: A failback process on the independent writer cache is in-


progress.

Cause: A failback process has been initiated on the independent writer


cache and is in-progress.

User Action: No user action is needed at this point. After completion, the
state automatically changes to the FailbackCompleted state.

afm_failover_running STATE_CHANGE WARNING no Message: The AFM cache fileset {0} is in FailoverInProgress state.

Description: The AFM cache fileset is in the middle of a failover process.

Cause: The AFM cache fileset is in the middle of a failover process.

User Action: No user action is needed at this point. The cache state is
moved automatically to the ACTIVE state when the failover is completed.

afm_fileset_changed INFO INFO no Message: AFM fileset {0} is changed.

Description: An AFM fileset is changed.

Cause: N/A

User Action: N/A

afm_fileset_created INFO INFO no Message: AFM fileset {0} is created.

Description: An AFM fileset is created.

Cause: N/A

User Action: N/A

afm_fileset_deleted INFO INFO no Message: AFM fileset {0} is deleted.

Description: An AFM fileset is deleted.

Cause: N/A

User Action: N/A


afm_fileset_expired INFO WARNING no Message: The contents of the AFM cache fileset {0} are expired.

Description: The AFM cache fileset contents are expired.

Cause: The contents of a fileset expire either as a result of the fileset being
disconnected for the expiration timeout value or when the fileset is marked
as expired using the AFM administration commands. This event is triggered
through an AFM callback.

User Action: Check why the fileset is disconnected to refresh the contents.

afm_fileset_found INFO_ADD_ENTITY INFO no Message: The AFM fileset {0} was found.

Description: An AFM fileset was detected.

Cause: An AFM fileset was detected through the appearance of the fileset
in the mmdiag --afm command output.

User Action: N/A

afm_fileset_linked INFO INFO no Message: AFM fileset {0} is linked.

Description: An AFM fileset is linked.

Cause: N/A

User Action: N/A

afm_fileset_unexpired INFO INFO no Message: The contents of the AFM cache fileset {0} are unexpired.

Description: The contents of the AFM cache filesets did not expire,
and available for operations. This event is triggered when the home is
reconnected, and cache contents are available, or the administrator runs
the mmafmctl unexpire command on the cache fileset. This event is
triggered through an AFM callback.

Cause: N/A

User Action: N/A

afm_fileset_unlinked INFO INFO no Message: AFM fileset {0} is unlinked.

Description: An AFM fileset is unlinked.

Cause: N/A

User Action: N/A

afm_fileset_unmounted STATE_CHANGE ERROR no Message: The AFM fileset {0} was unmounted because the remote side is
not reachable. Details: {1}.

Description: Triggered when the fileset is moved to an Unmounted state


because NFS server is not reachable or remote cluster mount is not
available for GPFS Native protocol.

Cause: After 300 seconds, the cache retries to connect to home, and it
moves to the Active state. If AFM is using the native GPFS protocol as
target, the cache state is moved to the Unmounted state because the local
mount of the remote file system is not accessible.

User Action: Remount the remote file system on the local cache cluster.

afm_fileset_vanished INFO_DELETE_ENTITY INFO no Message: The AFM fileset {0} has vanished.

Description: An AFM fileset is not in use anymore.

Cause: The AFM fileset is not in use anymore. This is detected through the
absence of the fileset in the mmdiag --afm command output.

User Action: N/A

afm_flush_only INFO INFO no Message: The AFM cache fileset {0} is in the FlushOnly state.

Description: Indicates that operations are queued, but have not started to
flush to the home server.

Cause: N/A

User Action: N/A


afm_home_connected STATE_CHANGE INFO no Message: The AFM fileset {0} has regained connection to the home site.
Details: {1}.

Description: Callback afmHomeConnected is being processed. This is a


healthy state.

Cause: N/A

User Action: N/A

afm_home_disconnected STATE_CHANGE ERROR no Message: The AFM fileset {0} has lost connection to the home site. Details:
{1}.

Description: Triggered when a gateway node gets disconnected from the


afmTarget of the fileset that it is serving.

Cause: Callback afmHomeDisconnected is being processed.

User Action: Check the network connectivity to the home server as well as
the home server availability.

afm_pconflicts_empty STATE_CHANGE INFO no Message: .pconflicts is healthy.

Description: Clear TIPS events from AFM. The .pconflict directory is clean.

Cause: N/A

User Action: N/A

afm_pconflicts_storage TIP TIP no Message: The fileset {0} .pconflicts directory contains user data. Examine
and remove unused files to free storage.

Description: AFM detected conflicting file changes and stored the


conflicting file(s) in .pconflict directory.

Cause: .pconflicts directory is not empty.

User Action: Analyze contents of .pconflicts directories to free up storage.

afm_prim_init_fail STATE_CHANGE ERROR no Message: The AFM cache fileset {0} is in the PrimInitFail state.

Description: The AFM cache fileset is in the PrimInitFail state. No data is


moved from the primary to the secondary fileset.

Cause: This rare state appears if the initial creation of psnap0 on the
primary cache fileset failed.

User Action: Check whether the fileset is available and exported to be used
as primary. The gateway node should be able to access this mount and the
primary ID should be setup on the secondary gateway. You may try running
the mmafmctl convertToPrimary command on the primary fileset again.

afm_prim_init_running STATE_CHANGE WARNING no Message: The AFM primary cache fileset {0} is in the PrimInitProg state.

Description: The AFM cache fileset is synchronizing psnap0 with its


secondary AFM cache fileset.

Cause: This AFM cache fileset is a primary fileset and synchronizing the
content of psnap0 to the secondary AFM cache fileset.

User Action: This state changes back to 'Active' automatically when the
synchronization is finished.

afm_ptrash_empty STATE_CHANGE INFO no Message: .ptrash is healthy.

Description: Clear TIPS events from AFM. The .ptrash directory is clean.

Cause: N/A

User Action: N/A

afm_ptrash_storage TIP TIP no Message: The fileset {0} .ptrash directory contains user data. Examine and
remove unused files to free storage.

Description: .ptrash directory is not empty.

Cause: .ptrash directory is not empty.

User Action: Analyze contents of .ptrash directories to free up storage.


afm_queue_dropped STATE_CHANGE ERROR no Message: The AFM cache fileset {0} encountered an error synchronizing
with its remote cluster. Details: {1}.

Description: The AFM cache fileset encountered an error synchronizing


with its remote cluster. It cannot synchronize with the remote cluster until
AFM recovery is executed.

Cause: This event occurs when a queue is dropped on the gateway node.

User Action: Initiate IO to trigger recovery on this fileset.

afm_queue_only STATE_CHANGE INFO no Message: The changes of AFM cache fileset {0} in {1} mode are not flushed
yet to home.

Description: This state is applicable for SW/IW caches. A cache fileset is


moved to QueueOnly when operations at the cache are queued but not
yet flushed. This can happen in states such as recovery, resync, failover
when the queue is in the process of getting flushed to home. This state is
temporary and the user can continue normal activity.

Cause: N/A

User Action: N/A

afm_recovery_failed STATE_CHANGE ERROR no Message: AFM recovery on fileset {0} failed with error {1}.

Description: AFM recovery has failed.

Cause: AFM recovery has failed.

User Action: Recovery is retried on next access after the recovery retry
interval. Alternatively, you can manually resolve known problems and
recover the fileset.

afm_recovery_finished STATE_CHANGE INFO no Message: A recovery process ended for the AFM cache fileset {0}.

Description: A recovery process has ended on this AFM fileset.

Cause: N/A

User Action: N/A

afm_recovery_running STATE_CHANGE WARNING no Message: AFM fileset {0} is triggered for recovery start.

Description: A recovery process was started on this AFM cache fileset.

Cause: A recovery process was started on this AFM cache fileset.

User Action: The cache fileset state moves to the healthy state when
recovery is complete. Monitor this event.

afm_resync_needed STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the NeedsResync
state.

Description: The AFM cache fileset detects some accidental corruption of


data on the home server.

Cause: The AFM cache fileset detects some accidental corruption of data
on the home server.

User Action: Run the mmafmctl resync command to trigger a resync.


The fileset moves automatically to the ACTIVE state afterward.

afm_rpo_miss STATE_CHANGE_EXTERNAL WARNING no Message: The AFM recovery point objective (RPO) is missed for {id}.
Description: The primary fileset is triggering an RPO snapshot which is
expected to complete within a specified interval (RPO). This time interval is
exceeded.

Cause: The callback afmRPOMiss was triggered due to the network


delay, too much data to replicate, or an error during snapshot creation.

User Action: Ensure that the network connectivity is sufficient to transfer


all changes within the RPO interval and check that the Home and Cache
sites are operating correctly. If there are no issues, use the mmhealth
event resolve <fileset identifier> command to manually clear
this health event.


afm_rpo_sync INFO INFO no Message: An AFM RPO snapshot is completed {id}.

Description: The primary fileset is triggering an RPO snapshot at a
specified interval. The snapshot is completed successfully.

Cause: The callback afmDRRPOSync was triggered.

User Action: N/A

afm_sensors_active TIP INFO no Message: The AFM perfmon sensor {0} is active.

Description: The AFM perfmon sensors are active. This event's monitor is
running only once an hour.

Cause: The value of the AFM perfmon sensors' period attribute is greater
than 0.

User Action: N/A

afm_sensors_inactive TIP TIP no Message: The following AFM perfmon sensor {0} is inactive.

Description: The AFM perfmon sensors are inactive. This event's monitor is
running only once an hour.

Cause: The value of the AFM perfmon sensors' period attribute is 0.

User Action: Set the period attribute of the AFM sensors to a value
greater than 0. To do so, run the mmperfmon config update
SensorName.period=N command, where SensorName is one of the
AFM sensors' names and N is a natural number greater than 0. Alternatively,
you can hide this event by using the mmhealth event hide
afm_sensors_inactive command.

afm_sensors_not_configured TIP TIP no Message: The AFM perfmon sensor {0} is not configured.

Description: The AFM perfmon sensor does not exist in the mmperfmon
config show command output.

Cause: The AFM perfmon sensor is not configured in the sensors'


configuration file.

User Action: Include the sensors into the perfmon configuration by using
the mmperfmon config add --sensors SensorFile command.
An example for the configuration file can be found in the mmperfmon
command page.

Authentication events
The following table lists the events that are created for the AUTH component.
Table 73. Events for the Auth component

Event | Event Type | Severity | Call Home | Details

ad_smb_nfs_ready STATE_CHANGE INFO no Message: SMB and NFS monitoring has started.

Description: AD authentication for NFS is configured, and SMB and NFS


monitoring has started.

Cause: SMB and NFS monitoring has started.

User Action: N/A

ad_smb_not_yet_ready STATE_CHANGE WARNING no Message: AD authentication is configured, but the SMB monitoring is not
yet ready.

Description: AD authentication for NFS is configured, but SMB monitoring


is not yet ready.

Cause: SMB monitoring has not started yet, but NFS is ready to process
requests.

User Action: Check as to why the SMB or CTDB is not yet running. This
problem might be caused by a temporary issue.


ads_cfg_entry_warn INFO WARNING no Message: {0} returned unknown result for item {1} query.

Description: The external DNS server monitoring returned an unknown


result.

Cause: An internal error occurred while querying the external DNS server.

User Action: Perform troubleshooting to resolve the error.

ads_down STATE_CHANGE ERROR FTDC upload Message: The external ADS server is unresponsive.

Description: The external ADS server is unresponsive.

Cause: The local node is unable to connect to any Active Directory Service
(ADS) server.

User Action: Verify the network connection and check whether the ADS
server is operational.

ads_failed STATE_CHANGE ERROR FTDC upload Message: The local winbindd service is unresponsive.

Description: The local winbindd service is unresponsive.

Cause: The local winbindd service, which is needed for ADS, is not
responding to ping requests.

User Action: Restart the winbindd service. If the service restart is not
successful, then perform the winbindd troubleshooting.

ads_up STATE_CHANGE INFO no Message: The external ADS server is up.

Description: The external ADS server is operational.

Cause: N/A

User Action: N/A

ads_warn INFO WARNING no Message: The external ADS server monitoring returned an unknown result.

Description: The external ADS server monitoring returned an unknown


result.

Cause: An internal error occurred while monitoring the external ADS server.

User Action: Perform troubleshooting to resolve the error.

dns_found INFO_ADD_ENTITY INFO no Message: The nameserver {0} was found.

Description: A nameserver was found.

Cause: A nameserver was found.

User Action: N/A

dns_krb_tcp_dc_msdcs_down STATE_CHANGE WARNING no Message: {0} has no {1} declared.

Description: The required SRV item _kerberos._tcp.dc._msdcs.<Realm> is not declared in the DNS.

Cause: The AD server might not be fully usable.

User Action: Add the missing settings to the DNS.

dns_krb_tcp_dc_msdcs_up STATE_CHANGE INFO no Message: {0} has {1} declared.

Description: A required SRV item _kerberos._tcp.dc._msdcs.<Realm> is declared in the DNS.

Cause: A required AD server SRV item was found.

User Action: N/A

dns_krb_tcp_down STATE_CHANGE WARNING no Message: {0} has no {1} declared.

Description: The required SRV item _kerberos._tcp.<Realm> is not


declared in the DNS.

Cause: The AD server might not be fully usable.

User Action: Add the missing settings to the DNS.


dns_krb_tcp_up STATE_CHANGE INFO no Message: {0} has {1} declared.

Description: A required SRV item _kerberos._tcp.<Realm> is


declared in the DNS.

Cause: A required AD server SRV item was found.

User Action: N/A

dns_ldap_tcp_dc_msdcs_down STATE_CHANGE WARNING no Message: {0} has no {1} declared.

Description: The required SRV item _ldap._tcp.dc._msdcs.<Realm> is not declared in the DNS.

Cause: The ADS might not be fully usable.

User Action: Add the missing settings to the DNS.

dns_ldap_tcp_dc_msdcs_up STATE_CHANGE INFO no Message: {0} has {1} declared.

Description: A required SRV item _ldap._tcp.dc._msdcs.<Realm> is declared in the DNS.

Cause: A required ADS SRV item was found.

User Action: N/A

dns_ldap_tcp_down STATE_CHANGE WARNING no Message: {0} has no {1} declared.

Description: The required SRV item _ldap._tcp.<Realm> is not


declared in the DNS.

Cause: The AD server might not be fully usable.

User Action: Add the missing settings to the DNS.

dns_ldap_tcp_up STATE_CHANGE INFO no Message: {0} has {1} declared.

Description: A required SRV item _ldap._tcp.<Realm> is declared in


the DNS.

Cause: A required ADS SRV item was found.

User Action: N/A

dns_query_fail STATE_CHANGE WARNING no Message: The {0} query failed {1}.

Description: A nameserver did not provide the expected AD-related SRV


items.

Cause: A nameserver query for AD-specific SRV items failed.

User Action: Check the nameserver configuration for the missing AD-
specific settings.

dns_query_ok STATE_CHANGE INFO no Message: The {0} query is successful.

Description: A nameserver query for AD-specific settings is successful.

Cause: A nameserver query is successful.

User Action: N/A

dns_query_proto_fail TIP WARNING no Message: The {0} query failed for UDP and TCP.

Description: A nameserver did not react on a query protocol.

Cause: A nameserver query for UDP and TCP check failed.

User Action: Check the nameserver configuration for allowed protocols,


port, and firewall settings.

dns_query_proto_ok TIP INFO no Message: The {0} query succeeded with UDP and TCP protocols.

Description: A nameserver can be queried with UDP and TCP protocols.

Cause: A nameserver can be queried with UDP and TCP protocols.

User Action: N/A


dns_vanished INFO_DELETE_ENTITY INFO no Message: The nameserver {0} has vanished.

Description: A declared nameserver is not detected anymore.

Cause: A nameserver is not detected anymore, which could be a valid


situation.

User Action: Check the configuration, which could be a valid situation.

ldap_down STATE_CHANGE ERROR no Message: The external LDAP server {0} is unresponsive.

Description: The external LDAP server is unresponsive.

Cause: The local node is unable to connect to the LDAP server.

User Action: Verify the network connection and check whether the LDAP
server is operational.

ldap_up STATE_CHANGE INFO no Message: The external LDAP server {0} is up.

Description: The external LDAP server is operational.

Cause: N/A

User Action: N/A

nis_down STATE_CHANGE ERROR no Message: The external NIS server {0} is unresponsive.

Description: The external NIS server is unresponsive.

Cause: The local node is unable to connect to any Network Information


Service (NIS) server.

User Action: Verify the network connection and check whether the NIS
server is operational.

nis_failed STATE_CHANGE ERROR no Message: ypbind daemon is unresponsive.

Description: ypbind daemon is unresponsive.

Cause: The local ypbind daemon does not respond.

User Action: Restart ypbind daemon. If the restart is not successful, then
perform ypbind troubleshooting.

nis_up STATE_CHANGE INFO no Message: The external NIS server {0} is up.

Description: The external NIS server is operational.

Cause: N/A

User Action: N/A

nis_warn INFO WARNING no Message: The external NIS monitoring returned unknown result.

Description: The external NIS server monitoring returned an unknown


result.

Cause: An internal error occurred while monitoring external NIS server.

User Action: Perform troubleshooting to resolve the error.

sssd_down STATE_CHANGE ERROR no Message: The SSSD process is not running.

Description: The SSSD process is not running.

Cause: The SSSD authentication service is not running.

User Action: Perform troubleshooting to resolve the issue.

sssd_restart INFO INFO no Message: The SSSD process is not running. Trying to start the SSSD
process.

Description: Attempt to start the SSSD authentication process.

Cause: The SSSD process is not running.

User Action: N/A


sssd_up STATE_CHANGE INFO no Message: The SSSD process is now running.

Description: The SSSD authentication process is running.

Cause: N/A

User Action: N/A

sssd_warn INFO WARNING no Message: The SSSD process monitoring returned unknown result.

Description: The SSSD authentication process monitoring returned an


unknown result.

Cause: An internal error occurred while monitoring SSSD.

User Action: Perform troubleshooting to resolve the error.

wnbd_down STATE_CHANGE ERROR no Message: The WINBINDD process is not running.

Description: The WINBINDD authentication process is not running.

Cause: The WINBINDD authentication service is not running.

User Action: Verify configuration and Active Directory (AD) connection.

wnbd_restart INFO INFO no Message: WINBINDD process was not running. Trying to start the
WINBINDD process.

Description: Attempt to start the WINBINDD authentication process.

Cause: The WINBINDD process was not running.

User Action: N/A

wnbd_up STATE_CHANGE INFO no Message: The WINBINDD process is now running.

Description: The WINBINDD authentication service is operational.

Cause: N/A

User Action: N/A

wnbd_warn INFO WARNING no Message: The WINBINDD process monitoring returned unknown result.

Description: The WINBINDD process monitoring returned an unknown


result.

Cause: An internal error occurred while monitoring the WINBINDD


process.

User Action: Perform troubleshooting to resolve the error.

yp_down STATE_CHANGE ERROR no Message: The YPBIND process is not running.

Description: The YPBIND process is not running.

Cause: The YPBIND authentication service is not running.

User Action: Perform troubleshooting to resolve the issue.

yp_restart INFO INFO no Message: The YPBIND process was not running. Trying to start the YPBIND
process.

Description: Attempt to start the YPBIND process.

Cause: The YPBIND process was not running.

User Action: N/A

yp_up STATE_CHANGE INFO no Message: The YPBIND process is now running.

Description: The YPBIND process is operational.

Cause: N/A

User Action: N/A


yp_warn INFO WARNING no Message: The YPBIND process monitoring returned unknown result.

Description: The YPBIND process monitoring returned an unknown result.

Cause: An internal error occurred while monitoring the YPBIND process.

User Action: Perform troubleshooting to resolve the error.

Call Home events


The following table lists the events that are created for the call home component.
Table 74. Events for the Callhome component

Event | Event Type | Severity | Call Home | Details

callhome_customer_info_disabled TIP INFO no Message: The required customer information for call home is not checked.
Description: The required customer information for call home is not
checked.

Cause: Call Home capability is disabled.

User Action: N/A

callhome_customer_info_filled TIP INFO no Message: All required customer information for call home was provided.
Description: All required customer information for call home was provided.

Cause: All required customer information for call home was provided by
using the mmcallhome info change command.

User Action: N/A

callhome_customer_info_missing TIP TIP no Message: The required customer information is not provided: {0}.
Description: Some of the required customer information was not provided,
but is required. For more information, see the Configuring call home to
enable manual and automated data upload section in the IBM Storage
Scale: Administration Guide.

Cause: Some of the required customer information was not provided by


using the mmcallhome info change command.

User Action: Run the mmcallhome info change command to collect the
required information about the customer for call home capability.

callhome_hcalerts_ccr_failed STATE_CHANGE ERROR no Message: The data for the last health check monitoring cannot be updated
in the CCR.

Description: The data for the last health check monitoring cannot be
updated in the CCR.

Cause: The data for the last health check monitoring cannot be updated in
the Cluster Configuration Repository (CCR).

User Action: Ensure that your cluster has a quorum. If the cluster has
a quorum, then repair the CCR by following the steps mentioned in the
Repair of cluster configuration information when no CCR backup is available
section in the IBM Storage Scale: Problem Determination Guide.

callhome_hcalerts_disabled STATE_CHANGE INFO no Message: The health check monitoring feature is disabled.

Description: The health check monitoring feature is disabled.

Cause: Call Home capability is disabled.

User Action: To enable health check monitoring, you must enable call
home by using the mmcallhome capability enable command and set
'monitors_enabled = true' in the mmsysmonitor.conf file.


callhome_hcalerts_failed STATE_CHANGE ERROR no Message: The last health check monitoring was not successfully
processed.

Description: The last health check monitoring was not successfully


processed.

Cause: The last health check monitoring was not successfully processed.

User Action: Check connectivity to the IBM ECuRep server, which includes
cabling, firewall, and proxy. For more information, check the output
of the mmcallhome status list --task sendfile --verbose
command.

callhome_hcalerts_noop STATE_CHANGE INFO no Message: No health check monitoring operation is performed on this node.

Description: No health check monitoring operation is performed on this


node.

Cause: The health check monitoring is performed exclusively on the first


call home master node.

User Action: N/A

callhome_hcalerts_ok STATE_CHANGE INFO no Message: Call Home health check monitoring was successfully performed.

Description: The last health check monitoring was successfully performed.

Cause: The last health check monitoring was successfully performed.

User Action: N/A

callhome_heartbeat_collection_failed STATE_CHANGE ERROR no Message: The data for the last call home heartbeat cannot be collected.
Description: The data for the last call home heartbeat cannot be collected.

Cause: The data for the last call home heartbeat cannot be collected.

User Action: Check whether you have enough free space in the
dataStructureDump.

callhome_heartbeat_disabled STATE_CHANGE INFO no Message: The heartbeat feature is disabled.

Description: The heartbeat feature is disabled.

Cause: Call Home capability is disabled.

User Action: To enable heartbeats, you must enable the call home
capability by using the mmcallhome capability enable command.

callhome_heartbeat_failed STATE_CHANGE ERROR no Message: The last call home heartbeat was not successfully sent.

Description: The last call home heartbeat was not successfully sent.

Cause: The last call home heartbeat was not successfully sent.

User Action: Check connectivity to the IBM ECuRep server, which includes
cabling, firewall, and proxy. For more information, check the output
of the mmcallhome status list --task sendfile --verbose
command.

callhome_heartbeat_ok STATE_CHANGE INFO no Message: Call Home heartbeats are successfully sent.

Description: The last call home heartbeat was successfully sent.

Cause: The last call home heartbeat was successfully sent.

User Action: N/A

callhome_ptfupdates_ccr_failed STATE_CHANGE ERROR no Message: The data for the last ptf update check cannot be updated in the CCR.

Description: The data for the last ptf update check cannot be updated in
the CCR.

Cause: The data for the last ptf update check cannot be updated in the
CCR.

User Action: Ensure that your cluster has a quorum. If the cluster has
a quorum, then repair the CCR by following the steps mentioned in the
Repair of cluster configuration information when no CCR backup is available
section of the IBM Storage Scale: Problem Determination Guide .


callhome_ptfupdates_disabled STATE_CHANGE INFO no Message: The ptf update check feature is disabled.
Description: The ptf update check feature is disabled.

Cause: Call Home capability is disabled.

User Action: To enable ptf update, you must enable call home by using the
mmcallhome capability enable command and set 'monitors_enabled
= true' in the mmsysmonitor.conf file.

callhome_ptfupdates_failed STATE_CHANGE ERROR no Message: The last ptf update check was not successfully processed.

Description: The last ptf update check was not successfully processed.

Cause: The last ptf update check was not successfully processed.

User Action: Check connectivity to the IBM ECuRep server, which includes
cabling, firewall, and proxy. For more information, check the output
of the mmcallhome status list --task sendfile --verbose
command.

callhome_ptfupdates_noop STATE_CHANGE INFO no Message: No Call Home ptf update check operation on this node.

Description: No Call Home ptf update check operation is performed on this


node.

Cause: The call home ptf update check is performed exclusively on the first
call home master node considering that it is not running in the cloud native
storage architecture.

User Action: N/A

callhome_ptfupdates_ok STATE_CHANGE INFO no Message: Call Home ptf update check was successfully performed.

Description: The last ptf update check was successfully performed.

Cause: The last ptf update check was successfully performed.

User Action: N/A

callhome_sudouser_defined STATE_CHANGE INFO no Message: The sudo user variable is properly set up.

Description: The sudo user variable is properly set up.

Cause: The mmsdrquery sdrq_cluster_info sudoUser command


reports a non-root sudo user, which exists on this node.

User Action: N/A

callhome_sudouser_not_exists STATE_CHANGE ERROR no Message: The sudo user '{0}' does not exist on this node.
Description: The sudo user does not exist on this node.

Cause: The sudo user, specified in the mmsdrquery


sdrq_cluster_info sudoUser command, does not exist on this node.

User Action: Create the sudo user on this node or specify a new sudo user
by using the mmchcluster --sudo-user <userName> command.

callhome_sudouser_not_needed STATE_CHANGE INFO no Message: Not monitoring sudo user configuration variable, since sudo wrappers are not being used.

Description: Not monitoring sudo user configuration variable, since sudo


wrappers are not being used.

Cause: The mmsdrquery sdrq_cluster_info sdrq_rsh_path


command reports that sudo wrappers are not being used.

User Action: N/A

callhome_sudouser_perm_missing STATE_CHANGE ERROR no Message: The sudo user is missing a recursive execute permission for the dataStructureDump: {0}.

Description: The sudo user is missing a recursive execute permission for


the dataStructureDump.

Cause: The sudo user, specified in the IBM Storage Scale configuration,
cannot read or write call home directories in dataStructureDump.

User Action: Ensure that the sudo user, specified in the IBM
Storage Scale settings, has the recursive execute permission for the
dataStructureDump.


callhome_sudouser_perm_not_needed STATE_CHANGE INFO no Message: Not monitoring sudo user permissions, since sudo wrappers are not being used.

Description: Not monitoring sudo user permissions, since sudo wrappers


are not being used.

Cause: The mmsdrquery sdrq_cluster_info sdrq_rsh_path


command reports that sudo wrappers are not being used.

User Action: N/A

callhome_sudouser_perm_ok STATE_CHANGE INFO no Message: The sudo user has correct permissions for the
dataStructureDump: {0}.

Description: The sudo user has correct permissions for the


dataStructureDump.

Cause: The sudo user, specified in the IBM Storage Scale configuration, can
read and write call home directories in the dataStructureDump.

User Action: N/A

callhome_sudouser_undefined STATE_CHANGE ERROR no Message: The sudo user variable is not set up in the IBM Storage Scale configuration.

Description: The sudo user variable is not set up in the IBM Storage Scale
configuration.

Cause: The mmsdrquery sdrq_cluster_info sudoUser command


reports 'root' or a sudo user is not set up.

User Action: Specify a valid non-root sudo user by using the mmchcluster
--sudo-user <userName> command.

CES network events


The following table lists the events that are created for the CES Network component.
Table 75. Events for the CES network component

Event | Event Type | Severity | Call Home | Details

ces_bond_degraded STATE_CHANGE WARNING no Message: Some secondaries of the CES-bond {0} went down.

Description: Some of the CES-bond parts are malfunctioning.

Cause: Some secondaries of the network bond are not functioning


properly.

User Action: Check the bonding configuration, network configuration, and


cabling of the malfunctioning secondaries of the bond.

ces_bond_down STATE_CHANGE ERROR no Message: All secondaries of the CES-bond {0} are down.

Description: All secondaries of a CES-network bond are down.

Cause: All secondaries of this network bond went down.

User Action: Check the bonding configuration, network configuration, and


cabling of all secondaries of the CES-bond.

ces_bond_up STATE_CHANGE INFO no Message: All secondaries of the CES bond {0} are working as expected.

Description: This CES bond is functioning properly.

Cause: N/A

User Action: N/A

ces_disable_node_network INFO_EXTERNAL INFO no Message: Network was disabled.

Description: Clean up by using the mmchnode --ces-disable


command. Accordingly, the network configuration is modified.

Cause: Informational message, which is cleaned up by using the


mmchnode --ces-disable command.

User Action: N/A


ces_enable_node_network INFO_EXTERNAL INFO no Message: Network was enabled.

Description: Called to handle any network-specific issues involved after


running the mmchnode --ces-enable command. Accordingly, the
network configuration is modified.

Cause: Informational message, which is called by using the mmchnode


--ces-enable command.

User Action: N/A

ces_ips_hostable TIP INFO no Message: All declared CES-IPs could be hosted on this node.

Description: All declared CES-IPs could be hosted on this node.

Cause: N/A

User Action: N/A

ces_ips_not_hostable TIP TIP no Message: One or more CES IP cannot be hosted on this node (no interface).

Description: One or more CES IP address cannot be hosted on this node


because no active network interface for this subnet is available.

Cause: One or more CES IP address cannot be hosted on this node.

User Action: Check whether interfaces are active and the CES group
assignment is correct. For more information, run the mmces address
list --full-list command.

ces_load_monitord_bad STATE_CHANGE WARNING no Message: Execution failed. Exit code: ${0}.

Description: CES monitor daemon cannot run the ces load monitor
successfully.

Cause: An issue with automated ip distribution is detected.

User Action: If CES IP distribution causes a problem, then set address


policy to node affinity and move ips manually.

ces_load_monitord_good STATE_CHANGE INFO no Message: Execution successfully. Exit code: ${0}.

Description: CES monitor daemon runs ces load monitor with success.

Cause: N/A

User Action: N/A

ces_many_tx_errors STATE_CHANGE ERROR FTDC upload Message: CES NIC {0} had many TX errors since the last monitoring cycle.

Description: This CES-related NIC had many TX errors since the last
monitoring cycle.

Cause: The /proc/net/dev lists more TX errors for this adapter since the
last monitoring cycle.

User Action: Check the network cabling and network infrastructure.

ces_monitord_down STATE_CHANGE WARNING no Message: The CES-IP background monitor is not running. CES-IPs cannot
be configured.

Description: The CES-IP background monitor is not running. It might be


terminated.

Cause: The CES-IP background monitor is not running.

User Action: If the CES-IP background monitor stops without any known
reason, check the local /var file system. Restart it by using the mmces
node resume --start command.

ces_monitord_ok STATE_CHANGE INFO no Message: The CES-IP background monitor is running.

Description: The CES-IP background monitor is running.

Cause: N/A

User Action: N/A


ces_monitord_warn INFO WARNING no Message: The IBM Storage Scale CES IP assignment monitor
(mmcesmonitord) alive check cannot be executed, which can be a timeout
issue.

Description: The CES IP monitor daemon state cannot be determined due


to a problem.

Cause: The CES IP monitor daemon state cannot be determined due to a


problem.

User Action: Find potential issues for this kind of


failure in the /var/adm/ras/mmfs.log and /var/adm/ras/
mmsysmonitor.log files.

ces_network_affine_ips_not_defined STATE_CHANGE WARNING no Message: No CES IP addresses can be applied on this node. Check group membership of node and IP addresses.

Description: No CES IP addresses found, which can be hosted on this


node. IPs should be in the global pool or in the group for which this node is
a member.

Cause: No valid CES IP found which could be hosted on this node.

User Action: Use the mmces address add to add CES IP addresses to
the global pool or to a group for which this node is a member.

ces_network_connectivity_down STATE_CHANGE ERROR no Message: CES NIC {0} cannot connect to the gateway.
Description: This CES-related NIC cannot connect to the gateway.

Cause: The gateway does not respond to the sent connections-checking


packets.

User Action: Check the network configuration of the network adapter, path
to the gateway, and gateway itself.

ces_network_connectivity_up STATE_CHANGE INFO no Message: CES NIC {0} can connect to the gateway and responds to the sent
connections-checking packets.

Description: This CES-related network adapter can connect to the


gateway.

Cause: N/A

User Action: N/A

ces_network_down STATE_CHANGE ERROR no Message: CES NIC {0} is down.

Description: This CES-related network adapter is down.

Cause: This network adapter was disabled.

User Action: Enable this network adapter or check for problems in system
logs.

ces_network_found INFO_ADD_ENTITY INFO no Message: CES NIC {0} was found.

Description: A new CES-related network adapter was found.

Cause: A new NIC, which is relevant for the CES, is listed by ip a.

User Action: N/A

ces_network_ips_down STATE_CHANGE WARNING no Message: No CES IPs were assigned to this node.

Description: No CES IPs were assigned to any network adapter of this


node.

Cause: No network adapters have the CES-relevant IPs, which makes the
node unavailable for the CES clients.

User Action: If CES is FAILED, then analyse the reason. If there are not
enough IPs in the CES pool for this node, then extend the pool.

ces_network_ips_not_assignable STATE_CHANGE ERROR FTDC upload Message: No NICs are set up for CES.

Description: No network adapters are properly configured for CES.

Cause: There are no network adapters with a static IP, matching any of the
IPs from the CES pool.

User Action: Set up the static IPs and netmasks of the CES NICs in the
network interface configuration scripts, or add new matching CES IPs to the
pool. The static IPs must not be aliased.

ces_network_ips_not_defined STATE_CHANGE WARNING no Message: No CES IP addresses have been defined.

Description: No CES IP addresses have been defined. Run the mmces
address add command to add CES IP addresses.

Cause: No CES IP defined, but at least one CES IP is needed.

User Action: Run the mmces address add command to add CES IP
addresses. Check the group membership of IP addresses and nodes.

ces_network_ips_up STATE_CHANGE INFO no Message: CES-relevant IPs are served by found NICs.

Description: CES-relevant IPs are served by network adapters.

Cause: N/A

User Action: N/A

ces_network_link_down STATE_CHANGE ERROR no Message: Physical link of the CES NIC {0} is down.

Description: The physical link of this CES-related adapter is down.

Cause: The LOWER_UP flag is not set for this NIC in the output of ip a.

User Action: Check the network cabling and network infrastructure.

ces_network_link_up STATE_CHANGE INFO no Message: Physical link of the CES NIC {0} is up.

Description: The physical link of this CES-related adapter is up.

Cause: N/A

User Action: N/A

ces_network_monitord_bad STATE_CHANGE WARNING no Message: Execution failed. Exit code: ${0}.

Description: CES monitor daemon could not run ces network monitor
successfully.

Cause: An issue with automated ces network verification is detected.

User Action: If protocol IO is failing, then suspend this node.

ces_network_monitord_good STATE_CHANGE INFO no Message: Executed successfully. Exit code: {0}.

Description: The CES monitor daemon runs the network monitor


successfully. Automated network verification is working.

Cause: N/A

User Action: N/A

ces_network_up STATE_CHANGE INFO no Message: CES NIC {0} is up.

Description: This CES-related network adapter is up.

Cause: N/A

User Action: N/A

ces_network_vanished INFO_DELETE_ENTITY INFO no Message: CES NIC {0} has vanished.

Description: One of CES-related network adapters cannot be detected


anymore.

Cause: One of the previously monitored NICs is not listed by ip a


anymore.

User Action: N/A

ces_no_tx_errors STATE_CHANGE INFO no Message: CES NIC {0} had no or a tiny number of TX errors.

Description: A CES-related NIC had no or an insignificant number of TX


errors.

Cause: N/A

User Action: N/A

ces_startup_network INFO_EXTERNAL INFO no Message: CES network service was started.

Description: The CES network has started.

Cause: CES network IPs are started.

User Action: N/A

dir_sharedroot_perm_ok STATE_CHANGE INFO no Message: The permissions of the sharedroot directory are correct: {0}.

Description: The permissions of the sharedroot directory are sufficient.

Cause: N/A

User Action: N/A

dir_sharedroot_perm_problem STATE_CHANGE WARNING no Message: The permissions of the sharedroot directory are not sufficient {0}.

Description: The cesSharedRoot directory must have 'rx' permissions for


'group' and 'others', since these are required for NFS 'rpcuser' to access its
data inside of cesSharedRoot.

Cause: The cesSharedRoot directory did not have 'rx' permissions for
'group' and 'others'.

User Action: Provide 'rx' permissions for 'group' and 'others' for the
cesSharedRoot directory.

handle_network_problem_info INFO_EXTERNAL INFO no Message: Handle network problem - Problem: {0}, Argument: {1}.

Description: Information about network-related reconfigurations. For
example, enable or disable IPs, and assign or unassign IPs.

Cause: A change in the network configuration. Details are part of the


information message.

User Action: N/A

move_cesip_from INFO_EXTERNAL INFO no Message: Address {0} was moved from this node to {1}.

Description: A CES IP address was moved from the current node to


another node.

Cause: Rebalancing of CES IP addresses.

User Action: N/A

move_cesip_to INFO_EXTERNAL INFO no Message: Address {0} was moved from {1} to this node.

Description: A CES IP address was moved from another node to the


current node.

Cause: Rebalancing of CES IP addresses.

User Action: N/A

move_cesips_info INFO_EXTERNAL INFO no Message: A move request for ip addresses was executed. Reason {0}.

Description: CES IP addresses can be moved in case of node failovers from


one node to one or more other nodes. This message is logged on a node
that is observing this, not necessarily on any affected node.

Cause: A CES IP movement was detected.

User Action: N/A
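
Note: Most of the CES network events in this table point back to the same few diagnostic commands. As a minimal triage sketch for CES IP problems on a protocol node (the exact output layout can vary between releases), you might run:

   # List all declared CES addresses, their groups, and the nodes that currently host them
   mmces address list --full-list
   # Check which network interfaces are active on this node
   ip a
   # If the CES-IP background monitor is reported as down, resume the node to restart it
   mmces node resume --start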



CESIP events
The following table lists the events that are created for the CESIP component.
Table 76. Events for the CESIP component

Event | Event Type | Severity | Call Home | Details

ces_ips_warn INFO WARNING no Message: The IBM Storage Scale CES IP assignment monitor cannot be
executed, which can be a timeout issue.

Description: Check the CES IP assignment state that returned an unknown


result. This might be a temporary issue, like a timeout during the check
procedure.

Cause: The CES IP assignment state cannot be determined due to a


problem.

User Action: Find potential issues for this kind of failure in


the /var/adm/ras/mmsysmonitor.log file.

ces_ips_all_unassigned STATE_CHANGE ERROR no Message: All {0} declared CES IPs are unassigned.

Description: All of the declared CES IPs are unassigned.

Cause: All declared CES IP addresses are unassigned.

User Action: Check the configuration of network interfaces. For more


information, run the mmces address list command.

ces_ips_assigned STATE_CHANGE INFO no Message: All {0} expected CES IPs are assigned.

Description: All declared CES IPs are assigned.

Cause: There are no unassigned CES IP addresses.

User Action: N/A

ces_ips_unassigned STATE_CHANGE WARNING no Message: {0} of {1} declared CES IPs are unassigned.

Description: Not all of the declared CES IPs are assigned.

Cause: There are unassigned CES IP addresses.

User Action: Check the configuration of network interfaces. For more


information, run the mmces address list command.
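
Note: The ces_ips_* events are all derived from the assignment state of the declared CES addresses. As a sketch, listing the addresses and looking for entries without a hosting node covers the unassigned cases:

   # Addresses that are not hosted by any node correspond to the
   # ces_ips_unassigned and ces_ips_all_unassigned events
   mmces address list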

Cluster state events


The following table lists the events that are created for the Cluster state component.
Table 77. Events for the cluster state component

Event | Event Type | Severity | Call Home | Details

cluster_state_manager_resend INFO INFO no Message: The CSM requests resending all information.

Description: The CSM requests resending all information.

Cause: The CSM is missing information about this node.

User Action: N/A

cluster_state_manager_reset INFO INFO no Message: Clear memory of cluster state manager for this node.

Description: A reset request for the monitor state manager was received.

Cause: A reset request for the monitor state manager was received.

User Action: N/A

component_state_change INFO INFO no Message: The state of component {0} changed to {1}.

Description: The state of a component changed.

Cause: An event was detected by the system health framework that


triggered a state change for a component.

User Action: N/A

entity_state_change INFO INFO no Message: The state of {0} {1} of the component {2} changed to {3}.

Description: The state of an entity changed.

Cause: An event was detected by the system health framework that


triggered a state change for an entity.

User Action: N/A

eventlog_cleared INFO INFO no Message: On the node {0}, the eventlog was cleared.

Description: The user cleared the eventlog with the mmhealth node
eventlog --clearDB command. This command also clears the events
of the mmces events list command.

Cause: The user cleared the eventlog.

User Action: N/A

heartbeat STATE_CHANGE INFO no Message: Node {0} sent a heartbeat.

Description: The node is alive.

Cause: The cluster node sent a heartbeat to the CSM.

User Action: N/A

heartbeat_missing STATE_CHANGE ERROR no Message: CSM is missing a heartbeat from the node {0}.

Description: The Cluster State Manager (CSM) is missing a heartbeat from


the specified node.

Cause: The specified cluster node did not send a heartbeat to the Cluster
State Manager (CSM).

User Action: Check network connectivity of the node. Check whether


the monitor is running there (by using the mmsysmoncontrol status
command).

heartbeat_missing_server_unreachable STATE_CHANGE ERROR no Message: CSM is missing a heartbeat from node {0}, which might be due to
the node, the network, or the processes being down.

Description: The Cluster State Manager (CSM) is missing a heartbeat from


the specified node, which might be because the server or network is down.

Cause: The specified cluster node cannot be contacted to have it send a


heartbeat to the Cluster State Manager (CSM).

User Action: Check network connectivity to the node. Check the


operational state of the node. Check that the IBM Storage Scale
processes are running and communicating with the cluster (mmgetstate
-Lv, mmsysmoncontrol status).

node_resumed STATE_CHANGE INFO no Message: Node {0} is not suspended anymore.

Description: The node is resumed after it was suspended.

Cause: The cluster node was resumed after being suspended.

User Action: N/A

node_state_change INFO INFO no Message: The state of this node is changed to {0}.

Description: The state of this node changed.

Cause: An event was detected by the system health framework that


triggered a state change for this node.

User Action: N/A

node_suspended STATE_CHANGE INFO no Message: Node {0} is suspended.

Description: The node is suspended.

Cause: The cluster node is now suspended.

User Action: Run the mmces node resume command to stop the node
from being suspended.

service_added INFO INFO no Message: On the node {0}, the {1} monitor was started.

Description: A new monitor was started by the Sysmonitor daemon.

Cause: A new monitor was started.

User Action: N/A

service_disabled STATE_CHANGE INFO no Message: The service {0} is disabled.

Description: The service is disabled.

Cause: The service is disabled.

User Action: Run the mmces service enable <service> command to


enable the service.

service_no_pod_data STATE_CHANGE WARNING no Message: A request to {id} did not yield expected health data.

Description: A check on the service did not work.

Cause: The service is running in a different POD and does not respond to
requests that are regarding its health state.

User Action: Check that all pods are running in the container environment.
The event can be manually cleared by using the mmhealth event
resolve service_no_pod_data <id> command.

service_pod_data STATE_CHANGE INFO no Message: The request to {id} did return health data as expected.

Description: A check on the service did work as expected.

Cause: The service is running in a different POD and does respond to


requests that are regarding its health state.

User Action: N/A

service_removed INFO INFO no Message: On the node {0} the {1} monitor was removed.

Description: A monitor was removed by Sysmonitor.

Cause: A monitor was removed.

User Action: N/A

service_reset STATE_CHANGE INFO no Message: The service {0} on node {1} was reconfigured, and its events were
cleared.

Description: All current service events were cleared.

Cause: The service was reconfigured.

User Action: N/A

service_running STATE_CHANGE INFO no Message: The service {0} is running on node {1}.

Description: The service is not stopped or disabled anymore.

Cause: The service is not stopped or disabled anymore.

User Action: N/A

service_stopped STATE_CHANGE INFO no Message: The service {0} is stopped on node {1}.

Description: The service is stopped.

Cause: The service was stopped.

User Action: Run the mmces service start <service> command to


start the service.

singleton_sensor_off INFO INFO no Message: The singleton sensors of pmsensors are turned off.

Description: The pmsensors' configuration is reloaded. This node is not


configured to start the singleton sensors anymore.

Cause: This node was assigned as singleton sensor before.


However, it does not satisfy the requirements for a singleton sensor
anymore (perfmon designation, PERFMON component HEALTHY, GPFS
component HEALTHY).

User Action: N/A

singleton_sensor_on INFO INFO no Message: The singleton sensors of pmsensors are turned on.

Description: The pmsensors' configuration is reloaded. This node is now


configured to start the singleton sensors.

Cause: Another node was assigned as a singleton sensor before. However,


it does not satisfy the requirements for a singleton sensor anymore
(perfmon designation, PERFMON component HEALTHY, GPFS component
HEALTHY). This node was assigned as new singleton sensor node.

User Action: N/A

webhook_url_abort INFO WARNING no Message: Webhook URL {0} was disabled because a fatal runtime
error was encountered. For more information, see the monitoring logs
in /var/adm/ras/mmsysmonitor.log.

Description: The system health framework encountered a fatal runtime


error that forced it to stop activity to this webhook URL.

Cause: The system health framework encountered a fatal runtime error


when it was sending events to a webhook URL.

User Action: Check that the webhook URL is reachable and re-enable the
URL by using the mmhealth config webhook add command.

webhook_url_communication INFO INFO no Message: Webhook URL {0} was not able to receive event information.

Description: The system health framework was not able to send event
information to a configured webhook URL.

Cause: The system health framework was not able to send event
information.

User Action: N/A

webhook_url_disabled INFO WARNING no Message: Webhook URL {0} was disabled as too many failures occurred.

Description: The system health framework could not repeatedly contact a


webhook URL.

Cause: The system health framework could not repeatedly contact a


webhook URL.

User Action: Check that the webhook URL is reachable and re-enable the
URL by using the mmhealth config webhook add command.

webhook_url_reset INFO INFO no Message: Webhook URL {0} communication was set back to a HEALTHY
state.

Description: The system health framework set this webhook URL status
back to a HEALTHY state after being disabled because of repeated failures.

Cause: The system health framework set this webhook URL status back to
a HEALTHY state.

User Action: N/A

webhook_url_restored INFO INFO no Message: Webhook URL {0} communication was restored and event
information was successfully sent.

Description: The system health framework was able to send event


information to the webhook URL after a previous failure.

Cause: The system health framework was able to send event information
to the webhook URL.

User Action: N/A

webhook_url_ssl_validation INFO WARNING no Message: Communication to webhook URL {} was established, but Server-
Side certificate validation failed and was disabled. Check the HTTPS server
configuration to ensure that this disabling is the intended behavior.

Description: The system health framework failed to validate the Server-


Side certificate.

Cause: The system health framework failed to validate the Server-Side


certificate.

User Action: Check that the webhook URL has a valid SSL certificate
and re-enable the URL by using the mmhealth config webhook add
command.
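
Note: For the heartbeat_missing and heartbeat_missing_server_unreachable events, the user actions name two checks that can be run directly on the affected node. A minimal sketch, assuming shell access to that node:

   # Verify that the IBM Storage Scale daemon is active and joined to the cluster
   mmgetstate -Lv
   # Verify that the system health monitor itself is running
   mmsysmoncontrol status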



Disk events
The following table lists the events that are created for the Disk component.
Table 78. Events for the Disk component

Event | Event Type | Severity | Call Home | Details

disc_recovering STATE_CHANGE WARNING no Message: Disk {0} is reported as recovering.

Description: A disk is in recovering state.

Cause: A disk is in recovering state.

User Action: If the recovering state is unexpected, then refer to the Disk
issues section in the IBM Storage Scale: Problem Determination Guide.

disc_unrecovered STATE_CHANGE WARNING no Message: Disk {0} is reported as unrecovered.

Description: A disk is in unrecovered state.

Cause: A disk is in unrecovered state. The metadata scan might have failed.

User Action: If the unrecovered state is unexpected, then refer to the Disk
issues section in the IBM Storage Scale: Problem Determination Guide.

disk_down STATE_CHANGE WARNING no Message: Disk {0} is reported as not up.

Description: A disk is reported as down.

Cause: This can indicate a hardware issue.

User Action: If the down state is unexpected, then refer to the Disk issues
section in the IBM Storage Scale: Problem Determination Guide. The failed
disk might be a descriptor disk.

disk_down_change STATE_CHANGE INFO no Message: Disk {0} is reported as down because the configuration changed.
FS={1}, reason code={2}.

Description: A disk is reported as down because the configuration changed


by using the mmchdisk command.

Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because the configuration was changed.

User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_down_del STATE_CHANGE INFO no Message: Disk {0} is reported as down because it was deleted. FS={1},
reason code={2}.

Description: A disk is reported as down because it was deleted by using


the mmdeldisk command.

Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because it was deleted.

User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_down_io STATE_CHANGE ERROR no Message: Disk {0} is reported as down because of an I/O issue. FS={1},
reason code={2}.

Description: A disk is reported as down because of an I/O issue.

Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because of an I/O issue.

User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_down_rpl STATE_CHANGE INFO no Message: Disk {0} is reported as down because it was replaced. FS={1},
reason code={2}.

Description: A disk is reported as down because it was replaced by using


the mmrpldisk command.

Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because it was replaced.

User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_down_unexpected STATE_CHANGE ERROR no Message: Disk {0} is reported as unexpected down. FS={1},
reasoncode={2}.

Description: A disk is reported as unexpected down.

Cause: An IBM Storage Scale callback event reported a disk in the down state
for an unexpected reason.

User Action: If the down state is unexpected, then refer to the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_down_unknown STATE_CHANGE WARNING no Message: Disk {0} is reported as down for unknown reason. FS={1},
reasoncode={2}.

Description: A disk is reported as down for unknown reason. The disk was
probably stopped.

Cause: An IBM Storage Scale callback event reported a disk in the down state
for an unknown reason. The disk was probably stopped or suspended.

User Action: If the down state is unexpected, then refer to the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_failed_cb INFO_EXTERNAL INFO no Message: Disk {0} is reported as failed. FS={1}. Affected NSD servers are
notified about the disk_down state.

Description: A disk is reported as failed.

Cause: An IBM Storage Scale callback event reported a disk failure.

User Action: If the failure state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.

disk_found INFO_ADD_ENTITY INFO no Message: The disk {0} was found.

Description: A disk was detected.

Cause: A disk was detected.

User Action: N/A

disk_fs_desc_missing STATE_CHANGE INFO no Message: Device {0} has no desc disks assigned in failure group(s) {1}.

Description: Failure groups have no assigned descriptor disks as reported


by the mmlsdisk command. This is usually a normal condition.

Cause: GPFS device has failure groups with no desc disks.

User Action: If the missing descriptor disks in the failure groups is


unexpected, then refer to the Disk issues section in the IBM Storage Scale:
Problem Determination Guide.

disk_fs_desc_ok STATE_CHANGE INFO no Message: Device {0} descriptor disks identified for all failure groups.

Description: GPFS device has descriptor disks identified for all failure
groups as reported by the mmlsdisk command.

Cause: GPFS device has descriptor disks for all failure groups.

User Action: N/A

disk_io_err_cb STATE_CHANGE ERROR no Message: Disk {0} is reported as I/O error. node={1}.

Description: A disk state is reported as I/O error.

Cause: An IBM Storage Scale callback event reported a disk I/O error.

User Action: For more information, see the Disk issues section in the IBM
Storage Scale: Problem Determination Guide.

disk_up STATE_CHANGE INFO no Message: Disk {0} is up.

Description: Disk is up.

Cause: A disk was detected in up state.

User Action: N/A

disk_vanished INFO_DELETE_ENTITY INFO no Message: The disk {0} has vanished.

Description: A declared disk was not detected.

Cause: A disk is not in use for an IBM Storage Scale file system, which can
be a valid situation.

User Action: N/A
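
Note: Before following the Disk issues section that the disk_down_* events reference, it can help to capture the current disk view of the affected file system. A hedged example, where gpfs0 is a placeholder for your file system device name:

   # Show the status and availability of every disk in the file system
   mmlsdisk gpfs0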

Enclosure events
The following table lists the events that are created for the Enclosure component.
Table 79. Events for the enclosure component

Event | Event Type | Severity | Call Home | Details

adapter_bios_notavail STATE_CHANGE WARNING no Message: The BIOS level of adapter {0} is not available.

Description: The BIOS level of the adapter is not available. A BIOS update
might solve the problem.

Cause: The mmlsfirmware -Y command reports no information about


the BIOS firmware level.

User Action: Issue the mmlsfirmware command. For more information,


see the IBM Storage Scale: Problem Determination Guide of the relevant
system. Follow the maintenance procedures for updating the BIOS
firmware. If the issue persists, then contact IBM support.

adapter_bios_ok STATE_CHANGE INFO no Message: The BIOS level of adapter {0} is correct.

Description: The BIOS level of the adapter is correct.

Cause: N/A

User Action: N/A

adapter_bios_wrong STATE_CHANGE WARNING no Message: The BIOS level of adapter {0} is wrong.

Description: The BIOS level of the adapter is not correct. The BIOS
firmware needs an update.

Cause: The mmlsfirmware -Y command reports that the BIOS firmware


is not up to date.

User Action: Issue the mmlsfirmware command. For more information,


see the IBM Storage Scale: Problem Determination Guide of the relevant
system. Follow the maintenance procedures for updating the BIOS
firmware. If the issue persists, then contact IBM support.

adapter_firmware_notavail STATE_CHANGE WARNING no Message: The firmware level of adapter {0} is not available.

Description: No or insufficient information about the adapter firmware is


available. An update might solve the problem.

Cause: The mmlsfirmware -Y command reports no information about


the adapter firmware level.

User Action: Issue the mmlsfirmware command. For more information,


see the IBM Storage Scale: Problem Determination Guide of the relevant
system. Follow the maintenance procedures for updating the adapter
firmware. Contact IBM support if you cannot solve the problem.

adapter_firmware_ok STATE_CHANGE INFO no Message: The firmware level of adapter {0} is correct.

Description: The firmware level of the adapter is correct.

Cause: N/A

User Action: N/A

adapter_firmware_wrong STATE_CHANGE WARNING no Message: The firmware level of adapter {0} is wrong.

Description: The firmware level of the adapter is not correct. The adapter
firmware needs an update.

Cause: The mmlsfirmware -Y command reports a wrong firmware
level for the adapter.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. Follow the maintenance
procedures for updating the adapter firmware.

current_failed STATE_CHANGE ERROR no Message: Current sensor {0} measured wrong current.

Description: A current sensor might be broken and should be replaced.

Cause: The mmlsenclosure all -L -Y command reports that a current


sensor has measured a wrong current.

User Action: Issue the mmlsenclosure all -L command to get more


details. If the problem persists, then contact IBM support.

current_ok STATE_CHANGE INFO no Message: currentSensor {0} is OK.

Description: The currentSensor state is OK.

Cause: N/A

User Action: N/A

current_warn STATE_CHANGE WARNING no Message: Current sensor {0} might be facing an issue.

Description: A current sensor has measured a value outside the warning


limits.

Cause: The mmlsenclosure all -L -Y command reports a warning


about a current sensor.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

dcm_drawer_open STATE_CHANGE WARNING no Message: DCM {0} drawer is open.

Description: The Drawer Control Module (DCM) drawer is open.

Cause: The mmlsenclosure all -L -Y command reports that the DCM


drawer is open.

User Action: Close the DCM drawer.

dcm_failed STATE_CHANGE WARNING no Message: DCM {0} is FAILED.

Description: The Drawer Control Module (DCM) state is FAILED.

Cause: The mmlsenclosure all -L -Y command reports that the DCM


is FAILED.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

dcm_not_available STATE_CHANGE WARNING no Message: DCM {0} is not available.

Description: The Drawer Control Module (DCM) is not installed or not


responding.

Cause: The mmlsenclosure all -L -Y command reports that the DCM


component is not available.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

dcm_ok STATE_CHANGE INFO no Message: DCM {id[1]} is OK.

Description: The DCM state is OK.

Cause: N/A

User Action: N/A

dimm_config_mismatch STATE_CHANGE_EXTERNAL ERROR no Message: Enclosure has an inconsistent DIMM configuration.

Description: Inconsistent DIMM configuration.

Cause: The DIMMs in the enclosure do not fit to each other.

User Action: Verify that all DIMMs in all canisters have the same
specification (size, speed, etc). The event can be manually cleared by using
the mmhealth event resolve dimm_config_mismatch command.

dimm_config_ok STATE_CHANGE_EXTERNAL INFO no Message: Enclosure has a correct DIMM configuration.

Description: The DIMM configuration is OK.

Cause: N/A

User Action: N/A

door_failed STATE_CHANGE ERROR no Message: Door {0} state is FAILED.

Description: The door state is reported as failed.

Cause: The mmlsenclosure all -L -Y command reports that the door


is in the failed state.

User Action: Run the mmlsenclosure all -L command to see further


details.

door_is_absent STATE_CHANGE ERROR no Message: Door {0} state is absent.

Description: The door state is reported as absent.

Cause: The mmlsenclosure all -L -Y command reports that the door


is absent.

User Action: Install the enclosure door. Verify the door state by using the
mmlsenclosure all -L command. For more help, contact IBM support.

door_is_open STATE_CHANGE ERROR no Message: Door {0} state is open.

Description: The door state is reported as open.

Cause: The mmlsenclosure all -L -Y command reports that the door


is open.

User Action: Close the enclosure door. Verify the door state by using the
mmlsenclosure all -L command.

door_ok STATE_CHANGE INFO no Message: Door {0} is OK.

Description: The door state is OK.

Cause: N/A

User Action: N/A

drawer_failed STATE_CHANGE ERROR no Message: Drawer {0} state is FAILED.

Description: The drawer state is reported as FAILED.

Cause: The mmlsenclosure all -L -Y command reports that the


drawer is in FAILED state.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

drawer_ok STATE_CHANGE INFO no Message: Drawer {0} is OK.

Description: The drawer state is OK.

Cause: N/A

User Action: N/A

drive_firmware_notavail STATE_CHANGE WARNING no Message: The firmware level of drive {0} is not available.

Description: Zero or insufficient information about the drive firmware is


available. An update might solve the problem.

Cause: The mmlsfirmware -Y command reports no information
about the drive firmware level.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. Follow the maintenance
procedures for updating the drive firmware. If the issue persists, then
contact IBM support.

drive_firmware_ok STATE_CHANGE INFO no Message: The firmware level of drive {0} is correct.

Description: The firmware level of the drive is correct.

Cause: N/A

User Action: N/A

drive_firmware_wrong STATE_CHANGE WARNING no Message: The firmware level of drive {0} is wrong.

Description: The firmware level of the drive is not correct. The drive
firmware needs an update.

Cause: The mmlsfirmware -Y command reports that the drive firmware


is not up to date.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. Follow the maintenance
procedures for updating the drive firmware. If the issue persists, then
contact IBM support.

enclosure_data STATE_CHANGE INFO no Message: Enclosure data is found.

Description: The enclosure data is available.

Cause: N/A

User Action: N/A

enclosure_firmware_notavail STATE_CHANGE WARNING no Message: The firmware level of enclosure {0} is not available.

Description: Zero or insufficient information about the enclosure firmware


is available. An update might solve the problem.

Cause: The mmlsfirmware -Y command does not report any information


about the enclosure firmware level.

User Action: Issue the mmlsfirmware command. If the issue persists,


then contact IBM support.

enclosure_firmware_ok STATE_CHANGE INFO no Message: The firmware level of enclosure {0} is correct.

Description: The firmware level of the enclosure is correct.

Cause: N/A

User Action: N/A

enclosure_firmware_unknown STATE_CHANGE WARNING no Message: The firmware level of enclosure {0} is unknown.

Description: The enclosure firmware is of an unknown version. An update


might solve the problem.

Cause: The mmlsfirmware -Y command does not report any enclosure


firmware information.

User Action: Issue the mmlsfirmware command. For more information on


how to update the enclosure firmware, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. If the issue persists, then
contact IBM support.

enclosure_firmware_wrong STATE_CHANGE WARNING no Message: The firmware level of enclosure {0} is wrong.

Description: The firmware level of the enclosure is not up to date. The


enclosure firmware needs an update.

Cause: The mmlsfirmware -Y command reports that the enclosure


firmware is not up to date.

User Action: Issue the mmlsfirmware command. If the warning message


persists, see the IBM Storage Scale: Problem Determination Guide of the
relevant system.

enclosure_found INFO_ADD_ENTITY INFO no Message: Enclosure {0} is found.

Description: A GNR enclosure, which is listed in the IBM Storage Scale


configuration, was detected.

Cause: N/A

User Action: N/A

enclosure_needsservice STATE_CHANGE WARNING no Message: Enclosure {0} needs service.

Description: The enclosure needs service.

Cause: The mmlsenclosure all -L -Y command reports that the


enclosure needs service.

User Action: For more information, contact IBM support.

enclosure_ok STATE_CHANGE INFO no Message: Enclosure {0} is OK.

Description: The enclosure state is OK.

Cause: N/A

User Action: N/A

enclosure_unknown STATE_CHANGE WARNING no Message: Enclosure state {0} is unknown.

Description: The enclosure state is unknown.

Cause: The mmlsenclosure all -L -Y command reports that the


enclosure was not detected.

User Action: Restart the system monitor by issuing the mmsysmoncontrol


restart command.

enclosure_vanished INFO_DELETE_ENTITY INFO no Message: Enclosure {0} has vanished.

Description: A GNR enclosure, which was previously listed in the IBM


Storage Scale configuration, is no longer detected. This can be a valid
situation.

Cause: A GNR enclosure, which was previously listed in the IBM Storage
Scale configuration, is no longer found.

User Action: Run the mmlsenclosure all -L command to verify that


all expected enclosures exist in the listing of the IBM Storage Scale
configuration.

esm_absent STATE_CHANGE WARNING no Message: ESM {0} is absent.

Description: The Environmental Service Module (ESM) is absent or not


installed.

Cause: The mmlsenclosure all -L -Y command reports that the ESM


module is absent.

User Action: Check whether the ESM is installed and operational. For more
information, see the IBM Storage Scale: Problem Determination Guide of the
relevant system.

esm_failed STATE_CHANGE WARNING service ticket Message: ESM {0} is FAILED.

Description: The Environmental Service Module (ESM) state is FAILED.

Cause: The mmlsenclosure all -L -Y command reports that the ESM


has failed.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

esm_ok STATE_CHANGE INFO no Message: ESM {0} is OK.

Description: The ESM state is OK.

Cause: N/A

User Action: N/A

expander_absent STATE_CHANGE WARNING no Message: Expander {0} is absent.

Description: The expander is absent.

Cause: The mmlsenclosure all -L -Y command reports that the


expander is absent.

User Action: Verify that the expander is correctly installed.

expander_failed STATE_CHANGE ERROR service ticket Message: Expander {0} is FAILED.

Description: The expander state is reported as FAILED.

Cause: The mmlsenclosure all -L -Y command reports the expander


has failed.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

expander_ok STATE_CHANGE INFO no Message: Expander {0} is OK.

Description: The expander state is OK.

Cause: N/A

User Action: N/A

fan_absent STATE_CHANGE WARNING service ticket Message: Fan {0} is absent.

Description: A fan is absent.

Cause: The mmlsenclosure all -L command reports that a fan is


absent.

User Action: Check the enclosure. Insert or replace fan. If the problem
remains, then contact IBM support.

fan_failed STATE_CHANGE WARNING service ticket Message: Fan {0} state is FAILED.

Description: A fan state is FAILED.

Cause: The mmlsenclosure all -L command reports that a fan state


is FAILED.

User Action: Replace the fan. Contact IBM support for a service action.

fan_fault_indicated STATE_CHANGE WARNING service ticket Message: Fan {0} has a fault.

Description: A fan has a fault.

Cause: The mmlsenclosure all -L command reports that a fan has a


fault.

User Action: For more information, issue the mmlsenclosure all -L


command. Check the enclosure, and insert or replace the fan as needed. If
the problem remains, then contact IBM support.

fan_ok STATE_CHANGE INFO no Message: Fan {0} is OK.

Description: Fan state is OK.

Cause: N/A

User Action: N/A

fan_speed_high STATE_CHANGE WARNING service ticket Message: Fan {0} speed is too high.

Description: Fan speed is out of tolerance because it is too high.

Cause: The mmlsenclosure all -L -Y command reports that the fan


speed is too high.

User Action: For more information, check the enclosure cooling module
LEDs for fan faults.

fan_speed_low STATE_CHANGE WARNING service ticket Message: Fan {0} speed is too low.

Description: Fan speed is out of tolerance because it is too low.

Cause: The mmlsenclosure all -L -Y command reports that the fan


speed is too low.

User Action: For more information, check the enclosure cooling module
LEDs for fan faults.

no_enclosure_data STATE_CHANGE WARNING no Message: Enclosure data and enclosure state information cannot be
queried.

Description: Correct enclosure details cannot be queried.

Cause: The mmlsenclosure all -L -Y command has failed to report


any enclosure data, or the data is inconsistent.

User Action: Run the mmlsenclosure all -L command to check for


any enclosure issues. You must also verify that the pemsmod is loaded by
issuing the 'lsmod' command.

power_high_current STATE_CHANGE WARNING service ticket Message: Power supply {0} reports high current.

Description: The DC power supply current is greater than the threshold.

Cause: The mmlsenclosure all -L -Y command reports high current


for a power supply.

User Action: Issue the mmlsenclosure all -L -Y command to check


the details. For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

power_high_voltage STATE_CHANGE WARNING service ticket Message: Power supply {0} reports high voltage.

Description: The DC power supply voltage is greater than the threshold.

Cause: The mmlsenclosure all -L -Y command returns high voltage


for a power supply.

User Action: Issue the mmlsenclosure all -L command to check


the details. For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

power_no_power STATE_CHANGE WARNING no Message: Power supply {0} has no power.

Description: The power supply has no power. It might be switched off or


has no input AC.

Cause: The hardware monitor reports that power is not being supplied to
the power supply.

User Action: Ensure that the power supply is switched on or connected to


AC. Check cable.

power_supply_absent STATE_CHANGE WARNING no Message: Power supply {0} is missing.

Description: A power supply is missing or absent.

Cause: The mmlsenclosure all -L -Y command reports that a power


supply is absent.

User Action: Check whether the power supply is installed and operational.
For more information, see the IBM Storage Scale: Problem Determination
Guide of the relevant system.

power_supply_config_mismatch STATE_CHANGE_EXTERNAL ERROR service ticket Message: Enclosure has an inconsistent power supply configuration.

Description: Inconsistent power supply configuration.

Cause: The power supplies in the enclosure do not fit to each other.

User Action: Verify that all power supplies in all canisters have the same
specification. The event can be manually cleared by using the mmhealth
event resolve power_supply_config_mismatch command.

power_supply_config_ok STATE_CHANGE_EXTERNAL INFO no Message: Enclosure has a correct power supply configuration.

Description: The power supply configuration is OK.

Cause: N/A

User Action: N/A

power_supply_failed STATE_CHANGE WARNING service ticket Message: Power supply {0} is FAILED.

Description: A power supply has failed.

Cause: The hardware monitor reports that a power supply has failed.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

power_supply_off STATE_CHANGE WARNING no Message: Power supply {0} is off.

Description: A power supply is off.

Cause: The hardware monitor reports that the power supply is turned off.

User Action: Make sure that the power supply continues to get power, for
example, that the power cable is plugged in. If the problem persists, see the IBM
Storage Scale: Problem Determination Guide of the relevant system.

power_supply_ok STATE_CHANGE INFO no Message: Power supply {0} is OK.

Description: The power supply state is OK.

Cause: N/A

User Action: N/A

power_switched_off STATE_CHANGE WARNING no Message: Power supply {0} is switched off.

Description: A power supply is switched off.

Cause: The hardware monitor reports that a power supply is switched off.
The requested on-bit is off, which means that the power supply is not
manually switched on or is missing by setting the requested on-bit.

User Action: Switch on the power supply and check whether it is


operational. For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

sideplane_failed STATE_CHANGE ERROR no Message: Sideplane {0} has failed.

Description: A failure of the sideplane is reported.

Cause: The mmlsenclosure all -L -Y command reports that the


sideplane has failed.

User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.

sideplane_ok STATE_CHANGE INFO no Message: Sideplane {0} is OK.

Description: The sideplane state is OK.

Cause: N/A

User Action: N/A

temp_bus_failed STATE_CHANGE WARNING service ticket Message: Temperature sensor {0} I2C bus has failed.

Description: Temperature sensor I2C bus has failed.

Cause: The mmlsenclosure all -L -Y command reports that the bus of


a temperature sensor has failed.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

temp_high_critical STATE_CHANGE WARNING no Message: Temperature sensor {0} measured a high temperature value.

Description: The measured temperature has exceeded the high critical


threshold.

Cause: The mmlsenclosure all -L -Y command reports a critical high


temperature for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

temp_high_warn STATE_CHANGE WARNING no Message: Temperature sensor {0} has measured a high temperature value.

Description: The measured temperature has exceeded the high warning


threshold.

Cause: The mmlsenclosure all -L -Y command reports a high


temperature for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

temp_low_critical STATE_CHANGE WARNING no Message: Temperature sensor {0} has measured a temperature that is less than
the low critical value.

Description: The measured temperature is less than the lower critical


threshold.

Cause: The mmlsenclosure all -L -Y command reports a critical low


temperature for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

temp_low_warn STATE_CHANGE WARNING no Message: Temperature sensor {0} has measured a temperature that is less than
the low warning value.

Description: The measured temperature is less than the lower warning


threshold.

Cause: The mmlsenclosure all -L -Y command reports a


temperature is less than the low warning threshold.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

temp_sensor_failed STATE_CHANGE WARNING service ticket Message: Temperature sensor {0} has failed.

Description: A temperature sensor might be broken.

Cause: The mmlsenclosure all -L -Y command reports that a


temperature sensor has failed.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

temp_sensor_ok STATE_CHANGE INFO no Message: Temperature sensor {0} is OK.

Description: The temperature sensor state is OK.

Cause: N/A

User Action: N/A

voltage_bus_failed STATE_CHANGE WARNING service ticket Message: Voltage sensor {0} communication with the I2C bus has failed.

Description: The voltage sensor cannot communicate with the I2C bus.

Cause: The mmlsenclosure all -L -Y command reports a bus failure


for a voltage sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

voltage_high_critical STATE_CHANGE WARNING no Message: Voltage sensor {0} measured a high voltage value.

Description: The voltage has exceeded the actual high critical threshold for
at least one sensor.

Cause: The mmlsenclosure all -L -Y command reports high critical


voltage for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

voltage_high_warn STATE_CHANGE WARNING no Message: Voltage sensor {0} has measured a high voltage value.

Description: The voltage has exceeded the actual high warning threshold
for at least one sensor.

Cause: The mmlsenclosure all -L -Y command reports high voltage


for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

voltage_low_critical STATE_CHANGE WARNING no Message: Voltage sensor {0} has measured a critical low voltage value.

Description: The voltage has fallen under the lower critical threshold.

Cause: The mmlsenclosure all -L -Y command reports critical low


voltage for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

voltage_low_warn STATE_CHANGE WARNING no Message: Voltage sensor {0} has measured a low voltage value.

Description: The voltage has fallen under the lower warning threshold.

Cause: The mmlsenclosure all -L -Y command reports low
voltage for a sensor.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

voltage_sensor_failed STATE_CHANGE WARNING service ticket Message: Voltage sensor {0} is FAILED.

Description: A voltage sensor might be broken.

Cause: The mmlsenclosure all -L -Y command reports that a voltage


sensor state is FAILED.

User Action: Issue the mmlsenclosure all -L command. If the


warning message persists, then contact IBM support.

voltage_sensor_ok STATE_CHANGE INFO no Message: Voltage sensor {0} is OK.

Description: The voltage sensor state is OK.

Cause: N/A

User Action: N/A
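
Note: Almost every enclosure event above is raised from the output of the same two commands, so collecting that output is usually the first step. A minimal sketch:

   # Show the state of all enclosure components (fans, power supplies, sensors, ESMs)
   mmlsenclosure all -L
   # Show the reported firmware levels for enclosures, drives, and adapters
   mmlsfirmware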

Encryption events
The following table lists the events that are created for the Encryption component.
Table 80. Events for the Encryption component

Event | Event Type | Severity | Call Home | Details

encryption_configured INFO_ADD_ENTITY INFO no Message: New encryption provider for {id} is configured.

Description: A new encryption provider is configured.

Cause: N/A

User Action: N/A

encryption_removed INFO_DELETE_ENTITY INFO no Message: An encryption provider for {id} is removed.

Description: An encryption provider is removed.

Cause: N/A

User Action: N/A

rkmconf_backend_err STATE_CHANGE ERROR no Message: RKM backend server {0} returned an unrecoverable error {1}.

Description: The RKM backend server failed.

Cause: The RKM backend server encountered an unrecoverable error.

User Action: Ensure that the specification of the backend key management
server in the RKM instance is correct and the key server is running on the
specified host. The event can be manually cleared by using the mmhealth
event resolve rkmconf_backend_err <event id> command.

rkmconf_backenddown_err STATE_CHANGE ERROR no Message: The RKM backend server {0} cannot be reached.

Description: The RKM backend server cannot be reached.

Cause: The RKM backend server is down or unreachable.

User Action: Ensure that the specification of the backend key management
server in the RKM instance is correct and the key server is running
on the specified host. The event can be manually cleared by using
the mmhealth event resolve rkmconf_backenddown_err <event
id> command.

rkmconf_certexp_err STATE_CHANGE ERROR no Message: Key server certificate error: {0}

Description: The RKM client or server certificate expired.

Cause: The client or server certificate for the key server expired.

User Action: Follow the documented procedure to update the key server
and/or RKM configuration with a new client or server certificate. The
event can be manually cleared by using the mmhealth event resolve
rkmconf_certexp_err command.

rkmconf_certexp_ok STATE_CHANGE INFO no Message: No expired certificates are encountered.

Description: Certificates that are related to RKM backend configuration are


valid.

Cause: N/A

User Action: N/A

rkmconf_certexp_warn TIP TIP no Message: Key server certificate warning: {0}

Description: The RKM client or server certificate can expire soon.

Cause: The client or server certificate for the key server approaches its
expiration time.

User Action: Follow the documented procedure to update the key server
and/or RKM configuration with a new client or server certificate. The
event can be manually cleared by using the mmhealth event resolve
rkmconf_certexp_warn command.

rkmconf_certwarn_ok STATE_CHANGE INFO no Message: No certificates that are approaching the expiration time are
encountered.

Description: Certificates that are related to RKM backend configuration are


valid.

Cause: N/A

User Action: N/A

rkmconf_configuration_err STATE_CHANGE ERROR no Message: RKM configuration error: {0}

Description: The content of the RKM configuration file cannot be parsed


correctly.

Cause: The RKM configuration file contains incorrect data.

User Action: Ensure that the content of the RKM configuration file
conforms with the documented format (regular setup), or that the
arguments that are provided to the mmkeyserv command conform to the
documentation (simplified setup). The event can be manually cleared by
using the mmhealth event resolve rkmconf_configuration_err
command.

rkmconf_enckey_ok STATE_CHANGE INFO no Message: Event for {id} is marked as resolved.

Description: The RKM backend configuration for encryption key retrieval is


working correctly.

Cause: N/A

User Action: N/A

rkmconf_filenotfound_err STATE_CHANGE ERROR no Message: The mmfsd daemon is not able to read the RKM configuration
file.

Description: Cannot read the RKM configuration file.

Cause: The file does not exist or its content is not valid.

User Action: Check that either the '/var/mmfs/etc/RKM.conf' exists


(regular setup only), or the file system encryption was enabled by using
the simplified setup. The event can be manually cleared by using the
mmhealth event resolve rkmconf_filenotfound_err command.

rkmconf_fileopen_err STATE_CHANGE ERROR no Message: Cannot open RKM configuration file for reading {0}.

Description: Cannot open the RKM configuration file for reading.

Cause: The RKM configuration file exists but cannot be opened for reading.

User Action: Check that, as root, you can open the RKM configuration
file with a text editor. The event can be manually cleared by using the
mmhealth event resolve rkmconf_fileopen_err command.

rkmconf_fileread_err STATE_CHANGE ERROR no Message: Cannot read RKM configuration file {0}.

Description: Cannot read the RKM configuration file.

Cause: The content of the RKM configuration file might be corrupted.

User Action: Check that, as root, you can open the RKM configuration
file with a text editor. The event can be manually cleared by using the
mmhealth event resolve rkmconf_fileread_err command.

rkmconf_getkey_err STATE_CHANGE ERROR no Message: MEK {0} is not available from RKM backend server {1}.

Description: Cannot get key from RKM backend server.

Cause: Failed to retrieve the MEK from the RKM backend servers.

User Action: Ensure that the MEK specified by the UUID provided is
available from the RKM specified by using the mmkeyserv key show
command. The event can be manually cleared by using the mmhealth
event resolve rkmconf_getkey_err <event id> command.

rkmconf_instance_err STATE_CHANGE ERROR no Message: RKM instance error: {0}

Description: RKM instance configuration error.

Cause: The RKM instance configuration is not correct. One of the attributes
is not valid or out of range.

User Action: Ensure that the definition of the RKM instance is


correct and its attributes conform to their defined format. The event
can be manually cleared by using the mmhealth event resolve
rkmconf_instance_err command.

rkmconf_keystore_err STATE_CHANGE ERROR no Message: Keystore file error: {0}

Description: Keystore file error.

Cause: The keystore file for the key management server is not accessible
or its content is not valid, or the ownership and/or permissions are too
permissive.

User Action: Ensure that the content of the keystore file conforms with
the documented format and that only root can read and write the file. The
event can be manually cleared by using the mmhealth event resolve
rkmconf_keystore_err command.

rkmconf_ok STATE_CHANGE INFO no Message: The RKM backend configuration is correct and working as
expected.

Description: The RKM backend configuration is working correctly.

Cause: N/A

User Action: N/A

rkmconf_permission_err STATE_CHANGE ERROR no Message: Incorrect ownership and/or file system permissions for RKM
configuration file {0}.

Description: The RKM configuration file has incorrect file system


permissions.

Cause: The RKM configuration file was created with incorrect file system
permissions.

User Action: Check that the RKM.conf file is owned by root:root,


and has read and write permission for owner only. The event
can be manually cleared by using the mmhealth event resolve
rkmconf_permission_err command.
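
A minimal sketch of the permission check and fix described above, using standard
POSIX commands; the path assumes the regular setup:

   # Inspect the current ownership and permissions
   ls -l /var/mmfs/etc/RKM.conf

   # Restrict the file to root, with read and write permission for the owner only
   chown root:root /var/mmfs/etc/RKM.conf
   chmod 600 /var/mmfs/etc/RKM.conf

   # Clear the event once the permissions are corrected
   mmhealth event resolve rkmconf_permission_err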

rkm_no_access STATE_CHANGE ERROR no Message:

Description:

Cause:

User Action:

rkm_duplicate STATE_CHANGE ERROR no Message: RKM.conf contains duplicate RKM IDs {id}.

Description: RKM.conf contains duplicate RKM IDs.

Cause: RKM.conf contains duplicate RKM IDs.

User Action: Verify that the rkmid is unique in all the stanzas in all the
RKM.conf files.

rkm_cert_expired STATE_CHANGE ERROR no Message: Key server certificate error: {id}.

Description: The RKM client or server certificate expired.

Cause: The client or server certificate for the key server expired.

User Action: Follow the documented procedure to update the key server
and/or RKM configuration with a new client or server certificate.

rkm_keyring STATE_CHANGE ERROR no Message: Could not open keyring file: {id}.

Description: The RKM client is not able to open the keyring file.

Cause: The RKM client is not able to open the keyring file.

User Action: Ensure that the content of the keystore file conforms with the
documented format and that only root can read and write the file.

rkm_no_label STATE_CHANGE FAILED no Message: Key Label not found: {id}.

Description: Key Label not found.

Cause: Key Label not found.

User Action: Verify that a key with the specified label exists in the client
keystore.


rkm_ok STATE_CHANGE FAILED no Message: All checks are OK for {id}.

Description: RKM.conf setup is OK.

Cause: N/A

User Action: N/A

rkm_passphrase STATE_CHANGE ERROR no Message: Incorrect passphrase: {id}.

Description: Incorrect passphrase.

Cause: Incorrect passphrase.

User Action: Verify that the passphrase for the client keystore is correct.

File Audit Logging events


The following table lists the events that are created for the File Audit Logging component.
Table 81. Events for the File Audit Logging component

Event | Event Type | Severity | Call Home | Details

auditc_auditlogfile STATE_CHANGE ERROR no Message: Unable to open or append to the auditLog {1} files for file system
{0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check whether the audited file system is mounted on the
node.

auditc_auth_failed STATE_CHANGE ERROR no Message: Authentication error encountered in audit consumer for group {1}
for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: For more information, check the /var/adm/ras/mmaudit.log.

auditc_brokerconnect STATE_CHANGE ERROR no Message: Unable to connect to Kafka broker server {1} for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Run the mmmsgqueue status -v command to check


message queue status and ensure that the message queue is in a HEALTHY
state. You might disable and re-enable the message queue in case the
message queue remains unhealthy.
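
For example, the message queue check suggested above can be run as follows; only
recycle the queue if it stays unhealthy:

   # Check the message queue status in verbose mode
   mmmsgqueue status -v

   # If the message queue remains unhealthy, disable and re-enable it
   mmmsgqueue disable
   mmmsgqueue enable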

auditc_compress STATE_CHANGE WARNING no Message: Could not compress for audit log file {1}.

Description: Warning encountered in audit consumer.

Cause: N/A

User Action: For more information, check the /var/adm/ras/mmaudit.log
file. Ensure that the audit fileset is mounted and the file
system is in a HEALTHY state.

auditc_createkafkahandle STATE_CHANGE ERROR no Message: Failed to create audit consumer Kafka handle for file system {0}.

Description: Warning encountered in audit consumer.

Cause: N/A

User Action: Ensure that gpfs.librdkafka packages are installed. Then,


enable or disable audit with the mmaudit enable/disable command.
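
A quick way to verify the prerequisite named above on an RPM-based node (the query
command differs on other package managers):

   # Confirm that the librdkafka package shipped with IBM Storage Scale is installed
   rpm -q gpfs.librdkafka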


auditc_err STATE_CHANGE ERROR no Message: Error encountered in audit consumer for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: For more information, check the /var/adm/ras/mmaudit.log.

auditc_flush_auditlogfile STATE_CHANGE ERROR no Message: Unable to flush the auditLog {1} files for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check whether the file system is mounted on the node.

auditc_flush_errlogfile STATE_CHANGE ERROR no Message: Unable to flush the errorLog file for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check whether the file system is mounted on the node.

auditc_found INFO_ADD_ENTITY INFO no Message: Audit consumer for file system {0} was found.

Description: An audit consumer listed in the IBM Storage Scale


configuration was detected.

Cause: N/A

User Action: N/A

auditc_initlockauditfile STATE_CHANGE ERROR no Message: Failed to indicate to systemctl on successful consumer startup
sequence for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Disable and re-enable auditing by using the mmaudit


command.

auditc_loadkafkalib STATE_CHANGE ERROR no Message: Unable to initialize file audit consumer for file system {0}. Failed
to load librdkafka library.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check the installation of librdkafka libraries and retry the
mmaudit command.

auditc_mmauditlog STATE_CHANGE ERROR no Message: Unable to append to file {1} for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check that the audited file system is mounted on the node.
Ensure the file system to be audited is in a HEALTHY state and then, retry
by using the mmaudit disable/enable command.

auditc_nofs STATE_CHANGE INFO no Message: No file system is audited.

Description: No file system is audited.

Cause: N/A

User Action: N/A

auditc_offsetfetch STATE_CHANGE ERROR no Message: Failed to fetch topic ({1}) offset for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check on the topicName by using the mmmsgqueue list
--topics command. If topic is listed, then try restarting consumers with
the mmaudit all consumerStop/consumerStart -N <node(s)>
command. If the problem persists, then try disabling and re-enabling audit
with the mmaudit enable/disable command.
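
As a sketch of the recovery steps above; node1 and node2 are placeholder node
names:

   # Confirm that the audit topic still exists
   mmmsgqueue list --topics

   # Restart the consumers on the affected nodes
   mmaudit all consumerStop -N node1,node2
   mmaudit all consumerStart -N node1,node2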


auditc_offsetstore STATE_CHANGE ERROR no Message: Failed to store an offset for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check on the topicName by using the mmmsgqueue list
--topics command. If topic is listed, then try restarting consumers with
the mmaudit all consumerStop/consumerStart -N <node(s)>
command. If the problem persists, then try disabling and re-enabling audit
with the mmaudit enable/disable command.

auditc_ok STATE_CHANGE INFO no Message: File Audit consumer for file system {0} is running.

Description: File Audit consumer is running.

Cause: N/A

User Action: N/A

auditc_service_failed STATE_CHANGE ERROR no Message: File audit consumer {1} for file system {0} is not running.

Description: File audit consumer service is not running.

Cause: N/A

User Action: For more information, use the systemctl status
<consumername> command and see the /var/adm/ras/mmaudit.log.
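
A hedged example of the checks mentioned above; the consumer unit name is a
placeholder that depends on the audited file system:

   # Query the systemd state of the audit consumer service
   systemctl status <consumername>

   # Review the audit log for the underlying error
   tail -n 100 /var/adm/ras/mmaudit.log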

auditc_service_ok STATE_CHANGE INFO no Message: File audit consumer service for file system {0} is running.

Description: File audit consumer service is running.

Cause: N/A

User Action: N/A

auditc_setconfig STATE_CHANGE ERROR no Message: Failed to set configuration for audit consumer for file system {0}
and group {1}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: For more information, check the /var/adm/ras/mmaudit.log
file. Attempt to fix by disabling and re-enabling the file
audit logging by using the mmaudit disable/enable command.

auditc_setimmutablity STATE_CHANGE WARNING no Message: Could not set immutability on for auditLogFile {1}.

Description: Warning encountered in audit consumer.

Cause: N/A

User Action: For more information, check the /var/adm/ras/mmaudit.log
file. Ensure that the audit fileset is mounted and the file
system is in a HEALTHY state.

auditc_topicsubscription STATE_CHANGE ERROR no Message: Failed to subscribe to topic ({1}) for file system {0}.

Description: Error encountered in audit consumer.

Cause: N/A

User Action: Check on the topicName and initial configuration details


by using the mmmsgqueue list --topics command and retry the
mmaudit command.

auditc_vanished INFO_DELETE_ENTITY INFO no Message: Audit consumer for file system {0} has vanished.

Description: An audit consumer that was listed in the IBM Storage Scale
configuration was removed.

Cause: N/A

User Action: N/A


auditc_warn STATE_CHANGE WARNING no Message: Warning encountered in audit consumer for file system {0}.

Description: Warning encountered in audit consumer.

Cause: N/A

User Action: For more information, check the /var/adm/ras/mmaudit.log.

auditp_auth_err STATE_CHANGE ERROR no Message: Error obtaining authentication credentials or configuration for
producer; error message: {2}.

Description: Authentication error encountered in event producer.

Cause: N/A

User Action: Verify that the file audit log is properly configured. Disable
and enable the file audit log by using the mmmsgqueue and the mmaudit
commands.

auditp_auth_info TIP TIP no Message: Authentication or configuration information is not present or


outdated. Request to load credentials or configuration is started and new
credentials or configuration is used on next event. Message: {2}.

Description: Event producer has no or outdated authentication


information.

Cause: N/A

User Action: N/A

auditp_auth_warn STATE_CHANGE WARNING no Message: Authentication credentials for Kafka could not be obtained. An
attempt to update credentials is performed later. Message: {2}.

Description: Event producer failed to obtain authentication information.

Cause: N/A

User Action: N/A

auditp_create_err STATE_CHANGE ERROR no Message: Error encountered while creating a new (loading or configuring)
event producer; error message: {2}.

Description: error encountered when creating a new event producer.

Cause: N/A

User Action: Verify that the correct gpfs.librdkafka is installed. For more
information, check /var/adm/ras/mmfs.log.latest.

auditp_found INFO_ADD_ENTITY INFO no Message: New event producer for file system {2} was configured.

Description: A new event producer was configured.

Cause: N/A

User Action: N/A

auditp_log_err STATE_CHANGE ERROR no Message: Error opening or writing to event producer log file.

Description: Log error encountered in event producer.

Cause: N/A

User Action: Verify that the log directory and /var/adm/ras/mmaudit.log
file are present and writeable. For more information,
check /var/adm/ras/mmfs.log.latest.

auditp_msg_send_err STATE_CHANGE WARNING no Message: Failed to send message to target sink for file system {2}; error
message: {3}.

Description: Error sending messages encountered in event producer.

Cause: N/A

User Action: Check connectivity to Kafka broker and topic and check
whether the broker can accept new messages. For more information,
check the /var/adm/ras/mmfs.log.latest and /var/adm/ras/
mmmsgqueue.log files.


auditp_msg_send_stop STATE_CHANGE ERROR no Message: Failed to send more than {2} messages to target sink. Producer is
now shutdown. No more messages are sent.

Description: Producer reached the error threshold. The producer no longer


attempts to send messages.

Cause: N/A

User Action: To re-enable events, and disable and re-enable file audit
logging, run the mmaudit <device> disable/enable command. If file
audit logging fails again, then you might need to disable and re-enable
message queue. Run the mmmsgqueue enable/disable command, and
then enable the file audit logging. If file audit logging continues to fail,
then run the mmmsgqueue config --remove command. Now, enable
the message queue and then enable the file audit logging.
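
The escalation path described above can be followed step by step as sketched below,
where <device> is the audited file system:

   # First attempt: disable and re-enable file audit logging
   mmaudit <device> disable
   mmaudit <device> enable

   # If that fails, recycle the message queue, then re-enable auditing
   mmmsgqueue disable
   mmmsgqueue enable
   mmaudit <device> enable

   # Last resort: remove the message queue configuration, then enable the
   # message queue and file audit logging again
   mmmsgqueue config --remove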

auditp_msg_write_err STATE_CHANGE WARNING no Message: Failed to write message to Audit log for file system {2}; error
message: {3}.

Description: Error writing messages encountered in event producer.

Cause: N/A

User Action: Ensure that the Audit fileset is healthy. For more information,
check the /var/adm/ras/mmfs.log.latest and /var/adm/ras/
mmmsgqueue.log.

auditp_msg_write_stop STATE_CHANGE ERROR no Message: Failed to write more than {2} messages to Audit log. Producer is
now shutdown. No more messages are sent.

Description: Producer reached the error threshold. The producer no longer


attempts to send messages.

Cause: N/A

User Action: To re-enable events, run the mmaudit all
producerRestart -N <Node> command. If that does not succeed,
then ensure that the Audit fileset is healthy and disable or enable the
file audit logging with the mmaudit command. For more information,
check the /var/adm/ras/mmfs.log.latest and /var/adm/ras/
mmmsgqueue.log.

auditp_msgq_unsupported STATE_CHANGE ERROR no Message: Message queue is no longer supported and no clustered watch
folder or file audit logging commands can be run until the message queue is
removed.

Description: Message queue is no longer supported in IBM Storage Scale


and must be removed.

Cause: N/A

User Action: For more information, see the mmmsgqueue config
--remove-msgqueue command page.

auditp_ok STATE_CHANGE INFO no Message: Event producer for file system {2} is OK.

Description: Event producer is OK.

Cause: N/A

User Action: N/A

auditp_vanished INFO_DELETE_ENTITY INFO no Message: An event producer for file system {2} was removed.

Description: An event producer was removed.

Cause: N/A

User Action: N/A



File system events
The following table lists the events that are created for the File System component.
Table 82. Events for the file system component

Event | Event Type | Severity | Call Home | Details

clear_mountpoint_tip TIP INFO no Message: File system {0} was unmounted or mounted at its default
mountpoint.

Description: Clear any previous tip for this file system about using a non-
default mountpoint.

Cause: N/A

User Action: N/A

desc_disk_quorum_fail STATE_CHANGE WARNING no Message: Sufficient healthy descriptor disks are not found for file system
{0} quorum.

Description: Sufficient healthy descriptor disks are not found.

Cause: Sufficient healthy descriptor disks are not found. No quorum is


found for the file system.

User Action: Check the health state of disks, which are declared as
descriptor disks for the file system. An insufficient number of healthy
descriptor disks might lead to a data access loss. For more information, see
the 'Disk issues' section in the IBM Storage Scale: Problem Determination
Guide.

desc_disk_quorum_ok STATE_CHANGE INFO no Message: Sufficient healthy descriptor disks are found for file system {0}
quorum.

Description: Sufficient healthy descriptor disks are found.

Cause: N/A

User Action: N/A

exported_fs_available STATE_CHANGE INFO no Message: The file system {0} used for exports is available.

Description: A file system used for export is available.

Cause: N/A

User Action: N/A

exported_path_available STATE_CHANGE INFO no Message: All NFS or SMB exported paths with undeclared mount points are
available.

Description: All NFS or SMB exported paths are available, which may
include automounted folders.

Cause: N/A

User Action: N/A

exported_path_unavail STATE_CHANGE WARNING no Message: For at least one NFS or SMB export ({0}), no GPFS file system is
mounted at the exported path.

Description: For at least one NFS or SMB exported path, the intended file
system is unclear or unmounted. Those exports cannot be used and can
lead to a failure of the service.

Cause: At least one NFS or SMB exported path does not point to a mounted
GPFS file system according to /proc/mounts. The intended file system is
unknown because the export path does not match the default mountpoint
of any GPFS file system due to the use of autofs or bind-mounts.

User Action: Check the mount states for NFS and SMB exported file
systems. This message can be related to autofs or bind-mounted file
systems.

filesystem_found INFO_ADD_ENTITY INFO no Message: File system {0} was found.

Description: A file system listed in the IBM Storage Scale configuration was
detected.

Cause: N/A

User Action: N/A


filesystem_vanished INFO_DELETE_ENTITY INFO no Message: File system {0} has vanished.

Description: A file system listed in the IBM Storage Scale configuration was
not detected.

Cause: A file system, listed in the IBM Storage Scale configuration as


mounted before, is not found. This can be a valid situation.

User Action: To verify that all expected file systems are mounted, run the
mmlsmount all_local -L command.

fs_forced_unmount STATE_CHANGE ERROR no Message: The file system {0} was {1} (forced) to unmount.

Description: A file system was forced to unmount by IBM Storage Scale.

Cause: A situation, like a SGPanic or a quorum loss can initiate the


unmount process.

User Action: Check error messages and error log for further details. For
more information, see the 'File system forced unmount' topic in the IBM
Storage Scale: Problem Determination Guide.

fs_maintenance_mode STATE_CHANGE INFO no Message: File system {id} is set to maintenance mode.

Description: The file system is in maintenance mode.

Cause: N/A

User Action: N/A

fs_preunmount_panic STATE_CHANGE ERROR no Message: The file system {0} is unmounted because of an SGPanic
situation.

Description: A file system is unmounted because of an SGPanic situation.

Cause: A file system is set to preunmount state because of an SGPanic


situation.

User Action: For more information, check error messages and error log in
the mmfs.log file.

fs_remount_mount STATE_CHANGE_EXTERNAL INFO no Message: The file system {0} was mounted {1}.

Description: A file system was mounted.

Cause: N/A

User Action: N/A

fs_unmount_info INFO_EXTERNAL INFO no Message: The file system {0} was unmounted {1}.

Description: A file system was unmounted.

Cause: A file system was unmounted.

User Action: N/A

fs_working_mode STATE_CHANGE INFO no Message: File system {id} is not in maintenance mode.

Description: The file system is not in maintenance mode.

Cause: N/A

User Action: N/A

fserrallocblock STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Corrupted alloc segment is detected while attempting to alloc
disk block.

Cause: A file system corruption is detected.

User Action: Check the error message and the mmfs.log.latest log for
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.


fserrbadaclref STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: File references indicate invalid ACL.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbaddirblock STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Invalid directory block detected.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbaddiskaddrindex STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Bad disk index is detected in disk address.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbaddiskaddrsector STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Bad sector number in disk address or start sector plus length
would exceed the disk size.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbaddittoaddr STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Invalid ditto address is detected.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbadinodeorgen STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: A deleted inode has a dir entry, or the generation number does
not match the directory.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.


fserrbadinodestatus STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Inode status is bad, and is expected to indicate deleted
status.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbadptrreplications STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Invalid computed pointer indicates replication factors.

Cause: A file system corruption is detected.

User Action: Check error message for details and the mmfs.log.latest log
for further details. For more information, see the Checking and repairing
a file system topic in the IBM Storage Scale: Administration Guide. If the
file system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbadreplicationcounts STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Invalid current, or maximum data or metadata replication
counts.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrbadxattrblock STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Indicates invalid extended attribute block.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrcheckheaderfailed STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: CheckHeader returned an error.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrclonetree STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Invalid cloned in file tree structure.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.


fserrdeallocblock STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Corrupt allocated segment is detected while attempting to
deallocate disk block.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrdotdotnotfound STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Unable to locate a '..' entry.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrgennummismatch STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: The generation number found in the '..' entry does not match
the actual generation number of the parent directory.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinconsistentfilesetrootdir STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Inconsistent fileset or root directory, which means the fileset is
inUse and the root dir '..' points to itself.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinconsistentfilesetsnapshot STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Inconsistent fileset or snapshot records, which means fileset
snapList points to a SnapItem that does not exist.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinconsistentinode STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Size data in inode is inconsistent.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.


fserrindirectblock STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Indirect block header does not match info in inode.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrindirectionlevel STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Indicates invalid indirection level in inode.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinodecorrupted STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Code to catch infinite loop in the lfs layer in case of a
corrupted inode or directory entry.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinodenummismatch STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: The inode number found in the '..' entry does not match the
actual inode number of the parent directory.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinvalid STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Unknown error={2}.

Description: Unrecognized FSSTRUCT error received. Check IBM
documentation.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrinvalidfilesetmetadatarecord STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Indicates invalid fileset metadata record.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.


fserrinvalidsnapshotstates STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Indicates invalid snapshot states, which means more than
one in an inode space being emptied (SnapBeingDeletedOne).

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrsnapinodemodified STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Inode was modified without saving old content to shadow
inode file.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fserrvalidate STATE_CHANGE ERROR FTDC upload Message: FS={0},ErrNo={1},Msg={2}.

Description: Validation routine has failed on a disk read.

Cause: A file system corruption is detected.

User Action: Check error message and the mmfs.log.latest log for further
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.

fsstruct_fixed STATE_CHANGE INFO no Message: A file system {id} structure error has been marked as fixed.

Description: A file system structure error was declared or detected as


fixed.

Cause: N/A

User Action: N/A

ill_exposed_fs STATE_CHANGE WARNING no Message: The file system {0} has a data exposure risk as there are file(s)
where all replicas are on suspended disks, which makes it vulnerable to
potential data loss when a disk fails.

Description: A configuration change is causing the file system to have a
data exposure risk.

Cause: The mmfsadm eventsExporter get fs <filesystem>
command reports that the file system has a data exposure risk.

User Action: Run the mmrestripefs command against the file system.

ill_replicated_fs STATE_CHANGE WARNING no Message: The file system {0} is not properly replicated.

Description: A configuration change is causing the file system to no longer
be properly replicated.

Cause: The mmfsadm eventsExporter get fs <filesystem>
command reports that the file system is no longer being properly
replicated.

User Action: Run the mmrestripefs command against the file system.


ill_unbalanced_fs TIP TIP no Message: The file system {0} is not properly balanced.

Description: A configuration change is causing the file system to no longer
be properly balanced.

Cause: The mmfsadm eventsExporter get fs <filesystem>
command reports that the file system is no longer properly balanced.

User Action: Run the mmrestripefs command against the file system.
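
The three events above all point to mmrestripefs as the remediation. A hedged
example follows; gpfs0 is a placeholder file system name, and the option letters
should be verified against the mmrestripefs documentation for your release:

   # Restore the configured replication level (ill_replicated_fs, ill_exposed_fs)
   mmrestripefs gpfs0 -r

   # Rebalance data across the disks (ill_unbalanced_fs)
   mmrestripefs gpfs0 -b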

inode_high_error STATE_CHANGE ERROR no Message: The inode usage of fileset {id[1]} in file system {id[0]} reached a
nearly exhausted level. {0}.

Description: The inode usage in the fileset reached a nearly exhausted
level.

Cause: The inode usage in the fileset reached a nearly exhausted level.

User Action: Run the mmdf command and the mmlsfileset -i -L
commands to check the inode consumption. Run the mmchfileset
command to increase the inode space. Refer to the Monitoring > Events
page in the IBM Storage Scale GUI to fix this event with the help of a
directed maintenance procedure. Select the event from the list of events
and select Actions > Run Fix Procedure to launch the DMP.
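
A minimal command sketch for the inode check and expansion described above; gpfs0,
fileset1, and the new limit are placeholder values:

   # Check overall capacity and per-fileset inode consumption
   mmdf gpfs0
   mmlsfileset gpfs0 -i -L

   # Raise the maximum number of inodes for the affected fileset
   mmchfileset gpfs0 fileset1 --inode-limit 2000000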

inode_high_warn STATE_CHANGE WARNING no Message: The inode usage of fileset {id[1]} in file system {id[0]} has
reached a warning level. {0}.

Description: The inode usage in the fileset has reached warning level.

Cause: The inode usage in the fileset has reached warning level.

User Action: Delete data.

inode_no_data STATE_CHANGE INFO no Message: No inode usage data is used for fileset {id[1]} in file system
{id[0]}.

Description: No inode usage data is used in performance monitoring.

Cause: N/A

User Action: N/A

inode_normal STATE_CHANGE INFO no Message: The inode usage of fileset {id[1]} in file system {id[0]} reached a
normal level.

Description: The inode usage in the fileset reached a normal level.

Cause: N/A

User Action: N/A

inode_removed STATE_CHANGE INFO no Message: No inode usage data is used for fileset {id[1]} in file system
{id[0]}.

Description: No inode usage data is used in performance monitoring.

Cause: N/A

User Action: N/A

local_exported_fs_unavail STATE_CHANGE ERROR no Message: The local file system {0} that is used for exports is not mounted.

Description: A local file system that is used for export is not mounted.

Cause: A local file system for a declared export is not available.

User Action: Check the local mounts.

low_disk_space_info INFO INFO no Message: Low disk space. StoragePool {1} in file system {0} has reached
the threshold as configured in a migration policy.

Description: Low disk space. A file system reached the threshold as


configured in a migration policy.

Cause: A lowDiskSpace callback was received. A file system has reached


its high occupancy threshold.

User Action: For more information, check warning message and the
mmfs.log.latest log. Check also the Using thresholds to migrate data
between pools section in the IBM Storage Scale: Administration Guide.


low_disk_space_warn STATE_CHANGE WARNING no Message: Low disk space. File system {0} has reached its high occupancy
threshold. StoragePool={1}.

Description: Low disk space. A file system has reached its high occupancy
threshold.

Cause: Low disk space. A file system has reached its high occupancy
threshold.

User Action: For more information, check the warning message and the
mmfs.log.latest log. Clear this event by using the mmhealth event
resolve low_disk_space_warn <fsname> command.
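
For example, with gpfs0 as a placeholder file system name:

   # Check pool occupancy to confirm the condition
   mmdf gpfs0

   # After freeing or migrating data, clear the event manually
   mmhealth event resolve low_disk_space_warn gpfs0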

mounted_fs_check STATE_CHANGE INFO no Message: The file system {0} is mounted.

Description: A mounted file system is reported correctly.

Cause: N/A

User Action: N/A

no_disk_space_clear STATE_CHANGE INFO no Message: A disk space warning has been marked as fixed. A disk space
issue was resolved.

Description: A file system low disk or inode space warning was declared or
detected as resolved.

Cause: N/A

User Action: N/A

no_disk_space_inode STATE_CHANGE ERROR no Message: Fileset {2} runs out of space. Filesystem={0}, StoragePool={1},
reason={3}.

Description: A fileset runs out of inode space.

Cause: A fileset does not have sufficient inode space. Triggered by the
'noDiskSpace' callback.

User Action: For more information, check the error message and the
mmfs.log.latest log. Clear this event by using the mmhealth event
resolve no_disk_space_inode <fsname> command.

no_disk_space_warn STATE_CHANGE ERROR no Message: File system {0} runs out of space. StoragePool={1}, FSet={2},
reason={3}.

Description: A file system runs out of disk space.

Cause: A file system does not have sufficient disk space. Triggered by the
'noDiskSpace' callback.

User Action: For more information, check the error message and the
mmfs.log.latest log. Clear this event by using the mmhealth event
resolve no_disk_space_warn <fsname> command.

not_default_mountpoint TIP TIP no Message: The mountpoint for file system {0} differs from the declared
default value.

Description: The mountpoint differs from the declared default value.

Cause: The mountpoint differs from the declared default value.

User Action: Review your file system configuration.

ok_exposed_fs STATE_CHANGE INFO no Message: The file system {0} has no data exposure risk.

Description: The file system has no data exposure risk.

Cause: N/A

User Action: N/A

ok_replicated_fs STATE_CHANGE INFO no Message: The file system {0} is properly replicated.

Description: The file system is properly replicated.

Cause: N/A

User Action: N/A


ok_unbalanced_fs STATE_CHANGE INFO no Message: The file system {0} is properly balanced.

Description: The file system is properly balanced.

Cause: N/A

User Action: N/A

pool-data_high_error STATE_CHANGE ERROR no Message: The pool {id[1]} of file system {id[0]} has reached a nearly
exhausted data level. {0}.

Description: The pool has reached a nearly exhausted level.

Cause: The pool has reached a nearly exhausted level.

User Action: For more information, add more capacity to pool or move data
to different pool or delete data and/or snapshots.

pool-data_high_warn STATE_CHANGE WARNING no Message: The pool {id[1]} of file system {id[0]} has reached a warning level
for data. {0}.

Description: The pool has reached a warning level.

Cause: The pool has reached a warning level.

User Action: For more information, add more capacity to pool or move data
to different pool or delete data and/or snapshots.

pool-data_no_data STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.

Description: No pool usage data in performance monitoring.

Cause: N/A

User Action: N/A

pool-data_normal STATE_CHANGE INFO no Message: The pool {id[1]} of file system {id[0]} has reached a normal data
level.

Description: The pool has reached a normal level.

Cause: The pool has reached a normal level.

User Action: N/A

pool-data_removed STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.

Description: No pool usage data in performance monitoring.

Cause: N/A

User Action: N/A

pool-metadata_high_error STATE_CHANGE ERROR no Message: The pool {id[1]} of file system {id[0]} has reached a nearly
exhausted metadata level. {0}.

Description: The pool has reached a nearly exhausted level.

Cause: N/A

User Action: Add more capacity to pool or move data to different pool or
delete data and/or snapshots.

pool-metadata_high_warn STATE_CHANGE WARNING no Message: The pool {id[1]} of file system {id[0]} has reached a warning level
for metadata. {0}.

Description: The pool has reached a warning level.

Cause: The pool has reached a warning level.

User Action: Add more capacity to pool or move data to different pool or
delete data and/or snapshots.

pool-metadata_no_data STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.

Description: No pool usage data in performance monitoring.

Cause: N/A

User Action: N/A


pool-metadata_normal STATE_CHANGE INFO no Message: The pool {id[1]} of file system {id[0]} has reached a normal
metadata level.

Description: The pool has reached a normal level.

Cause: N/A

User Action: N/A

pool-metadata_removed STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.

Description: No pool usage data in performance monitoring.

Cause: N/A

User Action: N/A

pool_high_error STATE_CHANGE ERROR no Message: The pool {id[1]} of file system {id[0]} has reached a nearly
exhausted level. {0}.

Description: The pool has reached a nearly exhausted level.

Cause: The pool has reached a nearly exhausted level.

User Action: Add more capacity to pool or move data to different pool or
delete data and/or snapshots.

pool_high_warn STATE_CHANGE WARNING no Message: The pool {id[1]} of file system {id[0]} reached a warning level. {0}.

Description: The pool has reached a warning level.

Cause: The pool has reached a warning level.

User Action: Add more capacity to pool or move data to different pool or
delete data and/or snapshots.

pool_no_data INFO INFO no Message: The state of pool {id[1]} in file system {id[0]} is unknown.

Description: Cannot determine fill state of the pool.

Cause: Cannot determine fill state of the pool.

User Action: N/A

pool_normal STATE_CHANGE INFO no Message: The pool {id[1]} of file system {id[0]} has reached a normal level.

Description: The pool has reached a normal level.

Cause: N/A

User Action: N/A

remote_exported_fs_unavail STATE_CHANGE ERROR no Message: The remote file system {0} that is used for exports is not
mounted.

Description: A remote file system that is used for export is not mounted.

Cause: A remote file system for a declared export is not available.

User Action: Check the remote mount states and exports of the remote
cluster.

stale_mount STATE_CHANGE ERROR no Message: Found stale mounts for {0}.

Description: A mount state information mismatch was detected between


the mmlsmount command and /proc/mounts.

Cause: A file system may not be fully mounted or unmounted.

User Action: To verify that all expected file systems are mounted, run the
mmlsmount all_local -L command.
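
A quick way to cross-check the two views of the mount state described above
(filtering /proc/mounts on the gpfs file system type is an assumption that fits the
usual setup):

   # List the mount state of all local file systems as seen by IBM Storage Scale
   mmlsmount all_local -L

   # Compare against the operating system view of mounted file systems
   grep gpfs /proc/mounts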

unmounted_fs_check STATE_CHANGE WARNING no Message: The file system {0} is probably needed, but not mounted.

Description: An internally mounted or a declared, but not mounted file


system was detected.

Cause: A declared file system is not mounted.

User Action: To verify that all expected file systems are mounted, run the
mmlsmount all_local -L command.


unmounted_fs_ok STATE_CHANGE INFO no Message: The file system {0} is probably needed, but not automounted or
automount is prevented.

Description: An internally mounted or a declared but not mounted file


system was detected.

Cause: N/A

User Action: N/A

File system manager events


The following table lists the events that are created for the Filesysmgr component.
Table 83. Events for the file system manager component

Event | Event Type | Severity | Call Home | Details

filesystem_mgr INFO_ADD_ENTITY INFO no Message: File system {0} is managed by this node.

Description: A file system managed by this node was detected.

Cause: N/A

User Action: N/A

filesystem_no_mgr INFO_DELETE_ENTITY INFO no Message: File system {0} no longer managed by this node.

Description: A previously managed file system is no longer managed by


this node.

Cause: A previously managed file system is not managed by this node


anymore.

User Action: N/A

managed_by_this_node STATE_CHANGE INFO no Message: The file system {0} is managed by this node.

Description: This file system is managed by this node.

Cause: The file system is managed by this node.

User Action: N/A

qos_check_done INFO INFO no Message: The QOS check cycle did succeed.

Description: The QOS check cycle did succeed.

Cause: The QOS check cycle did succeed.

User Action: N/A

qos_check_ok STATE_CHANGE INFO no Message: File system {0} QOS check is OK.

Description: The QOS check succeeded for this file system.

Cause: The QOS check succeeded for this file system.

User Action: N/A

qos_check_warn INFO WARNING no Message: The QOS check cycle did not succeed.

Description: The QOS check cycle did not succeed.

Cause: The QOS check cycle did not succeed.

User Action: This might be a temporary issue. Check the configuration and
the mmqos command output.

qos_not_active STATE_CHANGE WARNING no Message: File system {0} QOS is enabled but throttling or monitoring is not
active.

Description: The QOS is declared as enabled but throttling or monitoring is


not active.

Cause: The QOS is declared as enabled but throttling or monitoring is not


active.

User Action: Check configuration.


qos_sensors_active TIP INFO no Message: The perfmon sensor GPFSQoS is active.

Description: The GPFSQoS perfmon sensor is active.

Cause: The GPFSQoS perfmon sensors' period attribute is greater than 0.

User Action: N/A

qos_sensors_clear TIP INFO no Message: Clear any previous bad QoS sensor state.

Description: Clear any previous bad GPFSQoS sensor state flag.

Cause: Clear any previous bad GPFSQoS sensor state.

User Action: N/A

qos_sensors_inactive TIP TIP no Message: The perfmon sensor GPFSQoS is inactive.

Description: The perfmon sensor GPFSQoS is not active.

Cause: The GPFSQoS perfmon sensors' period attribute is 0.

User Action: Enable the perfmon GPFSQoS sensor by setting its period
attribute to a value greater than 0. To do so, use the mmperfmon config
update GPFSQoS.period=N command, where N is a number greater than 0.
Alternatively, you can hide this event by using the mmhealth event hide
qos_sensors_inactive command.
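
For example, to activate the sensor with a 10-second period (10 is an example
value), or to hide the event instead:

   # Enable the GPFSQoS sensor by giving it a non-zero period
   mmperfmon config update GPFSQoS.period=10

   # Alternatively, hide this TIP event
   mmhealth event hide qos_sensors_inactive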

qos_sensors_not_configured TIP TIP no Message: The QoS perfmon sensor GPFSQoS is not configured.

Description: The QoS perfmon sensor does not appear in the mmperfmon
config show command output.

Cause: The QoS performance sensor GPFSQoS is not configured in the
sensors configuration file.

User Action: Include the sensor in the perfmon configuration
by using the mmperfmon config add --sensors /opt/IBM/
zimon/defaults/ZIMonSensors_GPFSQoS.cfg command. For more
information, see the mmperfmon command section in the 'Command
Reference Guide'.
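
The sensor can be added exactly as stated in the entry above:

   # Add the GPFSQoS sensor definition shipped with the performance collector
   mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSQoS.cfg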

qos_sensors_not_needed TIP TIP no Message: QoS is not configured but performance sensor GPFSQoS period is
declared.

Description: There is an active performance sensor GPFSQoS period
declared.

Cause: QoS is not configured but the performance sensor GPFSQoS is
active.

User Action: Disable the performance GPFSQoS sensor.

qos_state_mismatch TIP TIP no Message: File system {0} has an enablement mismatch in QOS state.

Description: Mismatch between the declared QOS state and the current
state. One state is enabled and the other state is not enabled.

Cause: There is a mismatch between the declared QOS state and the
current state.

User Action: Check configuration.

qos_version_mismatch TIP TIP no Message: File system {0} has a version mismatch in QOS state.

Description: There is a mismatch between the declared QOS version and


the current version.

Cause: There is a mismatch between the declared QOS version and the
current version.

User Action: Check configuration.



GDS events
The following table lists the events that are created for the GDS component.
Table 84. Events for the GDS component

Event | Event Type | Severity | Call Home | Details

gds_check_bad STATE_CHANGE ERROR no Message: The GDS check has failed.

Description: The GDS test program gdscheck did not return the expected
success message.

Cause: The GDS check has failed.

User Action: Check the GDS configuration and settings for the
gdscheckfile parameter. For more information, see the 'gpudirect'
section of the '/var/mmfs/mmsysmon/mmsysmonitor.conf' system health
monitor configuration file.

gds_check_ok STATE_CHANGE INFO no Message: The GDS check is OK.

Description: The GDS test program gdscheck returned successfully.

Cause: The GDS check is OK.

User Action: N/A

gds_check_warn INFO WARNING no Message: The GDS state cannot be determined.

Description: The GDS test program has failed or ran into a timeout.

Cause: The GDS test program failed or ran into a timeout.

User Action: Check the GDS configuration and settings for the
gdscheckfile parameter. For more information, see the 'gpudirect'
section of the '/var/mmfs/mmsysmon/mmsysmonitor.conf' file.

gds_prerequisite_bad STATE_CHANGE ERROR no Message: The GDS prerequisite check for enabled RDMA failed.

Description: The verbsRdma, which is an RDMA prerequisite, is not


enabled in the IBM Storage Scale configuration.

Cause: The GDS fails when the RDMA is disabled.

User Action: Check the RDMA configuration and enable verbsRdma by


using the mmchconfig command.
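
A sketch of the prerequisite fix above; enabling RDMA typically also requires a
working InfiniBand or RoCE fabric, and the daemon may need to be restarted for the
change to take effect:

   # Enable RDMA, which GPUDirect Storage requires
   mmchconfig verbsRdma=enable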

GPFS events
The following table lists the events that are created for the GPFS component.
Table 85. Events for the GPFS component

Event | Event Type | Severity | Call Home | Details

callhome_enabled TIP INFO no Message: Call home is installed, configured, and enabled.

Description: With enabling the call home functionality, you are providing
useful information to the developers, which helps to improve the product.

Cause: Call home packages are installed. Call home is configured and
enabled.

User Action: N/A

callhome_not_enabled TIP TIP no Message: Call home is not installed, configured, or enabled.

Description: Call home is a functionality that uploads cluster configuration


and log files onto the IBM ECuRep servers. It provides helpful information
for the developers to improve the product as well as for the support to help
in PMR cases.

Cause: Call home packages are not installed, there is no call home
configuration, there are no call home groups, or no call home group was
enabled.

User Action: Use the mmcallhome command to setup call home.


callhome_not_monitored TIP INFO no Message: Call home status is not monitored on the current node.

Description: Call home status is not monitored on the current node, but
was, when it was the cluster manager.

Cause: Previously, this node was a cluster manager, and Call Home
monitoring was running on it.

User Action: N/A

callhome_without_schedule TIP TIP no Message: Call home is enabled, but neither a daily nor a weekly schedule is
configured.

Description: Call home is enabled, but neither a daily nor a weekly schedule
is configured. It is recommended to enable daily or weekly call home
schedules.

Cause: Call home is enabled, but neither a daily nor a weekly schedule is
configured.

User Action: Enable daily call home uploads by using the mmcallhome
schedule add --task DAILY command.
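
For example, to set up the recommended schedules (WEEKLY is shown as an assumed
additional task value):

   # Enable the daily call home upload
   mmcallhome schedule add --task DAILY

   # Optionally enable a weekly upload as well
   mmcallhome schedule add --task WEEKLY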

ccr_auth_keys_disabled STATE_CHANGE INFO no Message: The security file that is used by GPFS CCR is not checked on this
node.

Description: The check for the security file used by GPFS CCR is disabled
on this node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_auth_keys_fail STATE_CHANGE ERROR FTDC upload Message: The security file that is used by GPFS CCR is corrupt.
Item={0},ErrMsg={1},Failed={2}.

Description: The security file used by GPFS CCR is corrupt. For more
information, see message.

Cause: Either the security file is missing or corrupt.

User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command with NODE specifying intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.
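
For example, a recovery sketch (node names are placeholders; quorumNode2 stands for any still-intact quorum node):

    # On the degraded node, restore the configuration and CCR data from an intact node
    mmsdrrestore -p quorumNode2

    # Afterwards, confirm that the node reports a healthy GPFS component state again
    mmhealth node show GPFS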

ccr_auth_keys_ok STATE_CHANGE INFO no Message: The security file that is used by GPFS CCR is OK {0}.

Description: The security file used by GPFS CCR is OK.

Cause: N/A

User Action: N/A

ccr_client_init_disabled STATE_CHANGE INFO no Message: GPFS CCR client initialization is not checked on this node.

Description: The check for GPFS CCR client initialization is disabled on this
node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_client_init_fail STATE_CHANGE ERROR no Message: GPFS CCR client initialization has failed.
Item={0},ErrMsg={1},Failed={2}.

Description: The GPFS CCR client initialization has failed. For more
information, see message.

Cause: The item specified in the message is either not available or corrupt.

User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command with NODE specifying intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.

ccr_client_init_ok STATE_CHANGE INFO no Message: GPFS CCR client initialization is OK {0}.

Description: GPFS CCR client initialization is OK.

Cause: N/A

User Action: N/A

ccr_client_init_warn STATE_CHANGE WARNING no Message: GPFS CCR client initialization has failed.
Item={0},ErrMsg={1},Failed={2}.

Description: The GPFS CCR client initialization has failed. For more
information, see message.

Cause: The item specified in the message is either not available or corrupt.

User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command with NODE specifying intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.

ccr_comm_dir_disabled STATE_CHANGE INFO no Message: The files that are committed to the GPFS CCR are not checked on
this node.

Description: The check for the files that are committed to the GPFS CCR is
disabled on this node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_comm_dir_fail STATE_CHANGE ERROR FTDC upload Message: The files committed to the GPFS CCR are not complete or
corrupt. Item={0},ErrMsg={1},Failed={2}.

Description: The files committed to the GPFS CCR are not complete or
corrupt. For more information, see message.

Cause: The local disk might be full.

User Action: Check the local disk space and remove unnecessary
files. Recover this degraded node from a still intact node by using the
mmsdrrestore -p <NODE> command with NODE specifying an intact node.
For more information, see the mmsdrrestore command in the Command
Reference Guide.

ccr_comm_dir_ok STATE_CHANGE INFO no Message: The files committed to the GPFS CCR are complete and intact
{0}.

Description: The files committed to the GPFS CCR are complete and intact.

Cause: N/A

User Action: N/A

ccr_comm_dir_warn STATE_CHANGE WARNING no Message: The files that are committed to the GPFS CCR are not complete
or corrupt. Item={0},ErrMsg={1},Failed={2}.

Description: The files that are committed to the GPFS CCR are not
complete or corrupt. For more information, see message.

Cause: The local disk may be full.

User Action: Check the local disk space and remove unnecessary
files. Recover this degraded node from a still intact node by using the
mmsdrrestore -p <NODE> command with NODE specifying an intact node.
For more information, see the mmsdrrestore command in the Command
Reference Guide.

ccr_ip_lookup_disabled STATE_CHANGE INFO no Message: The IP address lookup for the GPFS CCR component is not
checked on this node.

Description: The check for the IP address lookup for the GPFS CCR
component is disabled on this node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_ip_lookup_ok STATE_CHANGE INFO no Message: The IP address lookup for the GPFS CCR component is OK {0}.

Description: The IP address lookup for the GPFS CCR component is OK.

Cause: N/A

User Action: N/A

ccr_ip_lookup_warn STATE_CHANGE WARNING no Message: The IP address lookup for the GPFS CCR component takes too
long. Item={0},ErrMsg={1},Failed={2}.

Description: The IP address lookup for the GPFS CCR component takes too
long, resulting in slow administration commands. For more information, see
message.

Cause: Either the local network or the DNS is misconfigured.

User Action: Check the local network and DNS configuration.

ccr_local_server_disabled STATE_CHANGE INFO no Message: The local GPFS CCR server is not checked on this node.

Description: The check for the local GPFS CCR server is disabled on this
node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_local_server_ok STATE_CHANGE INFO no Message: The local GPFS CCR server is reachable {0}.

Description: The local GPFS CCR server is reachable.

Cause: N/A

User Action: N/A

ccr_local_server_warn STATE_CHANGE WARNING no Message: The local GPFS CCR server is not reachable.
Item={0},ErrMsg={1},Failed={2}.

Description: The local GPFS CCR server is not reachable. For more
information, see message.

Cause: Either the local network or firewall is configured wrong, or the local
GPFS daemon does not respond.

User Action: Check the network and firewall configuration with regards to
the used GPFS communication port (default: 1191). Restart GPFS on this
node.

ccr_paxos_12_disabled STATE_CHANGE INFO no Message: The stored GPFS CCR state is not checked on this node.

Description: The check for the stored GPFS CCR state is disabled on this
node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_paxos_12_fail STATE_CHANGE ERROR FTDC upload Message: The stored GPFS CCR state is corrupt.
Item={0},ErrMsg={1},Failed={2}.

Description: The stored GPFS CCR state is corrupt. For more information,
see message.

Cause: The CCR on quorum nodes has inconsistent states. Use the mmccr
check -e command to check the detailed status.

User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command with NODE specifying intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.

ccr_paxos_12_ok STATE_CHANGE INFO no Message: The stored GPFS CCR state is OK {0}.

Description: The stored GPFS CCR state is OK.

Cause: N/A

User Action: N/A

ccr_paxos_12_warn STATE_CHANGE WARNING no Message: The stored GPFS CCR state is corrupt.
Item={0},ErrMsg={1},Failed={2}.

Description: The stored GPFS CCR state is corrupt. For more information,
see message.

Cause: One stored GPFS state file is missing or corrupt.

User Action: No user action necessary. GPFS repairs this automatically.

ccr_paxos_cached_disabled STATE_CHANGE INFO no Message: The stored GPFS CCR state is not checked on this node.

Description: The check for the stored GPFS CCR state is disabled on this
node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_paxos_cached_fail STATE_CHANGE ERROR no Message: The stored GPFS CCR state is corrupt.
Item={0},ErrMsg={1},Failed={2}.

Description: The stored GPFS CCR state is corrupt. For more information,
see message.

Cause: The stored GPFS CCR state file is either corrupt or empty.

User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command with NODE specifying an intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.

ccr_paxos_cached_ok STATE_CHANGE INFO no Message: The stored GPFS CCR state is OK {0}.

Description: The stored GPFS CCR state is OK.

Cause: N/A

User Action: N/A

ccr_quorum_nodes_disabled STATE_CHANGE INFO no Message: The quorum nodes reachability is not checked on this node.

Description: The check for the reachability of the quorum nodes is
disabled on this node, since this node is not a quorum node.

Cause: N/A

User Action: N/A

ccr_quorum_nodes_fail STATE_CHANGE ERROR no Message: A majority of the quorum nodes are not reachable over the
management network. Item={0},ErrMsg={1},Failed={2}.

Description: A majority of the quorum nodes are not reachable over the
management network. GPFS declares quorum loss. For more information,
see message.

Cause: The quorum nodes cannot communicate with each other caused by
a network or firewall misconfiguration.

User Action: Check the network or firewall (default port 1191 must not be
blocked) configuration of the unreachable quorum nodes.

ccr_quorum_nodes_ok STATE_CHANGE INFO no Message: All quorum nodes are reachable {0}.

Description: All quorum nodes are reachable.

Cause: N/A

User Action: N/A

ccr_quorum_nodes_warn STATE_CHANGE WARNING no Message: At least one quorum node is not reachable.
Item={0},ErrMsg={1},Failed={2}.

Description: At least one quorum node is not reachable. For more
information, see message.

Cause: The quorum node is not reachable because of a network or firewall
misconfiguration.

User Action: Check the network or firewall (default port 1191 must not be
blocked) configuration of the unreachable quorum node.

ccr_tiebreaker_dsk_disabled STATE_CHANGE INFO no Message: The accessibility of the tiebreaker disks that are used by the
GPFS CCR is not checked on this node.

Description: The accessibility check for the tiebreaker disks that are used
by the GPFS CCR is disabled on this node, since it is not a quorum node.

Cause: N/A

User Action: N/A

ccr_tiebreaker_dsk_fail STATE_CHANGE ERROR no Message: Access to the tiebreaker disks has failed.
Item={0},ErrMsg={1},Failed={2}.

Description: Access to all tiebreaker disks has failed. For more
information, see message.

Cause: Corrupt disk.

User Action: Check whether the tiebreaker disks are available.

ccr_tiebreaker_dsk_ok STATE_CHANGE INFO no Message: All tiebreaker disks that are used by the GPFS CCR are accessible
{0}.

Description: All tiebreaker disks that are used by the GPFS CCR are
accessible.

Cause: N/A

User Action: N/A

ccr_tiebreaker_dsk_warn STATE_CHANGE WARNING no Message: At least one tiebreaker disk is not accessible.
Item={0},ErrMsg={1},Failed={2}.

Description: At least one tiebreaker disk is not accessible. For more
information, see message.

Cause: Corrupt disk.

User Action: Check whether the tiebreaker disk is accessible.

cluster_connections_bad STATE_CHANGE WARNING no Message: Connection to cluster node {0} has {1} bad connection(s).
(Maximum {2}).

Description: The cluster internal network to a node is in a bad state. Not all
possible connections are working.

Cause: The cluster internal network to a node is in a bad state. Not all
possible connections are working.

User Action: Check whether the cluster network is good. The event
can be manually cleared by using the mmhealth event resolve
cluster_connections_bad command.
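
For example, a minimal sketch of clearing the event once the network problem is fixed:

    # Review the active GPFS component events on this node
    mmhealth node show GPFS

    # After verifying that the cluster network is healthy, clear the event manually
    mmhealth event resolve cluster_connections_bad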

cluster_connections_clear STATE_CHANGE INFO no Message: Cleared all cluster internal connection states.

Description: The cluster internal network is in a good state. All possible
connections are working.

Cause: N/A

User Action: N/A

cluster_connections_down STATE_CHANGE WARNING no Message: Connection to cluster node {0} has all {1} connection(s) down.
(Maximum {2}).

Description: The cluster internal network to a node is in a bad state. All
possible connections are down.

Cause: The cluster internal network to a node is in a bad state. All possible
connections are down.

User Action: Check whether the cluster network is good. The event
can be manually cleared by using the mmhealth event resolve
cluster_connections_down command.

cluster_connections_ok STATE_CHANGE INFO no Message: All connections are good for target ip {0}.

Description: The cluster internal network is in a good state. All possible
connections are working.

Cause: N/A

User Action: N/A

csm_resync_forced STATE_CHANGE_EXTERNAL INFO no Message: All events and state are transferred to the cluster manager.

Description: All events and state are transferred to the cluster manager.

Cause: The mmhealth node show --resync command was executed.

User Action: N/A

csm_resync_needed STATE_CHANGE_EXTERNAL WARNING no Message: Forwarding of an event to the cluster manager failed multiple
times.

Description: Forwarding of an event to the cluster manager failed multiple
times, which causes the mmhealth cluster show command to show
stale data.

Cause: Cluster manager node cannot be reached.

User Action: Check state and connection of the cluster manager node.
Then, run the mmhealth node show --resync command.

deadlock_detected STATE_CHANGE WARNING no Message: The cluster detected a file system deadlock in the IBM Storage
Scale file system.

Description: The cluster detected a deadlock in the IBM Storage Scale file
system.

Cause: High file system activity might cause this issue.

User Action: The problem may be temporary or may persist. For more
information, check the /var/adm/ras/mmfs.log.latest log file.

disk_call_home INFO_EXTERNAL ERROR service ticket Message: Disk requires replacement: event:{0}, eventName:{1}, rgName:
{2}, daName:{3}, pdName:{4}, pdLocation:{5}, pdFru:{6}, rgErr:{7},
rgReason:{8}.

Description: Disk requires replacement.

Cause: Hardware monitoring callback reported a faulty disk.

User Action: Contact IBM support for further guidance.

disk_call_home2 INFO_EXTERNAL ERROR service ticket Message: Disk requires replacement: event:{0}, eventName:{1}, rgName:
{2}, daName:{3}, pdName:{4}, pdLocation:{5}, pdFru:{6}, rgErr:{7},
rgReason:{8}.

Description: Disk requires replacement.

Cause: Hardware monitoring callback reported a faulty disk.

User Action: Contact IBM support for further guidance.

ess_ptf_update_available TIP TIP no Message: For the currently installed IBM ESS packages, the PTF update {0}
PTF {1} is available.

Description: For the currently installed IBM ESS packages a PTF update is
available.

Cause: PTF updates are available for the currently installed 'gpfs.ess.tools'
package.

User Action: Visit IBM Fix Central to download and install the updates.

event_hidden INFO_EXTERNAL INFO no Message: The event {0} was hidden.

Description: An event used in the system health framework was hidden.
It can still be seen with the '--verbose' flag in the mmhealth node show
ComponentName command when it is active. However, it does not affect
the component state anymore.

Cause: The mmhealth event hide command was used.

User Action: Use mmhealth event list hidden command to see all
hidden events. Use the mmhealth event unhide command to show the
event again.

event_test_info INFO INFO no Message: Test info event that is received from GPFS daemon. Arg0:{0}
Arg1:{1}.

Description: Test event that is raised by using the mmfsadm test
raiserasevent command.

Cause: N/A

User Action: To raise this test event, run the mmfsadm test
raiseRASEvent 0 arg1txt arg2txt command. The event shows up in
the event log. For more information, see the mmhealth node eventlog
command.

event_test_ok STATE_CHANGE INFO no Message: Test OK event that is received from GPFS daemon for entity: {id}
Arg0:{0} Arg1:{1}.

Description: Test OK event that is raised by using the mmfsadm test
raiserasevent command.

Cause: N/A

User Action: N/A

event_test_statechange STATE_CHANGE WARNING no Message: Test State-Change event that is received from GPFS daemon for
entity: {id} Arg0:{0} Arg1:{1}.

Description: Test State-Change event that is raised by using the mmfsadm
test raiserasevent command.

Cause: This event was raised by the user. It is a test event.

User Action: To raise this test event, run the mmfsadm test
raiseRASEvent 1 id arg1txt arg2txt command. The event
changes the GPFS state to DEGRADED. For more information, see the
mmhealth node show command. Raise the 'event_test_ok' event to
change state back to HEALTHY.
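
For example, a sketch of raising and reviewing the test events described above (myEntity and the argument strings are arbitrary placeholders):

    # Raise an informational test event (it shows up in the node event log)
    mmfsadm test raiseRASEvent 0 arg1txt arg2txt

    # Raise a state-change test event for an entity; the GPFS state changes to DEGRADED
    mmfsadm test raiseRASEvent 1 myEntity arg1txt arg2txt

    # Review the event log and the node state
    mmhealth node eventlog
    mmhealth node show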

event_unhidden INFO_EXTERNAL INFO no Message: The event {0} was unhidden.

Description: An event was unhidden, which means that the event affects
its component's state when it is active. Furthermore, it is shown in the
event table of the mmhealth node show ComponentName command
without the '--verbose' flag.

Cause: The mmhealth event unhide command was used.

User Action: If this is an active TIP event, then fix it or hide by using the
mmhealth event hide command.

gpfs_cache_cfg_high TIP TIP no Message: The GPFS cache settings may be too high for the installed total
memory.

Description: The cache settings for maxFilesToCache, maxStatCache, and
pagepool are close to the amount of total memory.

Cause: The configured cache settings are close to the total memory.
The settings for pagepool, maxStatCache, and maxFilesToCache, in total,
exceed the recommended value, which is 90% by default.

User Action: For more information on maxStatCache size, see the 'Cache
usage' section in the Administration Guide. Check whether there is enough
memory available.

gpfs_cache_cfg_ok TIP INFO no Message: The GPFS cache memory configuration is OK.

Description: The GPFS cache memory configuration is OK. The values for
maxFilesToCache, maxStatCache, and pagepool fit to the amount of total
memory and configured services.

Cause: The GPFS cache memory configuration is OK.

User Action: N/A

gpfs_deadlock_detection_disabled TIP TIP no Message: The GPFS deadlockDetectionThreshold is set to 0.

Description: Automated deadlock detection monitors waiters. The
deadlock detection relies on a configurable threshold to determine whether
a deadlock is in progress.

Cause: Automated deadlock detection is disabled in IBM Storage Scale.

User Action: Configure the 'deadlockDetectionThreshold' parameter value
to a positive value by using the mmchconfig command.
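
For example, a minimal sketch of re-enabling automated deadlock detection (the value 300 is an assumption; choose a threshold that fits your workload):

    # Check the currently configured value
    mmlsconfig deadlockDetectionThreshold

    # Set the threshold to a positive value, for example 300 seconds
    mmchconfig deadlockDetectionThreshold=300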

gpfs_deadlock_detection_ok TIP INFO no Message: The GPFS deadlockDetectionThreshold is greater than zero.

Description: Automated deadlock detection monitors waiters. The
deadlock detection relies on a configurable threshold to determine whether
a deadlock is in progress.

Cause: Automated deadlock detection is enabled in IBM Storage Scale.

User Action: N/A

gpfs_down STATE_CHANGE ERROR no Message: The IBM Storage Scale service process is not running on this node.
Normal operation cannot be done.

Description: The IBM Storage Scale service is not running. This can be an
expected state when the IBM Storage Scale service is shutdown.

Cause: The IBM Storage Scale service is not running.

User Action: Check the state of the IBM Storage Scale file system daemon,
and check for the root cause in the /var/adm/ras/mmfs.log.latest
log.

gpfs_maxfilestocache_ok TIP INFO no Message: The GPFS maxFilesToCache is greater than 100,000.

Description: The GPFS maxFilesToCache is greater than 100,000. Consider
that the actively used configuration is monitored. You can see the actively
used configuration by using the mmdiag --config command.

Cause: The GPFS maxFilesToCache is greater than 100,000.

User Action: N/A

gpfs_maxfilestocache_small TIP TIP no Message: The GPFS maxfilestocache is smaller than or equal to 100,000.

Description: The size of maxFilesToCache is essential to achieve optimal
performance, especially on protocol nodes. With a larger maxFilesToCache
size, IBM Storage Scale can handle more concurrently open files and
is able to cache more recently used files, which makes IO operations
more efficient. This event is raised because the maxFilesToCache value is
configured less than or equal to 100,000 on a protocol node.

Cause: The size of maxFilesToCache is essential to achieve optimal
performance, especially on protocol nodes. With a larger maxFilesToCache
size, IBM Storage Scale can handle more concurrently open files and
is able to cache more recently used files, which makes IO operations
more efficient. This event is raised because the maxFilesToCache value is
configured less than or equal to 100,000 on a protocol node.

User Action: For more information on maxFilesToCache size, see
the 'Cache usage' section in the Administration Guide. Although the
maxFilesToCache size should be greater than 100,000, there are
situations in which the administrator decides against a maxFilesToCache
size that is greater than 100,000. In this case or in case that the
current setting fits, hide the event by using the GUI or the mmhealth
event hide command. The maxFilesToCache can be changed by
using the mmchconfig command. The gpfs_maxfilestocache_small event
automatically disappears as soon as the new maxFilesToCache value is
greater than 100,000 is active. Restart the gpfs daemon, if required.
Consider that the actively used configuration is monitored. You can list
the actively used configuration by using the mmdiag --config command,
which includes changes that are not activated as yet.
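
For example, a tuning sketch (the value 1,000,000 is an assumption; size it to your protocol workload and available memory):

    # Show the value that is currently active on this node
    mmdiag --config | grep -i maxFilesToCache

    # Raise maxFilesToCache above 100,000, for example to 1,000,000
    mmchconfig maxFilesToCache=1000000

    # The new value becomes active after the GPFS daemon is restarted on the affected nodes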

gpfs_maxstatcache_high TIP TIP no Message: The GPFS maxStatCache is greater than 0 on a Linux system.

Description: The size of maxStatCache is useful to improve the
performance of both the system and IBM Storage Scale stat() calls for
applications with a working set that does not fit in the regular file
cache. Nevertheless, the stat cache is not effective on Linux platform.
Therefore, it is recommended to set the maxStatCache attribute to 0 on
a Linux platform. This event is raised because the maxStatCache value is
configured greater than 0 on a Linux system.

Cause: The size of maxStatCache is useful to improve the performance of
both the system and IBM Storage Scale stat() calls for applications with
a working set that does not fit in the regular file cache. Nevertheless, the
stat cache is not effective on Linux platform. Therefore, it is recommended
to set the maxStatCache attribute to 0 on a Linux platform. This event is
raised because the maxStatCache value is configured greater than 0 on a
Linux system.

User Action: For more information on the maxStatCache size, see
the 'Cache usage' section in the Administration Guide. Although the
maxStatCache size should be 0 on a Linux system, there are situations
in which the administrator decides against a maxStatCache size of 0. In this
case or in case that the current setting fits, hide the event either by using
the GUI or the mmhealth event hide command. The maxStatCache can
be changed with the mmchconfig command. The gpfs_maxstatcache_high
event automatically disappears as soon as the new maxStatCache value of
0 is active. Restart the gpfs daemon, if required. Consider that the actively
used configuration is monitored. You can list the actively used configuration
by using the mmdiag --config command, which includes changes that
are not activated as yet.

gpfs_maxstatcache_low TIP TIP no Message: The GPFS maxStatCache is smaller than the maxFilesToCache
setting.

Description: The size of maxStatCache is useful to improve the
performance of both the system and IBM Storage Scale stat() calls for
applications with a working set that does not fit in the regular file cache.

Cause: The GPFS maxStatCache is smaller than the maxFilesToCache
setting.

User Action: For more information on the maxStatCache size, see the
'Cache usage' section in the Administration Guide. In case that the current
setting fits your needs, hide the event either by using the GUI or the
mmhealth event hide command. The maxStatCache can be changed
by using the mmchconfig command. Consider that the actively used
configuration is monitored. You can list the actively used configuration by
using the mmdiag --config command, which includes changes that are
not activated as yet.

gpfs_maxstatcache_ok TIP INFO no Message: The GPFS maxStatCache is set to default or at least to the
maxFilesToCache value.

Description: The GPFS maxStatCache value is set to 0 on a Linux system
as default. Consider that the actively used configuration is monitored. You
can see the actively used configuration by using the mmdiag --config
command.

Cause: The GPFS maxStatCache is set to default or at least to the
maxFilesToCache value.

User Action: N/A

gpfs_pagepool_ok TIP INFO no Message: The GPFS pagepool is greater than 1G.

Description: The GPFS pagepool is greater than 1G. Consider that the
actively used configuration is monitored. You can see the actively used
configuration by using the mmdiag --config command.

Cause: The GPFS pagepool is greater than 1G.

User Action: N/A

gpfs_pagepool_small TIP TIP no Message: The GPFS pagepool is less than or equal to 1G.

Description: The size of the pagepool is essential to achieve optimal
performance. With a larger pagepool, IBM Storage Scale can cache or
prefetch more data, which makes IO operations more efficient. This event
is raised because the pagepool is configured less than or equal to 1G.

Cause: The size of the pagepool is essential to achieve optimal
performance. With a larger pagepool, IBM Storage Scale can cache or
prefetch more data, which makes IO operations more efficient. This event
is raised because the pagepool is configured less than or equal to 1G.

User Action: For more information on the pagepool size, see the 'Cache
usage' section in the Administration Guide. Although the pagepool should
be greater than 1G, there are situations in which the administrator decides
against a pagepool that is greater than 1G. In this case or in case that
the current setting fits, hide the event by using the GUI or the mmhealth
event hide command. The pagepool can be changed by using the
mmchconfig command. The 'gpfs_pagepool_small' event automatically
disappears as soon as the new pagepool value, which is greater than
1G is active. Use the mmchconfig -i flag command or restart if
required. For more information, see the mmchconfig command in the
Command Reference Guide. Consider that the actively used configuration is
monitored. You can list the actively used configuration by using the mmdiag
--config command, which includes changes that are not activated as yet.
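
For example, a minimal sketch of increasing the pagepool (the 4G value and the node name are assumptions; size the pagepool to your memory and workload):

    # Show the currently active pagepool size
    mmdiag --config | grep pagepool

    # Increase the pagepool to 4 GiB and apply it immediately on this node (placeholder node name)
    mmchconfig pagepool=4G -i -N thisNode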

gpfs_unresponsive STATE_CHANGE ERROR no Message: The IBM Storage Scale service process is unresponsive on this
node. Normal operation cannot be done.

Description: The IBM Storage Scale service is unresponsive. This can be an
expected state when the IBM Storage Scale service is shut down.

Cause: The IBM Storage Scale service is unresponsive.

User Action: Check the state of the IBM Storage Scale file system daemon,
and check for the root cause in the /var/adm/ras/mmfs.log.latest
log.

gpfs_up STATE_CHANGE INFO no Message: The IBM Storage Scale service process is running.

Description: The IBM Storage Scale service is running.

Cause: N/A

User Action: N/A

gpfs_warn INFO WARNING no Message: The IBM Storage Scale process monitoring returned unknown
result. This can be a temporary issue.

Description: The check of the IBM Storage Scale file system daemon
returned an unknown result. This can be a temporary issue, like a timeout
during the check procedure.

Cause: The IBM Storage Scale file system daemon state cannot be
determined due to a problem.

User Action: Find potential issues for this kind of failure in
the /var/adm/ras/mmsysmonitor.log file.

gpfsport_access_down STATE_CHANGE ERROR no Message: No access to IBM Storage Scale ip {0} port {1}. Check the firewall
settings.

Description: The access check of the local IBM Storage Scale file system
daemon port has failed.

Cause: The port is probably blocked by a firewall rule.

User Action: Check whether the IBM Storage Scale file system daemon is
running and check the firewall for blocking rules on this port.

gpfsport_access_up STATE_CHANGE INFO no Message: Access to IBM Storage Scale ip {0} port {1} is OK.

Description: The TCP access check of the local IBM Storage Scale file
system daemon port was successful.

Cause: N/A

User Action: N/A

gpfsport_access_warn INFO WARNING no Message: IBM Storage Scale access check ip {0} port {1} failed. Check for a
valid IBM Storage Scale-IP.

Description: The access check of the IBM Storage Scale file system
daemon port has returned an unknown result.

Cause: The IBM Storage Scale file system daemon port access cannot be
determined due to a problem.

User Action: Find potential issues for this kind of failure in the logs.

gpfsport_down STATE_CHANGE ERROR no Message: IBM Storage Scale port {0} is not active.

Description: The expected local IBM Storage Scale file system daemon
port was not detected.

Cause: The IBM Storage Scale file system daemon is not running.

User Action: Check whether the IBM Storage Scale service is running.

gpfsport_up STATE_CHANGE INFO no Message: IBM Storage Scale port {0} is active.

Description: The expected local IBM Storage Scale file system daemon
port was detected.

Cause: N/A

User Action: N/A

gpfsport_warn INFO WARNING no Message: IBM Storage Scale monitoring ip {0} port {1} has returned an
unknown result.

Description: The check of the IBM Storage Scale file system daemon port
has returned an unknown result.

Cause: The IBM Storage Scale file system daemon port cannot be
determined due to a problem.

User Action: Find potential issues for this kind of failure in the logs.

info_on_duplicate_events INFO INFO no Message: The event {0} {id} was repeated {1} times.

Description: Multiple messages of the same type were de-duplicated to
avoid log flooding.

Cause: Multiple events of the same type are processed.

User Action: N/A

kernel_io_hang_detected STATE_CHANGE ERROR no Message: A kernel IO hang has been detected on disk {0} affecting file
system {1}.

Description: I/Os to the underlying storage system have been
pending for more than the configured threshold time, which is
'ioHangDetectorTimeout'. When panicOnIOHang is enabled, this can force a
kernel panic.

Cause: N/A

User Action: Check the underlying storage system and reboot the node to
resolve the current hang condition.

kernel_io_hang_resolved STATE_CHANGE INFO no Message: A kernel IO hang on disk {id} has been resolved.

Description: Pending I/Os to the underlying storage system have been
resolved manually.

Cause: N/A

User Action: N/A

local_fs_filled STATE_CHANGE WARNING no Message: The local file system with the mount point {1} used for {0}
reached a warning level with less than 1000 MB, but more than 100 MB
free space.

Description: The monitored file system has reached an available space
value less than 1000 MB, but more than 100 MB.

Cause: The local file systems reached a warning level of under 1000 MB.

User Action: Detect large files on the local file system by using the 'du -cks *
|sort -rn |head -11' command, and delete or move data to free space.

local_fs_full STATE_CHANGE ERROR no Message: The local file system with the mount point {1} used for {0}
reached a nearly exhausted level, which is less than 100 MB free space.

Description: The monitored file system has reached an available space
value, which is less than 100 MB.

Cause: The local file systems have reached a nearly exhausted level, which
is less than 100 MB.

User Action: Detect large files on the local file system by using the 'du -cks *
|sort -rn |head -11' command, and delete or move data to free space.
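
For example, a sketch of locating the largest consumers under the affected mount point (/var is only a placeholder mount point; use the mount point reported in the event):

    # Change to the monitored mount point and list the ten largest entries
    cd /var
    du -cks * | sort -rn | head -11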

local_fs_normal STATE_CHANGE INFO no Message: The local file system with the mount point {1} used for {0}
reached a normal level with more than 1000 MB free space.

Description: The monitored file system has an available space value of
over 1000 MB.

Cause: N/A

User Action: N/A

local_fs_path_not_found STATE_CHANGE INFO no Message: The configured dataStructureDump path {0} does not exist.
Monitoring is skipped.

Description: The configured dataStructureDump path does not exist yet;
therefore, the disk capacity monitoring is skipped.

Cause: N/A

User Action: N/A

local_fs_unknown INFO WARNING no Message: The fill level of local file systems is unknown because of an
unexpected output of the df command. Return Code: {0} Error: {1}.

Description: The df command returned a nonzero return code or an
unexpected output.

Cause: Cannot determine the fill states of the local file systems, which may
be caused by a nonzero return code from the df command or an
unexpected output format.

User Action: Check whether the df command exists on the node, and
whether there are timing issues with the df command or whether it runs
into a timeout.

longwaiters_found STATE_CHANGE ERROR no Message: Detected IBM Storage Scale longwaiter threads.

Description: Longwaiter threads are found in the IBM Storage Scale file
system.

Cause: The mmdiag --deadlock command reports longwaiter threads,
most likely due to a high IO load.

User Action: Check log files and the output of the mmdiag --waiters
command to identify the root cause. This can be also due to a temporary
issue.

longwaiters_warn INFO WARNING no Message: IBM Storage Scale longwaiters monitoring has returned an
unknown result.

Description: The longwaiters check has returned an unknown result.

Cause: The IBM Storage Scale file system longwaiters check cannot be
determined due to a problem.

User Action: Find potential issues for this kind of failure in the logs.

mmfsd_abort_clear STATE_CHANGE INFO no Message: Resolve event for IBM Storage Scale issue signal.

Description: Resolve event for IBM Storage Scale issue signal.

Cause: N/A

User Action: N/A

mmfsd_abort_warn STATE_CHANGE WARNING FTDC upload Message: IBM Storage Scale reported an issue {0}.

Description: The mmfsd daemon process may have terminated
abnormally.

Cause: IBM Storage Scale signaled an issue. The mmfsd daemon process
might have terminated abnormally.

User Action: Check the mmfs.log.latest and mmfs.log.previous files for
crash and restart hints. Check the mmfsd daemon status. Run the
mmhealth event resolve mmfsd_abort_warn command to remove
this warning event from the mmhealth command.

monitor_started INFO INFO no Message: The IBM Storage Scale monitoring service has been started.

Description: The IBM Storage Scale monitoring service has been started
and is actively monitoring the system components.

Cause: N/A

User Action: Use the mmhealth command to query the monitoring status.

no_longwaiters_found STATE_CHANGE INFO no Message: No IBM Storage Scale longwaiters are found.

Description: No longwaiter threads are found in the IBM Storage Scale file
system.

Cause: N/A

User Action: N/A

no_rpc_waiters STATE_CHANGE INFO no Message: No pending RPC messages were found.

Description: No pending RPC messages were found.

Cause: N/A

User Action: N/A

node_call_home INFO_EXTERNAL ERROR service ticket Message: OPAL logs reported a problem: event:{0}, eventId:{1}, myNode:
{2}.

Description: OPAL logs reported a problem via callhomemon.sh, which
requires IBM support attention.

Cause: OPAL logs reported a problem via callhomemon.sh.

User Action: Contact IBM support for further guidance.

node_call_home2 INFO_EXTERNAL ERROR service ticket Message: OPAL logs reported a problem: event:{0}, eventId:{1}, myNode:
{2}.

Description: OPAL logs reported a problem via callhomemon.sh, which
requires IBM support attention.

Cause: OPAL logs reported a problem via callhomemon.sh.

User Action: Contact IBM support for further guidance.

nodeleave_info INFO_EXTERNAL INFO no Message: A CES node left the cluster: Node {0}.

Description: Informational. Shows the name of the node that is
leaving the cluster. This event may be logged on a different node, and not
necessarily on the node that is leaving the cluster.

Cause: A CES node left the cluster. The name of the node that is leaving the
cluster is provided.

User Action: N/A

nodestatechange_info INFO_EXTERNAL INFO no Message: A CES node state change: Node {0} {1} {2} flag.

Description: Informational. Shows the modified node state, such as the
node changed to suspended mode, network down, or others.

Cause: A node state change was detected. Details are shown in the
message.

User Action: N/A

numaMemoryInterleave_not_set TIP INFO no Message: The numaMemoryInterleave parameter is set to no.

Description: The numaMemoryInterleave parameter is set to no. The
numactl tool is not required.

Cause: The mmlsconfig command indicates that numaMemoryInterleave
is disabled.

User Action: N/A

numactl_installed TIP INFO no Message: The numactl tool is installed.

Description: To use the mmchconfig numaMemoryInterleave
parameter, the numactl tool is required. Detected numactl is installed.

Cause: The required '/usr/bin/numactl' command is installed correctly.

User Action: N/A

numactl_not_installed TIP TIP no Message: The numactl tool is not found, but needs to be installed.

Description: When the configuration attribute 'numaMemoryInterleave' is
enabled, the mmfsd daemon is supposed to allocate memory from CPU
bound NUMA nodes. However, when the numactl tool is missing, mmfsd
allocates memory only from a single NUMA memory region. This action
might impact the performance and lead to memory allocation issues even
when other NUMA regions still have plenty of memory left.

Cause: The mmlsconfig command indicates that numaMemoryInterleave
is enabled, but the required '/usr/bin/numactl' command is missing.

User Action: Install the required numactl tool. For example, run the 'yum
install numactl' command on RHEL. In case the numactl is not available
for your operating system, disable numaMemoryInterleave setting on this
node.
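
For example, a minimal remediation sketch (the package manager call assumes RHEL, and the node name is a placeholder; use the equivalent for your distribution and node):

    # Install the numactl tool
    yum install numactl

    # Verify that the binary the monitor looks for is present
    ls -l /usr/bin/numactl

    # Alternatively, disable the setting on this node if numactl is not available
    mmchconfig numaMemoryInterleave=no -N thisNode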

operating_system_ok STATE_CHANGE INFO no Message: A supported operating system was detected.

Description: A supported operating system was detected.

Cause: N/A

User Action: N/A

out_of_memory STATE_CHANGE WARNING no Message: Detected out-of-memory killer conditions in system log.

Description: In an out-of-memory condition, the OOM killer terminates the
process with the largest memory utilization score. This may affect the IBM
Storage Scale processes and cause subsequent issues.

Cause: The dmesg command returned log entries, which are written by the
OOM killer.

User Action: Check the memory usage on the node. Identify the reason
for the out-of-memory condition and check the system log to find out
which processes have been killed by OOM killer. You might need to recover
these processes manually or reboot the system to get to a clean state. Run
the mmhealth event resolve out_of_memory command once you
recovered the system to remove this warning event from the mmhealth
command.
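
For example, an investigation sketch following the User Action above (the log checks are typical Linux defaults and may differ on your system):

    # Look for OOM killer entries in the kernel log
    dmesg | grep -i "out of memory"

    # Check the current memory usage on the node
    free -m

    # After recovering the affected processes, clear the warning
    mmhealth event resolve out_of_memory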

out_of_memory_ok STATE_CHANGE INFO no Message: Out-of-memory issue is resolved.

Description: Resolve event for the out-of-memory degraded state.

Cause: N/A

User Action: N/A

passthrough_query_hang STATE_CHANGE ERROR no Message: An SCSI pass-through query request hang has been detected on
disk {0} affecting file system {1}. Reason: {2}.

Description: SCSI pass-through query requests to the storage system have
been pending for more than the configured threshold time, which is
'passthrough_query_hang_detected'. When panicOnIOHang is enabled, this
can force a kernel panic.

Cause: N/A

User Action: Check the underlying storage system and reboot the node to
resolve the current hang condition.

quorum_down STATE_CHANGE ERROR no Message: The node is not able to reach enough quorum nodes/disks to
work properly.

Description: Reasons can be network or hardware issues, or a shutdown of
the cluster service. The event does not necessarily indicate an issue with
the cluster quorum state.

Cause: The node is trying to form a quorum with the other available nodes.
The cluster service may not be running or the communication with other
nodes is faulty.

User Action: Check whether the cluster service is running and other
quorum nodes can be reached over the network. Check the local firewall
settings.

quorum_even_nodes_no_tiebreaker STATE_CHANGE TIP no Message: No tiebreaker disk is defined with an even number of quorum
nodes.

Description: No tiebreaker disk is defined.

Cause: You have not configured any tiebreaker disk.

User Action: Add 1 or 3 tiebreaker disks.

quorum_ok STATE_CHANGE INFO no Message: The quorum configuration corresponds to the best practices.

Description: The quorum configuration is as recommended.

Cause: N/A

User Action: N/A

quorum_too_little_nodes TIP TIP no Message: An odd number of at least 3 quorum nodes is recommended.

Description: Only one quorum node is defined.

Cause: 3, 5, or 7 quorum nodes are recommended. This is not configured.

User Action: Add quorum nodes.

quorum_two_tiebreaker_count STATE_CHANGE TIP no Message: Change number of tiebreaker disks to an odd number.

Description: Number of tiebreaker disks is two.

Cause: The number of tiebreaker disks is not as recommended.

User Action: Use an odd number of tiebreaker disks.

quorum_up STATE_CHANGE INFO no Message: Quorum is achieved.

Description: The monitor has detected a valid quorum.

Cause: N/A

User Action: N/A

quorum_warn INFO WARNING no Message: The IBM Storage Scale quorum monitor cannot be executed. This
can be a timeout issue.

Description: The check of the quorum state returned an unknown result.
This may be due to a temporary issue, like a timeout during the check
procedure.

Cause: The quorum state cannot be determined due to a problem.

User Action: Find potential issues for this kind of failure in
the /var/adm/ras/mmsysmonitor.log file.

quorumloss INFO_EXTERNAL WARNING no Message: The cluster has detected a quorum loss.

Description: The cluster may get into an inconsistent or split-brain state.
The reason for this issue can be network or hardware issues, or the removal
of quorum nodes from the cluster. The event may not necessarily be logged
on the node that caused the quorum loss.

Cause: The number of required quorum nodes does not match the
minimum requirements. This can be an expected situation.

User Action: Ensure the required cluster quorum nodes are up and
running.

quorumreached_detected INFO_EXTERNAL INFO no Message: Quorum reached event.

Description: The cluster has reached quorum.

Cause: The cluster has reached quorum.

User Action: N/A

reconnect_aborted STATE_CHANGE INFO no Message: Reconnect to {0} is aborted.

Description: Reconnect failed, which may be due to a network error. Check
for a network error.

Cause: N/A

User Action: N/A

reconnect_done STATE_CHANGE INFO no Message: Reconnected to {0}.

Description: The TCP connection is reconnected.

Cause: N/A

User Action: N/A

reconnect_failed INFO ERROR no Message: Reconnect to {0} has failed.

Description: Reconnect failed, which may be due to a network error.

Cause: The network is in bad state.

User Action: Check whether the network is good.

reconnect_start STATE_CHANGE WARNING no Message: Attempting to reconnect to {0}.

Description: The TCP connection is in an abnormal state and tries to
reconnect.

Cause: The TCP connection is in an abnormal state.

User Action: Check whether the network is good.

rpc_waiters STATE_CHANGE WARNING no Message: Pending RPC messages were found for the nodes: {0}.

Description: Nodes took more time to respond to pending RPC messages.

Cause: The mmdiag --network command returned
pending RPC messages that took more time than the
mmhealthPendingRPCWarningThreshold value.

User Action: If nodes do not respond to pending RPC messages, you might
need to expel the nodes by using the mmexpelnode -N <ip> command.
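
For example, a sketch of investigating and, if necessary, expelling an unresponsive node (192.0.2.10 is a placeholder address):

    # Show pending RPC messages and the nodes they are waiting on
    mmdiag --network

    # Expel a node that does not respond to pending RPC messages
    mmexpelnode -N 192.0.2.10

    # Let the node join the cluster again once it is recovered
    mmexpelnode -r -N 192.0.2.10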

rpc_waiters_expel INFO WARNING no Message: A request to expel the node {id} was sent to the cluster node {1}
because of pending RPC messages.

Description: A node is expelled automatically because of pending RPC
messages on the node.

Cause: The mmdiag --network command returned a
pending RPC message that took more time than the
mmhealthPendingRPCExpelThreshold value.

User Action: Verify the logs in the expelled node to find the reason for
the pending RPC messages. For example, node resources, such as memory,
might be exhausted. Use the mmexpelnode -r -N <ip> command to
allow the node to join the cluster again.

scale_ptf_update_available TIP TIP no Message: For the currently installed IBM Storage Scale packages, the PTF
update {0} PTF {1} is available.

Description: For the currently installed IBM Storage Scale packages, a PTF
update is available.

Cause: PTF updates are available for the currently installed 'gpfs.base'
package.

User Action: Visit IBM Fix Central to download and install the updates.

scale_up_to_date STATE_CHANGE INFO no Message: The last software update check showed no available updates.

Description: The last software update check showed no available updates.

Cause: N/A

User Action: N/A

scale_updatecheck_disabled STATE_CHANGE INFO no Message: The IBM Storage Scale software update check feature is
disabled.

Description: The IBM Storage Scale software update check feature is
disabled. Enable call home by using the mmcallhome capability
enable command, and set 'monitors_enabled = true' in the
mmsysmonitor.conf file.

Cause: N/A

User Action: N/A

shared_root_acl_bad STATE_CHANGE WARNING no Message: Shared root ACLs not default.

Description: The CES shared root file system's ACLs are different from the
default in CCR. If these ACLs prohibit read access for rpc.statd, then NFS
does not work correctly.

Cause: The CES framework detects that the ACLs of the CES shared root
file system are different from the default in CCR.

User Action: Verify that the user assigned to rpc.statd (such as, rpcuser)
has read access to the CES shared root file system.

shared_root_acl_good STATE_CHANGE INFO no Message: Default shared root ACLs.

Description: The CES shared root file system's ACLs are default. These
ACLs give read access to rpc.statd when default GPFS user settings are
used.

Cause: N/A

User Action: N/A

shared_root_bad STATE_CHANGE WARNING no Message: Shared root is unavailable.

Description: The CES shared root file system is bad or not available. This
file system is required to run the cluster because it stores cluster wide
information. This problem triggers a failover.

Cause: The CES framework detects the CES shared root file system to be
unavailable on the node.

User Action: Check whether the CES shared root file system and other
expected IBM Storage Scale file systems are mounted properly.

shared_root_ok STATE_CHANGE INFO no Message: Shared root is available.

Description: The CES shared root file system is available. This file system is
required to run the cluster because it stores cluster wide information.

Cause: N/A

User Action: N/A

test_call_home INFO_EXTERNAL ERROR service ticket Message: A test call home ticket is created.

Description: A test call home ticket is created.

Cause: ESS tooling triggered a test call home to verify that tickets can be
created from this system.

User Action: No action required if a service ticket is successfully created.
Otherwise, connectivity, entitlement, and so on must be checked.

tiebreaker_disks_ok TIP INFO no Message: The number of tiebreaker disks is as recommended.

Description: The number of tiebreaker disks is correct.

Cause: The number of tiebreaker disks is as recommended.

User Action: N/A

total_memory_ok TIP INFO no Message: The total memory configuration is OK.

Description: The total memory configuration is within the recommended
range for CES nodes running protocol services.

Cause: The total memory configuration is OK.

User Action: N/A

total_memory_small TIP TIP no Message: The total memory is less than the recommended value.

Description: The total memory is less than the recommended value when
CES protocol services are enabled.

Cause: The total memory is less than the recommendation for the currently
enabled services, which is 128GB if SMB is enabled, or 64 GB for each, NFS
and object.

User Action: For more information on CES memory recommendations,
see the 'Planning for protocols' in the Concepts, Planning, and Installation
Guide.

unexpected_operating_system TIP TIP no Message: An unexpected operating system was detected.

Description: An unexpected operating system was detected. A 'clone' OS
may affect the support you get from IBM.

Cause: An unexpected OS was detected.

User Action: A list of supported operating systems and versions can be
seen in the documentation.

waitfor_verbsport INFO_EXTERNAL INFO no Message: Waiting for verbs ports to become active.

Description: verbsPortsWaitTimeout is enabled; waiting for verbs ports
to become active.

Cause: N/A

User Action: N/A

waitfor_verbsport_done INFO_EXTERNAL INFO no Message: Waiting for verbs ports is done {0}.

Description: Waiting for verbs ports is done.

Cause: N/A

User Action: N/A

waitfor_verbsport_failed INFO_EXTERNAL ERROR no Message: Failed to start up because some IB ports or Ethernet devices,
which are in verbsPorts, are inactive: {0}.

Description: verbsRdmaFailBackTCPIfNotAvailable is disabled and some
IB ports or Ethernet devices, which are in verbsPorts, are inactive.

Cause: IB ports or Ethernet devices are inactive.

User Action: Check IB ports and Ethernet devices, which are listed in
verbsPorts configuration. Increase verbsPortsWaitTimeout or enable the
verbsRdmaFailBackTCPIfNotAvailable configuration.

waitfor_verbsport_ibstat_failed INFO_EXTERNAL ERROR no Message: verbsRdmaFailBackTCPIfNotAvailable is disabled but
'/usr/sbin/ibstat' is not found.

Description: verbsRdmaFailBackTCPIfNotAvailable is disabled but
'/usr/sbin/ibstat' is not found.

Cause: verbsRdmaFailBackTCPIfNotAvailable is disabled but
'/usr/sbin/ibstat' is not found.

User Action: Install '/usr/sbin/ibstat' or enable
verbsRdmaFailBackTCPIfNotAvailable configuration.

GUI events
The following table lists the events that are created for the GUI component.
Table 86. Events for the GUI component

Event | Event Type | Severity | Call Home | Details

bmc_connection_error STATE_CHANGE ERROR no Message: Unable to connect to BMC of POWER server {0} because an
error occurred when running the '/opt/ibm/ess/tools/bin/esshwinvmon.py
-t check -n {1}' command.

Description: The GUI checks the connection to the BMC of the POWER
server.

Cause: The GUI cannot query the BMC of the POWER server because of an
error that occurred in the 'esshwinvmon.py' script.

User Action: Run the '/opt/ibm/ess/tools/bin/esshwinvmon.py -t check -n
[node_name]' command to check the error.

bmc_connection_failed STATE_CHANGE ERROR no Message: Unable to connect to BMC of POWER server {0}.

Description: The GUI checks the connection to the BMC of the POWER
server.

Cause: The GUI cannot connect to the BMC of the POWER server.

User Action: Check whether the BMC IPs and passwords are correctly
defined in the '/opt/ibm/ess/tools/conf/hosts.yml' configuration file on the
GUI node. Run the '/opt/ibm/ess/tools/bin/esshwinvmon.py -t check -n
[node_name]' command to check connection to BMC.

bmc_connection_ok STATE_CHANGE INFO no Message: The connection to the BMC of POWER server {0} is OK.

Description: The GUI checks the connection to the BMC of the POWER
server.

Cause: The GUI can communicate to the BMC of the POWER server
successfully.

User Action: N/A

bmc_connection_unconfigured STATE_CHANGE ERROR no Message: Unable to query health state of POWER server {0} from the BMC.
The '/opt/ibm/ess/tools/conf/hosts.yml' configuration file does not contain
a section for node {1}.

Description: The GUI checks the connection to the BMC of the POWER
server.

Cause: The GUI cannot connect to the BMC of the POWER server because
of a misconfiguration.

User Action: Add a section for the specified node to the '/opt/ibm/ess/
tools/conf/hosts.yml' configuration file on the GUI node. Run the
'/opt/ibm/ess/tools/bin/esshwinvmon.py -t check -n [node_name]'
command to check connection to BMC.

gui_cluster_down STATE_CHANGE ERROR no Message: The GUI detected that the cluster is down.

Description: The GUI checks the cluster state.

Cause: The GUI calculated that an insufficient number of quorum nodes is
up and running.

User Action: Check for the reason that led the cluster to lost quorum.

gui_cluster_state_unknown STATE_CHANGE WARNING no Message: The GUI cannot determine the cluster state.

Description: The GUI checks the cluster state.

Cause: The GUI cannot determine whether enough quorum nodes are up
and running.

User Action: N/A

gui_cluster_up STATE_CHANGE INFO no Message: The GUI detected that the cluster is up and running.

Description: The GUI checks the cluster state.

Cause: The GUI calculated that a sufficient number of quorum nodes is up
and running.

User Action: N/A

gui_config_cluster_id_mismatch STATE_CHANGE ERROR no Message: The cluster ID of the current cluster '{0}', and the cluster ID in the
database do not match ('{1}'). It seems that the cluster was re-created.

Description: When a cluster is deleted and created again, the cluster ID
changes, but the GUI's database still references the old cluster ID.

Cause: N/A

User Action: Clear the GUI database of the old cluster information by
dropping all of its tables with the psql postgres postgres -c 'drop schema
fscc cascade' command. Then, restart the GUI by using the systemctl
restart gpfsgui command.
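
For example, a minimal sketch of the cleanup described above (run on the GUI node; this permanently drops the GUI's own database schema, not any file system data):

    # Drop the GUI database schema that still references the old cluster ID
    psql postgres postgres -c 'drop schema fscc cascade'

    # Restart the GUI service so that it re-creates its database content
    systemctl restart gpfsgui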

gui_config_cluster_id_ok STATE_CHANGE INFO no Message: The cluster ID of the current cluster '{0}' matches the cluster ID
in the database.

Description: No problems regarding the current configuration of
the GUI and the cluster were found.

Cause: N/A

User Action: N/A

gui_config_command_audit_off_cluster STATE_CHANGE WARNING no Message: Command Audit is turned off at the cluster level.

Description: Command Audit is turned off at the cluster level. This
configuration leads to lags in the refresh of data displayed in the GUI.

Cause: Command Audit is turned off at the cluster level.

User Action: Change the cluster configuration option 'commandAudit'
to 'on' by using the mmchconfig commandAudit=on command, or
'syslogonly' by using the mmchconfig commandAudit=syslogonly
command. This way the GUI refreshes the data that it displays
automatically when IBM Storage Scale commands are run by using the CLI
on other nodes in the cluster.
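
For example, a minimal sketch of turning command auditing back on (choose 'on' or 'syslogonly' depending on your logging policy):

    # Enable command auditing cluster-wide
    mmchconfig commandAudit=on

    # Alternative: write audit records to syslog only
    mmchconfig commandAudit=syslogonly

    # Confirm the active setting
    mmlsconfig commandAudit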

gui_config_command_audit_off_nodes STATE_CHANGE WARNING no Message: Command Audit is turned off on the following nodes: {1}.
Description: Command Audit is turned off on some nodes. This
configuration leads to lags in the refresh of data that is displayed in the
GUI.

Cause: Command Audit is turned off on some nodes.

User Action: Change the cluster configuration option 'commandAudit' to 'on' by using the mmchconfig commandAudit=on -N [node name] command, or 'syslogonly' by using the mmchconfig commandAudit=syslogonly -N [node name] command for the affected nodes. This way the GUI refreshes the data that it displays automatically when IBM Storage Scale commands are run by using the CLI on other nodes in the cluster.

gui_config_command_audit_ok STATE_CHANGE INFO no Message: Command Audit is turned on at the cluster level.
Description: Command Audit is turned on at the cluster level. This way
the GUI refreshes the data that it displays automatically when IBM Storage
Scale commands are run by using the CLI on other nodes in the cluster.

Cause: N/A

User Action: N/A

gui_config_sudoers_error STATE_CHANGE ERROR no Message: There is a problem with the '/etc/sudoers' configuration. The
secure_path of the IBM Storage Scale management user 'scalemgmt' is not
correct. Current value: {0} / Expected value: {1}.

Description: There is a problem with the '/etc/sudoers' configuration.

Cause: N/A

User Action: Ensure that the '#includedir /etc/sudoers.d' directive is set in '/etc/sudoers', so that the sudoers configuration drop-in file for the IBM Storage Scale management user 'scalemgmt' (which the GUI process uses) is loaded from '/etc/sudoers.d/scalemgmt_sudoers'. Also, make sure that the #includedir directive is the last line in the '/etc/sudoers' configuration file.

gui_config_sudoers_ok STATE_CHANGE INFO no Message: The '/etc/sudoers' configuration is correct.

Description: The '/etc/sudoers' configuration is correct.

Cause: N/A

User Action: N/A

gui_database_cleared_cluster_change INFO WARNING no Message: The cluster ID has changed.
Description: The cluster ID stored in the database no longer matches the
installed cluster ID.

Cause: A new GPFS cluster has been installed.

User Action: Events that are marked as read are now displayed as unread.
Mark all notices as read if they are no longer valid after the cluster change.

gui_database_cleared_downgrade INFO WARNING no Message: The GUI version read from the database ({0}) is later than the GUI code version ({1}).

Description: The GUI might have been moved to an older version.

Cause: The GUI version that is stored in the database is greater than the
GUI code version.

User Action: Events that are marked as read are now displayed as unread.
Mark all notices as read if they are no longer valid after the GUI is moved to
an older version.


gui_database_dropped INFO_EXTERNAL WARNING no Message: The database version ({0}) mismatches the PostgreSQL version ({1}).

Description: The PostgreSQL internal storage format might have changed following a major version upgrade.

Cause: There is a mismatch between the database version and the PostgreSQL server version.

User Action: Events that are marked as read are now displayed as unread.
Mark all notices as read if they are no longer valid after the upgrade.

gui_db_ok STATE_CHANGE INFO no Message: The GUI reported correct connection to postgres database in the
cluster.

Description: The connection to postgres database works properly.

Cause: The GUI reported correct connection to postgres database.

User Action: N/A.

gui_db_warn STATE_CHANGE WARNING no Message: The GUI reported incorrect connection to postgres database.

Description: The connection to postgres database cannot be established.

Cause: The GUI reported incorrect connection to postgres database.

User Action: Check if postgres container works properly in the GUI pod.

gui_down STATE_CHANGE ERROR no Message: The GUI service should be {0}, but it is {1}. If there are no
other GUI nodes up and running, then no snapshots are created and email
notifications are not sent anymore.

Description: The GUI service is down.

Cause: The GUI service is not running on this node, although it has the
'GUI_MGMT_SERVER_NODE' node class.

User Action: Restart the GUI service or change the node class for this
node.

gui_email_server_reachable STATE_CHANGE INFO no Message: The email server {0} is reachable.

Description: The specified email server is reachable.

Cause: N/A

User Action: N/A

gui_email_server_unreachable STATE_CHANGE ERROR no Message: The email server {0} is unreachable {1}.
Description: The specified email server does not respond to any messages.

Cause: The configuration or firewall setting is wrong.

User Action: Check the email server configuration (hostname, port, username, and password). Ensure that the email server is up and running and no firewall is blocking the access.

gui_external_authentication_failed INFO ERROR no Message: The GUI cannot connect to the external LDAP or AD server: {0}.
Description: The GUI cannot connect to one or more of the specified LDAP
or AD servers.

Cause: The LDAP or AD server is not reachable because it is not running or due to a network issue.

User Action: Verify that the configured LDAP or AD servers are up and
running and reachable from the GUI node.

gui_login_attempt_failed INFO_EXTERNAL WARNING no Message: A login attempt failed for the user {0} from the source IP address
{1}.

Description: A login attempt for the specified user failed.

Cause: A wrong password was entered.

User Action: N/A


gui_mount_allowed_on_gui_node STATE_CHANGE INFO no Message: Mount operation is allowed for all file systems on the GUI node.
Description: Mount operation is allowed for all file systems on the GUI
node.

Cause: Mount operation is allowed for all file systems on the GUI node.

User Action: N/A

gui_mount_prevented_on_gui_node STATE_CHANGE WARNING no Message: Mount operation is prevented for {1} file systems on the GUI node {0}.

Description: Mount operation for specific file systems is prevented on the GUI node.

Cause: Mount operation is prevented for specific file systems on the GUI
node.

User Action: Run the fix procedure or go to the file system panel, and allow
mount operation for mentioned file systems on the GUI node.

gui_node_update_failure STATE_CHANGE ERROR no Message: GUI node class cannot be updated.

Description: The node class update failed.

Cause: The node class update failed.

User Action: N/A

gui_node_update_successful STATE_CHANGE INFO no Message: GUI node class got updated successfully.

Description: The node class was updated successfully.

Cause: The node class was updated successfully.

User Action: N/A

gui_out_of_memory INFO ERROR no Message: The GUI reported an internal out-of-memory state. Restart the
GUI.

Description: A GUI internal process ran into an out-of-memory state, which might impact GUI functions partially or fully.

Cause: The Java virtual machine of the GUI reported an internal out-of-
memory state.

User Action: Restart the GUI or recreate the liberty container.

gui_pmcollector_connection_failed STATE_CHANGE ERROR no Message: The GUI cannot connect to the pmcollector that is running on {0} using port {1}.

Description: The GUI checks the connection to the pmcollector.

Cause: The GUI cannot connect to the pmcollector.

User Action: Check whether the pmcollector service is running and verify
the firewall or network settings. If the problem still persists, then check
whether the GUI node is specified for the 'colCandidates' attribute in the
mmperfmon config show command.

gui_pmcollector_connection_ok STATE_CHANGE INFO no Message: The GUI can connect to the pmcollector that is running on {0} using port {1}.

Description: The GUI checks the connection to the pmcollector.

Cause: The GUI can connect to the pmcollector.

User Action: N/A


gui_pmsensors_connection_failed STATE_CHANGE ERROR no Message: The performance monitoring sensor service 'pmsensors' on node {0} is not sending any data.

Description: The GUI checks whether data can be retrieved from the
pmcollector service for this node.

Cause: The performance monitoring sensor service 'pmsensors' is not sending any data. The service might be down or the time of the node is more than 15 minutes away from the time on the node hosting the performance monitoring collector service 'pmcollector'. The connection probe uses data from the CPU sensor to check whether nodes send data. Also, check whether the CPU sensor was disabled. If the CPU sensor was disabled, then enable the sensor with a period of 10 or a smaller value.

User Action: Check by using the systemctl status pmsensors command. If the pmsensors service is 'inactive', then run the systemctl start pmsensors command. Check whether the CPU sensor was disabled. If it was disabled, then run the mmperfmon config update CPU.period=1 CPU.restrict=all command.
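
Example: a minimal sketch of the checks named in the user action, using only the commands quoted in this entry; run it on the node reported by the event.

  # Check whether the sensor service is running
  systemctl status pmsensors
  # Start it if it is reported as inactive
  systemctl start pmsensors
  # Re-enable the CPU sensor if it was disabled (command quoted in the user action above)
  mmperfmon config update CPU.period=1 CPU.restrict=all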

gui_pmsensors_connection_ok STATE_CHANGE INFO no Message: The state of performance monitoring sensor service 'pmsensor' on node {0} is OK.

Description: The GUI checks whether data can be retrieved from the
pmcollector service for this node.

Cause: The state of performance monitoring sensor service 'pmsensor' is OK and it is sending data.

User Action: N/A

gui_quorum_ok STATE_CHANGE INFO no Message: The GUI reported correct quorum in the cluster.

Description: Quorum is reached in the cluster.

Cause: The GUI reported correct quorum in the cluster.

User Action: N/A.

gui_quorum_warn STATE_CHANGE WARNING no Message: The GUI reported quorum loss in the cluster.

Description: The GUI reported quorum loss in the cluster.

Cause: The GUI reported quorum loss in the cluster.

User Action: Check if quorum is correct in your cluster.

gui_reachable_node STATE_CHANGE INFO no Message: The GUI can reach the node {0}.

Description: The GUI checks the reachability of all nodes.

Cause: The specified node can be reached by the GUI node.

User Action: N/A

gui_refresh_task_failed STATE_CHANGE WARNING no Message: The following GUI refresh task(s) failed: {0}.

Description: One or more GUI refresh tasks failed, which means that data
in the GUI is outdated.

Cause: There can be several reasons.

User Action: Check whether there is additional information available by using the '/usr/lpp/mmfs/gui/cli/lstasklog [taskname]' command. Then, run the specified task manually on the CLI by using the '/usr/lpp/mmfs/gui/cli/runtask [taskname] --debug' command. Check the GUI logs under '/var/log/cnlog/mgtsrv' and contact IBM Support if this error persists or occurs more often.
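
Example: a minimal sketch of the debugging flow from the user action; TASKNAME is a placeholder for the task name reported by the event.

  # Show the log of the failed refresh task
  /usr/lpp/mmfs/gui/cli/lstasklog TASKNAME
  # Re-run the task manually with debug output
  /usr/lpp/mmfs/gui/cli/runtask TASKNAME --debug
  # GUI logs for further analysis
  ls /var/log/cnlog/mgtsrv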

gui_refresh_task_successful STATE_CHANGE INFO no Message: All GUI refresh tasks are running fine.

Description: All GUI refresh tasks are running fine.

Cause: N/A

User Action: N/A


gui_response_ok STATE_CHANGE INFO no Message: The GUI is responsive to the test query.

Description: The GUI is responsive to the test query.

Cause: The GUI responded to the test query '/usr/lpp/mmfs/gui/cli/debug platform'.

User Action: N/A.

gui_response_warn STATE_CHANGE WARNING no Message: The GUI is unresponsive to the test query.

Description: The GUI is unresponsive to the test query.

Cause: The GUI did not respond with the expected data for the test query
(debug platform).

User Action: Restart the GUI by using the systemctl restart gpfsgui command or wait for the liberty container to be re-created.

gui_snap_create_failed_fs INFO ERROR no Message: A snapshot creation invoked by rule {1} failed on file system {0}.

Description: The snapshot was not created according to the specified rule.

Cause: A snapshot creation invoked by a rule fails.

User Action: Try to create the snapshot again manually.

gui_snap_create_failed_fset INFO ERROR no Message: A snapshot creation that is invoked by rule {1} failed on file
system {2}, fileset {0}.

Description: The snapshot was not created according to the specified rule.

Cause: A snapshot creation that is invoked by a rule fails.

User Action: Try to create the snapshot again manually.

gui_snap_delete_failed_fs INFO ERROR no Message: A snapshot deletion that is invoked by rule {1} failed on file
system {0}.

Description: The snapshot was not deleted according to the specified rule.

Cause: A snapshot deletion that is invoked by a rule fails.

User Action: Try to manually delete the snapshot.

gui_snap_delete_failed_fset INFO ERROR no Message: A snapshot deletion that is invoked by rule {1} failed on file
system {2}, fileset {0}.

Description: The snapshot was not deleted according to the specified rule.

Cause: A snapshot deletion that is invoked by a rule fails.

User Action: Try to manually delete the snapshot.

gui_snap_rule_ops_exceeded INFO WARNING no Message: The number of pending operations exceeds {1} operations for
rule {2}.

Description: The number of pending operations for a rule exceeds a specified value.

Cause: The number of pending operations for a rule exceeds a specified value.

User Action: N/A

gui_snap_running INFO WARNING no Message: Operations for rule {1} are still running at the start of the next
management of rule {1}.

Description: Operations for a rule are still running at the start of the next
management of that rule.

Cause: Operations for a rule are still running.

User Action: N/A


gui_snap_time_limit_exceeded_fs INFO WARNING no Message: A snapshot operation exceeds {1} minutes for rule {2} on file system {0}.

Description: The snapshot operation that is resulting from the rule is exceeding the established time limit.

Cause: A snapshot operation exceeds a specified number of minutes.

User Action: N/A

gui_snap_time_limit_exceeded_fset INFO WARNING no Message: A snapshot operation exceeds {1} minutes for rule {2} on file system {3}, fileset {0}.

Description: The snapshot operation that is resulting from the rule is exceeding the established time limit.

Cause: A snapshot operation exceeds a specified number of minutes.

User Action: N/A

gui_snap_total_ops_exceeded INFO WARNING no Message: The total number of pending operations exceeds {1} operations.

Description: The total number of pending operations exceeds a specified value.

Cause: The total number of pending operations exceeds a specified value.

User Action: N/A

gui_ssh_ok STATE_CHANGE INFO no Message: The GUI reported correct ssh connection in the cluster.

Description: The ssh connection works properly.

Cause: The GUI reported correct ssh connection.

User Action: N/A.

gui_ssh_warn STATE_CHANGE WARNING no Message: The GUI reported incorrect ssh connection.

Description: The connection over ssh cannot be established.

Cause: The GUI reported incorrect ssh connection.

User Action: Check if ssh connection works properly in your cluster.

gui_ssl_certificate_expired STATE_CHANGE ERROR no Message: The SSL certificate that is used by the GUI expired. Expiration
date was {0}.

Description: SSL certificate expired.

Cause: The SSL certificate that is used by the GUI expired.

User Action: On the CLI, run the '/usr/lpp/mmfs/gui/cli/rmhttpskeystore' command to return to the default certificate. On the GUI, go to 'Service' and select 'GUI' to create or upload a new certificate.

gui_ssl_certificate_is_about_to_expire STATE_CHANGE WARNING no Message: The SSL certificate that is used by the GUI is about to expire. Expiration date is {0}.

Description: SSL certificate is about to expire.

Cause: The SSL certificate that is used by the GUI is about to expire.

User Action: Go to the Service panel and select 'GUI'. On the 'Nodes' tab,
select an option to create a new certificate request, self-signed certificate,
or upload your own certificate.

gui_ssl_certificate_ok STATE_CHANGE INFO no Message: The SSL certificate that is used by the GUI is valid. Expiration
date is {0}.

Description: GUI SSL certificates are valid.

Cause: N/A

User Action: N/A


gui_unreachable_node STATE_CHANGE ERROR no Message: The GUI cannot reach the node {0}.

Description: The GUI checks the reachability of all nodes.

Cause: The specified node cannot be reached by the GUI node.

User Action: Check your firewall or network setup, and whether the
specified node is up and running.

gui_up STATE_CHANGE INFO no Message: The status of the GUI service is {0} as expected.

Description: The GUI service is running.

Cause: The GUI service is running as expected.

User Action: N/A

gui_warn INFO INFO no Message: The GUI service returned an unknown result.

Description: The GUI service returned an unknown result.

Cause: The service or systemctl command returned unknown results about the gpfsgui service.

User Action: Check whether the gpfsgui service is in the expected status in the service or systemctl command output. Also, check whether the gpfsgui service is missing although the node has the 'GUI_MGMT_SERVER_NODE' node class. For more information, see the IBM Documentation. Otherwise, monitor the issue if this warning appears more often.

host_disk_filled STATE_CHANGE WARNING no Message: A local file system on node {0} reached a warning level {1}.

Description: The GUI checks the fill level of the local file systems.

Cause: The local file systems reached a warning level.

User Action: Delete data on the local disk.

host_disk_full STATE_CHANGE ERROR no Message: A local file system on node {0} reached a nearly exhausted level
{1}.

Description: The GUI checks the fill level of the local file systems.

Cause: The local file systems reached a nearly exhausted level.

User Action: Delete data on the local disk.
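
Example: a generic way (not specific to IBM Storage Scale) to find what is filling the local file system before deleting data; the path /var is only a placeholder for the affected mount point.

  # Show the fill level of the local file systems
  df -h
  # List the largest directories under the affected mount point
  du -xh --max-depth=1 /var | sort -h | tail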

host_disk_normal STATE_CHANGE INFO no Message: The local file systems on node {0} reached a normal level.

Description: The GUI checks the fill level of the local file systems.

Cause: The fill level of the local file systems is OK.

User Action: N/A

host_disk_unknown STATE_CHANGE WARNING no Message: The fill level of local file systems on node {0} is unknown.

Description: The GUI checks the fill level of the local file systems.

Cause: Cannot determine fill state of the local file systems.

User Action: N/A

sudo_admin_not_configured STATE_CHANGE ERROR no Message: Sudo wrappers are enabled on the cluster '{0}', but the GUI is not
configured to use Sudo wrappers.

Description: Sudo wrappers are enabled on the cluster, but the value for
GPFS_ADMIN in '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' was either not
set or is still set to root. The value of 'GPFS_ADMIN' is set to the username
for which sudo wrappers were configured on the cluster.

Cause: N/A

User Action: Ensure that sudo wrappers were correctly configured for a
user that is available on the GUI node and all other nodes of the cluster.
This username is set as the value of the 'GPFS_ADMIN' option in the
'/usr/lpp/mmfs/gui/conf/gpfsgui.properties' file. After that, restart the GUI by using the systemctl restart gpfsgui command.
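
Example: a minimal sketch of the check and restart described above; the property file path and service name are taken from this entry, and editing the property file itself is left to the administrator.

  # Inspect the current GPFS_ADMIN setting
  grep GPFS_ADMIN /usr/lpp/mmfs/gui/conf/gpfsgui.properties
  # After setting GPFS_ADMIN to the sudo wrapper user, restart the GUI
  systemctl restart gpfsgui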


sudo_admin_not_exist STATE_CHANGE ERROR no Message: Sudo wrappers are enabled on the cluster '{0}', but there
is a misconfiguration that is regarding the user '{1}' that was set as
'GPFS_ADMIN' in the GUI properties file.

Description: Sudo wrappers are enabled on the cluster, but the username
that was set as GPFS_ADMIN in the GUI properties file at '/usr/lpp/
mmfs/gui/conf/gpfsgui.properties' does not exist on the GUI node.

Cause: N/A

User Action: Ensure that sudo wrappers were correctly configured for
a user that is available on the GUI node and all other nodes of the
cluster. This username is set as the value of the 'GPFS_ADMIN' option in
the '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' file. After that, restart the GUI by using the systemctl restart gpfsgui command.

sudo_admin_set_but_disabled STATE_CHANGE WARNING no Message: Sudo wrappers are not enabled on the cluster '{0}', but 'GPFS_ADMIN' was set to a non-root user.

Description: Sudo wrappers are not enabled on the cluster, but the value
for 'GPFS_ADMIN' in the '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' was
set to a non-root user. The value of 'GPFS_ADMIN' is set to 'root' when
sudo wrappers are not enabled on the cluster.

Cause: N/A

User Action: Set 'GPFS_ADMIN' in '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' to 'root'. After that, restart the GUI by using the systemctl restart gpfsgui command.

sudo_connect_error STATE_CHANGE ERROR no Message: Sudo wrappers are enabled on the cluster '{0}', but the GUI
cannot connect to other nodes with the username '{1}' that was defined as
'GPFS_ADMIN' in the GUI properties file.

Description: When sudo wrappers are configured and enabled on a cluster,


the GUI does not run commands as root, but as the user for which sudo
wrappers were configured. This user is set as 'GPFS_ADMIN' in the GUI
properties file at '/usr/lpp/mmfs/gui/conf/gpfsgui.properties'.

Cause: N/A

User Action: Ensure that sudo wrappers were correctly configured for
a user that is available on the GUI node and all other nodes of the
cluster. This username is set as the value of the 'GPFS_ADMIN' option in
the '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' file. After that, restart the GUI by using the systemctl restart gpfsgui command.

sudo_ok STATE_CHANGE INFO no Message: Sudo wrappers were enabled on the cluster and the GUI
configuration for the cluster '{0}' is correct.

Description: No problems were found with the current GUI and cluster
configurations.

Cause: N/A

User Action: N/A

time_in_sync STATE_CHANGE INFO no Message: The time on node {0} is in sync with the cluster median.

Description: The GUI checks the time on all nodes.

Cause: The time on the specified node is in sync with the cluster median.

User Action: N/A

time_not_in_sync STATE_CHANGE ERROR no Message: The time on node {0} is not in sync with the cluster median.

Description: The GUI checks the time on all nodes.

Cause: The time on the specified node is not in sync with the cluster
median.

User Action: Synchronize the time on the specified node.

time_sync_unknown STATE_CHANGE WARNING no Message: The time on node {0} cannot be determined.

Description: The GUI checks the time on all nodes.

Cause: The time on the specified node cannot be determined.

User Action: Check whether the node is reachable from the GUI.


xcat_nodelist_missing STATE_CHANGE ERROR no Message: The node {0} is unknown by xCAT.

Description: The GUI checks whether xCAT can manage the node.

Cause: The xCAT does not know about the node.

User Action: Add the node to xCAT. Ensure that the hostname that is used
in xCAT matches the hostname that is known by the node itself.

xcat_nodelist_ok STATE_CHANGE INFO no Message: The node {0} is known to the xCAT.

Description: The GUI checks whether xCAT can manage the node.

Cause: xCAT knows about the node and manages it.

User Action: N/A

xcat_nodelist_unknown STATE_CHANGE WARNING no Message: State of the node {0} in xCAT is unknown.

Description: The GUI checks whether xCAT can manage the node.

Cause: The state of the node within xCAT cannot be determined.

User Action: N/A

xcat_state_error STATE_CHANGE INFO no Message: The xCAT on node {1} failed to operate properly on cluster {0}.

Description: The GUI checks the xCAT state.

Cause: The node specified as xCAT host is reachable, but either xCAT is not
installed on the node or not operating properly.

User Action: Check xCAT installation and try xCAT command nodes, rinv,
and rvitals for errors.

xcat_state_invalid_version STATE_CHANGE WARNING no Message: The xCAT service does not have the recommended version ({1} actual/recommended).

Description: The GUI checks the xCAT state.

Cause: The reported version of xCAT is not compliant with the


recommendation.

User Action: Install the recommended xCAT version.

xcat_state_no_connection STATE_CHANGE ERROR no Message: Unable to connect to xCAT node {1} on cluster {0}.

Description: The GUI checks the xCAT state.

Cause: Cannot connect to the node specified as xCAT host.

User Action: Check whether the IP address is correct and ensure that root
has key-based SSH set up to the xCAT node.

xcat_state_ok STATE_CHANGE INFO no Message: The availability of xCAT on cluster {0} is OK.

Description: The GUI checks the xCAT state.

Cause: The availability and state of xCAT is OK.

User Action: N/A

xcat_state_unconfigured STATE_CHANGE WARNING no Message: The xCAT host is not configured on cluster {0}.

Description: The GUI checks the xCAT state.

Cause: The host where xCAT is located is not specified.

User Action: Specify the hostname or IP where xCAT is located.

xcat_state_unknown STATE_CHANGE WARNING no Message: Availability of xCAT on cluster {0} is unknown.

Description: The GUI checks the xCAT state.

Cause: The availability and state of xCAT cannot be determined.

User Action: N/A



Hadoop connector events
The following table lists the events that are created for the Hadoop connector component.
Table 87. Events for the Hadoop connector component

Event | Event Type | Severity | Call Home | Details

hadoop_datanode_down STATE_CHANGE ERROR no Message: Hadoop DataNode service is down.

Description: The Hadoop DataNode service is down.

Cause: The Hadoop DataNode process is not running.

User Action: Start the Hadoop DataNode service.

hadoop_datanode_up STATE_CHANGE INFO no Message: Hadoop DataNode service is up.

Description: The Hadoop DataNode service is running.

Cause: The Hadoop DataNode process is running.

User Action: N/A

hadoop_datanode_warn INFO WARNING no Message: Hadoop DataNode monitoring returned unknown results.

Description: The Hadoop DataNode service check returned unknown results.

Cause: The Hadoop DataNode service status check returned unknown results.

User Action: If this status persists after a few minutes, then restart the
DataNode service.

hadoop_namenode_down STATE_CHANGE ERROR no Message: Hadoop NameNode service is down.

Description: The Hadoop NameNode service is down.

Cause: The Hadoop NameNode process is not running.

User Action: Start the Hadoop NameNode service.

hadoop_namenode_up STATE_CHANGE INFO no Message: Hadoop NameNode service is up.

Description: The Hadoop NameNode service is running.

Cause: The Hadoop NameNode process is running.

User Action: N/A

hadoop_namenode_warn INFO WARNING no Message: Hadoop NameNode monitoring returned unknown results.

Description: The Hadoop NameNode service status check returned


unknown results.

Cause: N/A

User Action: If this status persists after a few minutes, then restart the
NameNode service.

HDFS data node events


The following table lists the events that are created for the HDFS data node component.
Table 88. Events for the HDFS data node component

Event | Event Type | Severity | Call Home | Details

hdfs_datanode_config_missing STATE_CHANGE WARNING no Message: HDFS DataNode configuration is missing.

Description: The HDFS DataNode configuration for the hdfs cluster is missing on this node.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs config get core-site.xml -k dfs.nameservices' command did not report a valid hdfs cluster name.

User Action: Ensure that the configuration is uploaded by using the mmhdfs command.


hdfs_datanode_process_down STATE_CHANGE ERROR no Message: HDFS DataNode process for hdfs cluster {0} is down.

Description: The HDFS DataNode process is down.

Cause: The '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon status datanode' command reported that the process is dead.

User Action: Start the Hadoop DataNode process again by using the
'/usr/lpp/mmfs/hadoop/bin/hdfs --daemon start datanode' command.
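
Example: a minimal sketch of the user action, using only the commands quoted in this entry.

  # Confirm the current DataNode process state
  /usr/lpp/mmfs/hadoop/bin/hdfs --daemon status datanode
  # Start the DataNode process again
  /usr/lpp/mmfs/hadoop/bin/hdfs --daemon start datanode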

hdfs_datanode_process_unknown STATE_CHANGE WARNING no Message: HDFS DataNode process for hdfs cluster {0} is unknown.

Description: The HDFS DataNode process is unknown.

Cause: The '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon status datanode' command reported unexpected results.

User Action: Check the HDFS DataNode service. If needed, then restart
it by using the '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon start datanode'
command.

hdfs_datanode_process_up STATE_CHANGE INFO no Message: HDFS DataNode process for hdfs cluster {0} is OK.

Description: The HDFS DataNode process is running.

Cause: The '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon status datanode' command reported a running process.

User Action: N/A

HDFS name node events


The following table lists the events that are created for the HDFS name node component.
Table 89. Events for the HDFS name node component

Event | Event Type | Severity | Call Home | Details

hdfs_namenode_active STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is ACTIVE.

Description: The HDFS NameNode service is in ACTIVE state, as expected.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned an ACTIVE serviceState.

User Action: N/A

hdfs_namenode_config_missing STATE_CHANGE WARNING no Message: HDFS NameNode configuration for cluster {0} is missing.
Description: The HDFS NameNode configuration for the HDFS cluster is
missing on this node.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs config get core-site.xml


-k dfs.nameservices' command did not report a valid HDFS cluster name.

User Action: Ensure that the configuration is uploaded by using the


mmhdfs command.

hdfs_namenode_error STATE_CHANGE ERROR no Message: HDFS NameNode health for HDFS cluster {0} is invalid.

Description: The HDFS NameNode service has an invalid health state.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned with error.

User Action: Validate that the HDFS configuration is valid and try to start
the NameNode service manually.

hdfs_namenode_failed STATE_CHANGE ERROR no Message: HDFS NameNode health for HDFS cluster {0} failed.

Description: The HDFS NameNode service is failed.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned a FAILED healthState.

User Action: Start the Hadoop NameNode service.


hdfs_namenode_initializing STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is
INITIALIZING.

Description: The HDFS NameNode service is in INITIALIZING state.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned an INITIALIZING serviceState.

User Action: Wait for the service to finish initialization.

hdfs_namenode_krb_auth_failed STATE_CHANGE WARNING no Message: HDFS NameNode check health state failed with kinit error for cluster {0}.

Description: The Kerberos authentication that is required to query the health state failed.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y' command failed with rc=2 (kinit error).

User Action: Ensure that the 'KINIT_KEYTAB' and 'KINIT_PRINCIPAL' Hadoop environment variables are correctly configured in 'hadoop-env.sh'.
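
Example: an illustrative fragment for 'hadoop-env.sh'; the keytab path and principal below are placeholders, not values taken from this document.

  # Kerberos identity used by the health monitor (placeholder values)
  export KINIT_KEYTAB=/etc/security/keytabs/nn.service.keytab
  export KINIT_PRINCIPAL=nn/$(hostname -f)@EXAMPLE.COM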

hdfs_namenode_ok STATE_CHANGE INFO no Message: HDFS NameNode health for HDFS cluster {0} is OK.

Description: The HDFS NameNode service is running.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned a OK healthState.

User Action: N/A

hdfs_namenode_process_down STATE_CHANGE ERROR no Message: HDFS NameNode process for HDFS cluster {0} is down.

Description: The HDFS NameNode process is down.

Cause: The '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon status namenode' command reported that the process is dead.

User Action: Start the Hadoop NameNode process by using the mmces
service start hdfs command.
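
Example: a minimal sketch of the user action, using only the commands quoted in this entry.

  # Confirm the current NameNode process state
  /usr/lpp/mmfs/hadoop/bin/hdfs --daemon status namenode
  # Start the CES HDFS service, which brings the NameNode process up again
  mmces service start hdfs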

hdfs_namenode_process_unknown STATE_CHANGE WARNING no Message: HDFS NameNode process for HDFS cluster {0} is unknown.

Description: The HDFS NameNode process is unknown.

Cause: The '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon status namenode' command reported unexpected results.

User Action: Check the HDFS Namenode service and if needed, then
restart it by using the mmces service start hdfs command.

hdfs_namenode_process_up STATE_CHANGE INFO no Message: HDFS NameNode process for HDFS cluster {0} is OK.

Description: The HDFS NameNode process is running.

Cause: The '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon status namenode'


command reported a running process.

User Action: N/A

hdfs_namenode_standby STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is in
STANDBY.

Description: The HDFS NameNode service is in STANDBY state, as


expected.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned a STANDBY serviceState.

User Action: N/A

hdfs_namenode_stopping STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is STOPPING.

Description: The HDFS NameNode service is in STOPPING state.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned a STOPPING serviceState.

User Action: Wait for the service to finish stopping.


hdfs_namenode_unauthorized STATE_CHANGE WARNING no Message: HDFS NameNode check health state failed for cluster {0}.
Description: Failed to query the health state because of missing or wrong
authentication token.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command failed with rc=3 (permission error).

User Action: Ensure that the 'KINIT_KEYTAB' and 'KINIT_PRINCIPAL'


Hadoop environment variables are configured in the 'hadoop-env.sh'.

hdfs_namenode_unknown_state STATE_CHANGE WARNING no Message: HDFS NameNode service state for HDFS cluster {0} is UNKNOWN.

Description: The HDFS NameNode service is in UNKNOWN state, as expected.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned an UNKNOWN serviceState.

User Action: N/A

hdfs_namenode_wrong_state STATE_CHANGE WARNING no Message: HDFS NameNode service state for HDFS cluster {0} is
unexpected {1}.

Description: The HDFS NameNode service state is not as expected. For


example, in STANDBY but is supposed to be ACTIVE or vice versa.

Cause: The '/usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth -Y'


command returned serviceState, which does not match the expected state
when looking at the assigned ces IP attributes.

User Action: N/A

Keystone events
The following table lists the events that are created for the Keystone component.
Table 90. Events for the Keystone component

Event | Event Type | Severity | Call Home | Details

ks_failed STATE_CHANGE ERROR FTDC upload Message: The keystone (HTTPd) process should be {0}, but is {1}.

Description: The keystone (HTTPd) process is in an unexpected mode.

Cause: If the object authentication is local, AD or LDAP, then the process failed. If the object authentication is none or user-defined, then the process is expected to be stopped, but was running.

User Action: Ensure that the process is in the expected state.

ks_ok STATE_CHANGE INFO no Message: The keystone (HTTPd) process is as expected, state is {0}.

Description: The keystone (HTTPd) process is in the expected state.

Cause: If the object authentication is local, AD or LDAP, then the process is running. If the object authentication is none or user-defined, then the process stopped as expected.

User Action: N/A

ks_restart INFO WARNING no Message: The {0} service failed. Trying to recover.

Description: A service was not in the expected state.

Cause: A service might have stopped unexpectedly.

User Action: N/A

ks_url_exfail STATE_CHANGE WARNING no Message: Keystone request failed by using {0}.

Description: A request to an external keystone URL failed.

Cause: An HTTP request to an external keystone server failed.

User Action: Check that HTTPd or keystone is running on the expected


server, and is accessible with the defined ports.


ks_url_failed STATE_CHANGE ERROR no Message: Keystone request failed by using {0}.

Description: A keystone URL request failed.

Cause: An HTTP request to keystone failed.

User Action: Check whether the HTTPd or keystone is running on the


expected server, and is accessible with the defined ports.

ks_url_ok STATE_CHANGE INFO no Message: Keystone request is successful by using {0}.

Description: A keystone URL request was successful.

Cause: An HTTP request to keystone returned successfully.

User Action: N/A

ks_url_warn INFO WARNING no Message: Keystone request {0} returned an unknown result.

Description: A keystone URL request returned an unknown result.

Cause: A simple HTTP request to keystone returned an unexpected error.

User Action: Check whether HTTPd or keystone is running on the expected


server, and is accessible with the defined ports.

ks_warn INFO WARNING no Message: The keystone (HTTPd) process monitoring returned an unknown
result.

Description: The keystone (HTTPd) monitoring returned an unknown


result.

Cause: A status query for HTTPd returned an unexpected error.

User Action: Check service script and settings of HTTPd.

ldap_reachable STATE_CHANGE INFO no Message: The external LDAP server {0} is up.

Description: The external LDAP server is operational.

Cause: N/A

User Action: N/A

ldap_unreachable STATE_CHANGE ERROR no Message: The external LDAP server {0} is unresponsive.

Description: The external LDAP server is unresponsive.

Cause: The local node is unable to connect to the LDAP server.

User Action: Verify network connection and check whether that LDAP
server is operational.

postgresql_failed STATE_CHANGE ERROR FTDC upload Message: The 'postgresql-obj' process should be {0}, but is {1}.

Description: The 'postgresql-obj' process is in an unexpected mode.

Cause: The database backend for object authentication is supposed to run on a single node; either the database is not running on the designated node or it is running on a different node.

User Action: Check that the 'postgresql-obj' process is running on the expected server.

postgresql_ok STATE_CHANGE INFO no Message: The 'postgresql-obj' process is as expected, state is {0}.

Description: The 'postgresql-obj' process is in the expected mode.

Cause: The database backend for object authentication is supposed to be running on the right node while being stopped on others.

User Action: N/A

postgresql_warn INFO WARNING no Message: The 'postgresql-obj' process monitoring returned an unknown
result.

Description: The 'postgresql-obj' process monitoring returned an unknown result.

Cause: A status query for 'postgresql-obj' returned an unexpected error.

User Action: Check the PostgreSQL database engine.



Local Cache events
The following table lists the events that are created for the Local Cache component.
Table 91. Events for the Local cache component

Event | Event Type | Severity | Call Home | Details

lroc_buffer_desc_autotune_tip TIP TIP no Message: For optimal LROC tuning, based on average cached data block sizes of {0} in previously observed workloads, a value of {1} for the maxBufferDescs configuration on this node is recommended.

Description: Not enough buffer descriptors are available for optimal LROC performance.

Cause: For optimal LROC performance, based on the average cached data block sizes calculated, a larger number of buffer descriptors for this node is recommended.

User Action: Run the mmchconfig maxBufferDescs={recommendedBufferDescs} command. After setting 'maxBufferDescs' to the new limit, run the mmshutdown or mmstartup command for the changes to take effect.
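
Example: a minimal sketch of the user action; the value 1048576 is only a placeholder for the recommended value {1} reported in the event message.

  # Apply the recommended buffer descriptor count
  mmchconfig maxBufferDescs=1048576
  # Restart the daemon on this node for the change to take effect
  mmshutdown
  mmstartup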

lroc_clear_tips STATE_CHANGE INFO no Message: LROC is healthy.

Description: Clear TIP events from LROC.

Cause: Local Cache device check is OK.

User Action: N/A

lroc_daemon_idle STATE_CHANGE WARNING no Message: LROC is not running.

Description: LROC daemon check reported that it is in an idle state, but local cache disks are configured.

Cause: LROC disks are configured, but the LROC daemon is currently idle,
which can be a valid transitional state.

User Action: N/A

lroc_daemon_running STATE_CHANGE INFO no Message: LROC is normal.

Description: LROC daemon check reported that it is running.

Cause: The result of checking the status of the LROC daemon was OK.

User Action: N/A

lroc_daemon_shutdown STATE_CHANGE WARNING no Message: LROC is down.

Description: LROC daemon check reported that it is in a shutdown state.

Cause: The result of checking the status of the LROC daemon reported that
it is down.

User Action: N/A

lroc_daemon_unknown_state STATE_CHANGE WARNING no Message: LROC status cannot be determined.

Description: LROC daemon check reported that it is in an unknown state.

Cause: The result of checking the status of the LROC daemon reported that
it is in an unknown state.

User Action: N/A

lroc_disk_failed STATE_CHANGE ERROR no Message: LROC {0} device is not OK.

Description: Local cache disk is defined, but is not configured with LROC.

Cause: The result of checking the configured local cache device was not
OK.

User Action: Check the physical status of the LROC device, and the LROC
configuration by using the mmlsnsd and mmdiag commands.

lroc_disk_found INFO_ADD_ENTITY INFO no Message: The local cache disk {0} was found.

Description: A local cache disk was detected.

Cause: A local cache disk was detected.

User Action: N/A


lroc_disk_normal STATE_CHANGE INFO no Message: LROC {0} device is normal.

Description: Local Cache device check is OK.

Cause: The result of examining the configured local cache device was OK.

User Action: N/A

lroc_disk_unhealthy STATE_CHANGE WARNING no Message: LROC {0} device is not OK.

Description: Local cache disk is defined, but LROC daemon is not


configured to use it.

Cause: The result of checking the configured local cache device was not
OK.

User Action: Check the local cache configuration by using the mmlsnsd
and mmdiag commands.

lroc_disk_vanished INFO_DELETE_ENTITY INFO no Message: The disk {0} vanished.

Description: A previously declared local cache disk was not detected.

Cause: A local cache disk is not in use, which can be a valid situation.

User Action: N/A

lroc_sensors_active TIP INFO no Message: The perfmon sensor GPFSLROC is active.

Description: The GPFSLROC perfmon sensor is active.

Cause: The GPFSLROC perfmon sensors' period attribute is greater than 0.

User Action: N/A

lroc_sensors_clear STATE_CHANGE INFO no Message: Clear any previous bad GPFSLROC sensor state.

Description: Clear any previous bad GPFSLROC sensor state flag.

Cause: Clear any previous bad GPFSLROC sensor state.

User Action: N/A

lroc_sensors_inactive TIP TIP no Message: The perfmon sensor GPFSLROC is inactive.

Description: The perfmon sensor GPFSLROC is not active.

Cause: The GPFSLROC perfmon sensors' period attribute is 0.

User Action: Enable the perfmon GPFSLROC sensor by setting the period attribute of the GPFSLROC sensor to a value greater than 0 (the default is 10). To do so, use the mmperfmon config update GPFSLROC.period=N command, where 'N' is a natural number greater than 0. Alternatively, you can hide this event by using the mmhealth event hide lroc_sensors_inactive command.

lroc_sensors_not_configured TIP TIP no Message: The GPFSLROC perfmon sensor is not configured.

Description: The GPFSLROC perfmon sensor does not exist in the mmperfmon config show command output.

Cause: The GPFSLROC perfmon sensor is not configured in the sensors configuration file.

User Action: Include the GPFSLROC sensors into the perfmon configuration by using the mmperfmon config update --config-file InputFile command. The default file is '/opt/IBM/zimon/defaults/ZIMonSensors_GPFSLROC.cfg'. An example for the configuration file can be found in the mmperfmon command section in the Command Reference Guide.
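
Example: a minimal sketch of the user action, using the default sensor definition file named above.

  # Add the GPFSLROC sensor to the performance monitoring configuration
  mmperfmon config update --config-file /opt/IBM/zimon/defaults/ZIMonSensors_GPFSLROC.cfg
  # Ensure that the sensor period is greater than 0 (10 is the default mentioned above)
  mmperfmon config update GPFSLROC.period=10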


lroc_sensors_not_needed TIP TIP no Message: LROC is not configured, but a performance sensor GPFSLROC period is declared.

Description: There is an active performance sensor GPFSLROC period declared.

Cause: LROC is not configured, but the performance sensor GPFSLROC is active.

User Action: Disable the perfmon GPFSLROC sensor by setting the period attribute of the GPFSLROC sensor to 0. To do so, use the mmperfmon config update GPFSLROC.period=0 command. Alternatively, you can hide this event by using the mmhealth event hide lroc_sensors_not_needed command.

lroc_set_buffer_desc_tip TIP TIP no Message: This node has LROC devices with a total capacity of {0}
GB. Optimal LROC performance requires setting the 'maxBufferDescs'
configuration option. The value of desired buffer descriptors for this node is
'{1}', based on assumed 4 MB data block size.

Description: Not enough buffer descriptors are available for optimal LROC
performance.

Cause: The 'maxBufferDescs' configuration setting was not set to a value. For optimal LROC performance, the IBM Storage Scale daemon requires more buffer descriptors than the default.

User Action: Run the mmchconfig maxBufferDescs={desiredBufferDescs} command. After setting 'maxBufferDescs' to the new limit, run the mmshutdown or the mmstartup command for the changes to take effect.

Network events
The following table lists the events that are created for the Network component.
Table 92. Events for the network component

Event | Event Type | Severity | Call Home | Details

bond_degraded STATE_CHANGE WARNING no Message: Some secondaries of the network bond {0} went down.

Description: Some of the network bond parts are malfunctioning.

Cause: Some secondaries of the network bond are not functioning properly.

User Action: Check the bonding configuration, the network configuration, and cabling of the malfunctioning secondaries of the network bond.

bond_down STATE_CHANGE ERROR no Message: All secondaries of the network bond {0} are down.

Description: All secondaries of a network bond are down.

Cause: All secondaries of this network bond went down.

User Action: Check the bonding configuration, the network configuration, and cabling of all secondaries of the network bond.

bond_nic_recognized STATE_CHANGE INFO no Message: Bond NIC {id} was recognized. Children {0}.

Description: The specified network bond NIC was correctly recognized for
usage by IBM Storage Scale.

Cause: The specified network bond NIC is reported in the mmfsadm dump
verbs command.

User Action: N/A

bond_up STATE_CHANGE INFO no Message: All secondaries of the network bond {0} are working as expected.

Description: This network bond is functioning properly.

Cause: All secondaries of this network bond are functioning properly.

User Action: N/A


expected_file_missing INFO WARNING no Message: The expected configuration or program file {0} was not found.

Description: An expected configuration or program file was not found.

Cause: An expected configuration or program file was not found.

User Action: Check for the existence of the file. If necessary, then install
required packages.

ib_rdma_disabled STATE_CHANGE INFO no Message: InfiniBand in RDMA mode is disabled.

Description: InfiniBand in RDMA mode is not enabled for IBM Storage


Scale.

Cause: The user did not enable verbsRdma by using the mmchconfig
command.

User Action: N/A

ib_rdma_enabled STATE_CHANGE INFO no Message: InfiniBand in RDMA mode is enabled.

Description: Infiniband in RDMA mode is enabled for IBM Storage Scale.

Cause: The user enabled verbsRdma by using the mmchconfig command.

User Action: N/A

ib_rdma_ext_port_speed_low TIP TIP no Message: InfiniBand RDMA NIC {id} uses a smaller extended port speed
than supported.

Description: The currently active extended link speed is less than the
supported value.

Cause: The currently active extended link speed is less than the supported
value.

User Action: Check the settings of the specified InfiniBand RDMA NIC
(ibportstate).

ib_rdma_ext_port_speed_ok TIP INFO no Message: InfiniBand RDMA NIC {id} uses maximum supported port speed.

Description: The currently enabled extended link speed is equal to the


supported speed.

Cause: N/A

User Action: N/A

ib_rdma_libs_found STATE_CHANGE INFO no Message: All checked library files can be found.

Description: All checked library files (librdmacm and libibverbs) can be


found with expected path names.

Cause: The library files are in the expected directories with expected
names.

User Action: N/A

ib_rdma_libs_wrong_path STATE_CHANGE ERROR no Message: The library files cannot be found.

Description: At least one of the library files (librdmacm and libibverbs)


cannot be found with an expected pathname.

Cause: Either the libraries are missing or their path names are wrongly set.

User Action: Check whether the libraries, 'librdmacm', and 'libibverbs' are
installed. Also, check whether they can be found by the names that are
referenced in the mmfsadm test verbs config command.

ib_rdma_link_down STATE_CHANGE ERROR no Message: InfiniBand RDMA NIC {id} is down.

Description: The physical link of the specified InfiniBand RDMA NIC is


down.

Cause: Physical state of the specified InfiniBand RDMA NIC is not 'LinkUp'
according to ibstat.

User Action: Check the cabling of the specified InfiniBand RDMA NIC.


ib_rdma_link_up STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} is up.

Description: The physical link of the specified InfiniBand RDMA NIC is up.

Cause: Physical state of the specified InfiniBand RDMA NIC is 'LinkUp'


according to ibstat.

User Action: N/A

ib_rdma_nic_down STATE_CHANGE ERROR no Message: NIC {id} is down according to ibstat.

Description: The specified InfiniBand RDMA NIC is down.

Cause: The specified InfiniBand RDMA NIC is down according to ibstat.

User Action: Enable the specified InfiniBand RDMA NIC.

ib_rdma_nic_found INFO_ADD_ENTITY INFO no Message: InfiniBand RDMA NIC {id} was found.

Description: A new InfiniBand RDMA NIC was found.

Cause: A new relevant InfiniBand RDMA NIC is listed by ibstat.

User Action: N/A

ib_rdma_nic_recognized STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} was recognized.

Description: The specified InfiniBand RDMA NIC was correctly recognized


for usage by IBM Storage Scale.

Cause: The specified InfiniBand RDMA NIC is reported in the mmfsadm


dump verbs command.

User Action: N/A

ib_rdma_nic_unrecognized STATE_CHANGE ERROR no Message: InfiniBand RDMA NIC {id} was not recognized.

Description: The specified InfiniBand RDMA NIC was not correctly


recognized for usage by IBM Storage Scale.

Cause: The specified InfiniBand RDMA NIC is not reported in the mmfsadm
dump verbs command.

User Action: Check '/var/adm/ras/mmfs.log.latest' for VERBS RDMA error


messages.

ib_rdma_nic_up STATE_CHANGE INFO no Message: NIC {id} is up according to ibstat.

Description: The specified InfiniBand RDMA NIC is up.

Cause: The specified InfiniBand RDMA NIC is up according to ibstat.

User Action: N/A

ib_rdma_nic_vanished INFO_DELETE_ENTITY INFO no Message: InfiniBand RDMA NIC {id} vanished.

Description: The specified InfiniBand RDMA NIC cannot be detected


anymore.

Cause: One of the previously monitored InfiniBand RDMA NICs is not listed
by ibstat anymore.

User Action: N/A

ib_rdma_port_speed_low STATE_CHANGE WARNING no Message: InfiniBand RDMA NIC {id} uses a smaller port speed than
enabled.

Description: The currently active link speed is lesser than the enabled
maximum link speed.

Cause: The currently active link speed is lesser than the enabled maximum
link speed.

User Action: Check the settings of the specified IB RDMA NIC (ibportstate).


ib_rdma_port_speed_ok STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} uses maximum enabled port speed.

Description: The currently active link speed is equal to the enabled maximum link speed.

Cause: The currently active link speed is equal to the enabled maximum
link speed.

User Action: N/A

ib_rdma_port_speed_optimal TIP INFO no Message: InfiniBand RDMA NIC {id} uses maximum supported port speed.

Description: The currently enabled link speed is equal to the supported


maximum link speed.

Cause: The currently enabled link speed is equal to the supported


maximum link speed.

User Action: N/A

ib_rdma_port_speed_suboptimal TIP TIP no Message: InfiniBand RDMA NIC {id} uses a smaller port speed than supported.

Description: The currently enabled link speed is lesser than the supported
maximum link speed.

Cause: The currently enabled link speed is lesser than the supported
maximum link speed.

User Action: Check the settings of the specified InfiniBand RDMA NIC
(ibportstate).

ib_rdma_port_width_low STATE_CHANGE WARNING no Message: InfiniBand RDMA NIC {id} uses a smaller port width than
enabled.

Description: The currently active link width is lesser than the enabled
maximum link width.

Cause: The currently active link width is lesser than the enabled maximum
link width.

User Action: Check the settings of the specified InfiniBand RDMA NIC
(ibportstate).

ib_rdma_port_width_ok STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} uses maximum enabled port width.

Description: The currently active link width is equal to the enabled maximum link width.

Cause: The currently active link width is equal to the enabled maximum
link width.

User Action: N/A

ib_rdma_port_width_optimal TIP INFO no Message: InfiniBand RDMA NIC {id} uses maximum supported port width.

Description: The currently enabled link width is equal to the supported


maximum link width.

Cause: The currently enabled link width is equal to the supported


maximum link width.

User Action: N/A

ib_rdma_port_width_suboptimal TIP TIP no Message: InfiniBand RDMA NIC {id} uses a smaller port width than supported.

Description: The currently enabled link width is lesser than the supported
maximum link width.

Cause: The currently enabled link width is lesser than the supported
maximum link width.

User Action: Check the settings of the specified IB RDMA NIC (ibportstate).

ib_rdma_ports_ok STATE_CHANGE INFO no Message: verbsPorts is correctly set for InfiniBand RDMA.

Description: The verbsPorts setting has a correct value.

Cause: The user correctly configured verbsPorts.

User Action: N/A


ib_rdma_ports_undefined STATE_CHANGE ERROR no Message: No NICs and ports are set up for InfiniBand RDMA.

Description: No NICs and ports are set up for InfiniBand RDMA.

Cause: The user did not configure verbsPorts by using the mmchconfig
command.

User Action: Set up the NICs and ports to use with the verbsPorts setting
in the mmchconfig command.
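
Example: a minimal sketch of the user action; the device and port value 'mlx5_0/1' is only an illustration and must be replaced with the HCA and port of the local node.

  # Review the current RDMA-related settings
  mmlsconfig verbsRdma
  mmlsconfig verbsPorts
  # Define the device and port to use for RDMA, then enable RDMA
  mmchconfig verbsPorts="mlx5_0/1"
  mmchconfig verbsRdma=enable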

ib_rdma_ports_wrong STATE_CHANGE ERROR no Message: verbsPorts is incorrectly set for InfiniBand RDMA.

Description: The verbsPorts setting has wrong contents.

Cause: The user incorrectly configured verbsPorts by using the mmchconfig command.

User Action: Check the format of the verbsPorts setting by using the mmlsconfig command.

ib_rdma_verbs_failed STATE_CHANGE ERROR no Message: VERBS RDMA was not started.

Description: IBM Storage Scale cannot start VERBS RDMA.

Cause: The InfiniBand RDMA-related libraries are improperly installed or configured.

User Action: Check '/var/adm/ras/mmfs.log.latest' for the root cause hints. Also, check whether all relevant InfiniBand libraries are installed and correctly configured.

ib_rdma_verbs_started STATE_CHANGE INFO no Message: VERBS RDMA was started.

Description: IBM Storage Scale started VERBS RDMA.

Cause: The InfiniBand RDMA-related libraries, which IBM Storage Scale uses, are working properly.

User Action: N/A

many_tx_errors STATE_CHANGE ERROR FTDC upload Message: NIC {0} had many TX errors since the last monitoring cycle.

Description: The network adapter had many TX errors since the last monitoring cycle.

Cause: The '/proc/net/dev' file lists many more TX errors for this adapter since the last monitoring cycle.

User Action: Check the network cabling and network infrastructure.
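
To inspect the counters that this check evaluates, you can, for example, display the adapter statistics (eth0 is a placeholder for the reported NIC):

   # Show per-interface statistics, including the TX error counter
   ip -s link show eth0
   # Or read the raw counters from /proc/net/dev
   grep eth0 /proc/net/dev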

network_connectivity_down STATE_CHANGE ERROR no Message: NIC {0} cannot connect to the gateway.

Description: This network adapter cannot connect to the gateway.

Cause: The gateway does not respond to the sent connection-checking packets.

User Action: Check the network configuration of the network adapter, path
to the gateway, and gateway itself.

network_connectivity_up STATE_CHANGE INFO no Message: NIC {0} can connect to the gateway.

Description: This network adapter can connect to the gateway.

Cause: The gateway responds to the sent connections-checking packets.

User Action: N/A

network_down STATE_CHANGE ERROR no Message: NIC {0} is down.

Description: This network adapter is down.

Cause: This network adapter is disabled.

User Action: Enable this network adapter.


network_found INFO_ADD_ENTITY INFO no Message: NIC {0} was found.

Description: A new network adapter was found.

Cause: A new NIC, which is relevant for the IBM Storage Scale monitoring,
is listed by ip a.

User Action: N/A

network_ips_down STATE_CHANGE ERROR no Message: No relevant NICs detected.

Description: No relevant network adapters detected.

Cause: No network adapters have the IBM Storage Scale-relevant IPs.

User Action: Find out why the IBM Storage Scale-relevant IPs were not assigned to any NICs.

network_ips_partially_down STATE_CHANGE ERROR no Message: Some relevant IPs are not served by found NICs: {0}.

Description: Some relevant IPs are not served by network adapters.

Cause: At least one IBM Storage Scale-relevant IP is not assigned to a network adapter.

User Action: Find out why the specified IBM Storage Scale-relevant IPs
were not assigned to any NICs.

network_ips_up STATE_CHANGE INFO no Message: Relevant IPs are served by found NICs.

Description: Relevant IPs are served by network adapters.

Cause: At least one IBM Storage Scale-relevant IP is assigned to a network adapter.

User Action: N/A

network_link_down STATE_CHANGE ERROR no Message: Physical link of the NIC {0} is down.

Description: The physical link of this adapter is down.

Cause: The 'LOWER_UP' flag is not set for this NIC in the output of ip a.

User Action: Check the network cabling and network infrastructure.
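
For example, the flag that this check evaluates can be displayed as follows (eth0 is a placeholder for the reported NIC):

   # A healthy physical link shows the LOWER_UP flag in the first output line
   ip a show eth0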

network_link_up STATE_CHANGE INFO no Message: Physical link of the NIC {0} is up.

Description: The physical link of this adapter is up.

Cause: The 'LOWER_UP' flag is set for this NIC in the output of ip a.

User Action: N/A

network_up STATE_CHANGE INFO no Message: NIC {0} is up.

Description: This network adapter is up.

Cause: This network adapter is enabled.

User Action: N/A

network_vanished INFO_DELETE_ENTITY INFO no Message: NIC {0} vanished.

Description: One of network adapters cannot be detected anymore.

Cause: One of the previously monitored NICs is not listed by ip a anymore.

User Action: N/A

nic_firmware_not_available STATE_CHANGE WARNING no Message: The expected firmware level of adapter {id} is not available.

Description: The expected firmware level is not available.

Cause: /usr/lpp/mmfs/bin/tslshcafirmware -Y does not return any expected firmware level for this adapter.

User Action: Run /usr/lpp/mmfs/bin/tslshcafirmware -Y, check whether it returns a value for the expectedFirmware field, and check whether it is working as expected. This command uses /usr/lpp/mmfs/updates/latest/firmware/hca/FirmwareInfo.hca, which is provided with the ECE packages; check whether that file is available and accessible.
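
A minimal check sketch (the -Y option typically produces colon-separated, machine-readable output):

   # Query the installed and expected HCA firmware levels
   /usr/lpp/mmfs/bin/tslshcafirmware -Y
   # Verify that the firmware description file shipped with the ECE packages is present
   ls -l /usr/lpp/mmfs/updates/latest/firmware/hca/FirmwareInfo.hca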


nic_firmware_ok STATE_CHANGE INFO no Message: The adapter {id} has the expected firmware level {0}.

Description: The adapter firmware level is as expected.

Cause: /usr/lpp/mmfs/bin/tslshcafirmware -Y returned the same firmwareLevel and expectedFirmwareLevel for this adapter.

User Action: N/A

nic_firmware_unexpected STATE_CHANGE WARNING no Message: The adapter {id} has firmware level {0} and not the expected
firmware level {1}.

Description: The adapter firmware level is not as expected.

Cause: /usr/lpp/mmfs/bin/tslshcafirmware -Y returned different values for firmwareLevel and expectedFirmwareLevel for this adapter.

User Action: N/A

no_tx_errors STATE_CHANGE INFO no Message: NIC {0} had no or a tiny number of TX errors.

Description: The NIC had no or an insignificant number of TX errors.

Cause: The '/proc/net/dev' file lists no or an insignificant number of TX errors for this adapter since the last monitoring cycle.

User Action: N/A

rdma_roce_cma_tos TIP TIP no Message: NIC {id} The CMA type of service class is not set to the
recommended value.

Description: The CMA type of service class is not set to the recommended
value.

Cause: The CMA type of service class is not set to the recommended value.

User Action: Check the settings of the specified InfiniBand RDMA NIC by using the cma_roce_tos command, and check the system health monitor configuration file (mmsysmonitor.conf).

rdma_roce_cma_tos_ok STATE_CHANGE INFO no Message: NIC {id} The CMA type of service class is set to the
recommended value.

Description: The CMA type of service class is set to the recommended value.

Cause: The CMA type of service class is set to the recommended value.

User Action: N/A

rdma_roce_mtu_low TIP TIP no Message: NIC {id} The actual MTU size is less than the maximum MTU size.

Description: The actual MTU size is less than the maximum MTU size.

Cause: The actual MTU size is less than the maximum MTU size.

User Action: Check the MTU settings of the specified NIC by using the
'ibv_devinfo' command.
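
For example (mlx5_0 is a placeholder for the reported device):

   # Compare the active MTU with the maximum supported MTU of each port
   ibv_devinfo -d mlx5_0 | grep -e max_mtu -e active_mtu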

rdma_roce_mtu_ok STATE_CHANGE INFO no Message: NIC {id} The actual MTU size is OK.

Description: The actual MTU size is set to the maximum MTU size.

Cause: The actual MTU size is set to the maximum MTU size.

User Action: N/A

rdma_roce_pfc_prio_buffer_bad STATE_CHANGE WARNING no Message: NIC {id} The PFC buffer priority class is not set to the recommended value.

Description: The PFC buffer priority class is not set to the recommended
value, which might lead to a significant decrease in performance.

Cause: The PFC buffer priority class is not set to the recommended value.

User Action: Check the settings of the specified InfiniBand RDMA NIC by
using the mlnx_qos command and the system health monitor configuration
file by using the mmsysmonitor.conf file.


rdma_roce_pfc_prio_buffer_ok STATE_CHANGE INFO no Message: NIC {id} The PFC buffer priority class is set to the recommended value.

Description: The PFC buffer priority class is set to the recommended value.

Cause: The PFC buffer priority class is set to the recommended value.

User Action: N/A

rdma_roce_pfc_prio_enabled_bad STATE_CHANGE WARNING no Message: NIC {id} The enabled PFC priority class is not set to the recommended value.

Description: The enabled PFC priority class is not set to the recommended
value, which might lead to a significant decrease in performance.

Cause: The enabled PFC priority class is not set to the recommended
value.

User Action: Check the settings of the specified NIC (mlnx_qos) and the
system health monitor configuration file (mmsysmonitor.conf).

rdma_roce_pfc_prio_enabled_ok STATE_CHANGE INFO no Message: NIC {id} The enabled PFC priority class is set to the recommended value.

Description: The enabled PFC priority class is set to the recommended value.

Cause: The enabled PFC priority class is set to the recommended value.

User Action: N/A

rdma_roce_qos_prio_trust STATE_CHANGE WARNING no Message: NIC {id} The RoCE QoS value for trust is not set to 'dscp'.

Description: The RoCE QoS setting for trust is not set to 'dscp', which might
lead to a significant decrease in performance.

Cause: The RoCE QoS setting for trust is not set to 'dscp'.

User Action: Check the settings of the specified RoCE NIC by using the
'mlnx_qos' command.
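
A sketch of the check and correction, assuming the Mellanox mlnx_qos tool is installed and enp1s0 is the RoCE interface (a placeholder):

   # Display the current QoS settings, including the trust state
   mlnx_qos -i enp1s0
   # Set the trust state to dscp
   mlnx_qos -i enp1s0 --trust dscp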

rdma_roce_qos_prio_trust_dscp STATE_CHANGE INFO no Message: NIC {id} The RoCE QoS setting for trust is set to 'dscp'.

Description: The RoCE QoS setting for trust is set to 'dscp'.

Cause: The RoCE QoS setting for trust is set to 'dscp'.

User Action: N/A

rdma_roce_tclass TIP TIP no Message: NIC {id} The traffic class is not set to the recommended value.

Description: The traffic class is not set to the recommended value.

Cause: The traffic class is not set to the recommended value.

User Action: Check the settings of the specified InfiniBand RDMA NIC by using the /sys/class/infiniband/<interface>/tc/1/traffic_class file, and check the system health monitor configuration file (mmsysmonitor.conf).
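
For example (mlx5_0 is a placeholder for the reported device; port 1 is assumed):

   # Display the currently configured RoCE traffic class
   cat /sys/class/infiniband/mlx5_0/tc/1/traffic_class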

rdma_roce_tclass_ok STATE_CHANGE INFO no Message: NIC {id} The traffic class is set to the recommended value.

Description: The traffic class is set to the recommended value.

Cause: The traffic class is set to the recommended value.

User Action: N/A


NFS events
The following table lists the events that are created for the NFS component.
Table 93. Events for the NFS component

Event | Event Type | Severity | Call Home | Details

dbus_error STATE_CHANGE WARNING no Message: The DBus availability check failed.

Description: Failed to query DBus when the NFS service is registered.

Cause: The DBus was detected as down, which might cause several issues
on the local node.

User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.

dbus_error_pod STATE_CHANGE WARNING no Message: {id}: The DBus availability check failed.

Description: Failed to query DBus when the NFS service is registered.

Cause: The DBus was detected as down, which might cause several issues
on the local node.

User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.

dir_statd_perm_ok STATE_CHANGE INFO no Message: The permissions of the local NFS statd directory are correct.

Description: The permissions of the local NFS statd directory are correct.

Cause: The permissions of the local NFS statd directory are correct.

User Action: N/A

dir_statd_perm_problem STATE_CHANGE WARNING no Message: The permissions of the local NFS statd directory might be
incorrect for operation. {0}={1} (reference={2}).

Description: The permissions of the local NFS statd directory might be incorrect for operation.

Cause: The permissions of the local NFS statd directory might be incorrect
for operation.

User Action: Check permissions of the local NFS statd directory.

disable_nfs_service INFO_EXTERNAL INFO no Message: The CES NFS service is disabled.

Description: The NFS service is disabled on this node. Disabling a service also removes all configuration files, which is different from stopping a service.

Cause: The user disabled the NFS service by using the mmces service
disable nfs command.

User Action: N/A

enable_nfs_service INFO_EXTERNAL INFO no Message: The CES NFS service was enabled.

Description: The NFS service is enabled on this node. Enabling a protocol service also automatically installs the required configuration files with the current valid configuration settings.

Cause: The user enabled the NFS service by using the mmces service
enable nfs command.

User Action: N/A

ganeshaexit INFO_EXTERNAL INFO no Message: The CES NFS service is stopped.

Description: An NFS server instance was terminated.

Cause: An NFS instance was terminated or killed.

User Action: Restart the NFS service when the root cause for this issue is
solved.


ganeshagrace INFO_EXTERNAL INFO no Message: The CES NFS service is set to a grace mode.

Description: The NFS server is set to a grace mode for a limited time,
which gives the time to previously connected clients to recover their file
locks.

Cause: The grace period is always cluster-wide. The NFS export configurations might change, and one or more NFS servers get restarted.

User Action: N/A

knfs_available_warn STATE_CHANGE WARNING no Message: The kernel NFS service state is not masked or disabled.

Description: The kernel NFS service is available, but there is a risk that it
can be started, and can cause issues.

Cause: The kernel NFS service might be available.

User Action: Check the NFS setup. The nfs.service should be deactivated
or masked to avoid conflicts with the IBM Storage Scale NFS server.
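
For example, on distributions where the kernel NFS server unit is named nfs-server.service (the unit name can differ), the service can be stopped and masked as follows:

   systemctl stop nfs-server.service
   systemctl mask nfs-server.service
   # Confirm that the unit is now reported as masked
   systemctl status nfs-server.service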

knfs_disabled_ok STATE_CHANGE INFO no Message: The kernel NFS service is disabled, but it should be masked.

Description: The kernel NFS service is disabled, but there is a risk that it
can be started.

Cause: The kernel NFS service is disabled.

User Action: Check the NFS setup. The nfs.service should be masked to
avoid conflicts with the IBM Storage Scale NFS server.

knfs_masked_ok STATE_CHANGE INFO no Message: The kernel NFS service is masked.

Description: The kernel NFS service is masked to avoid the service from
being accidentally started.

Cause: The kernel NFS service is masked.

User Action: N/A

knfs_running_warn STATE_CHANGE WARNING no Message: The kernel NFS service state is active.

Description: The kernel NFS service is active and can cause conflicts with
the IBM Storage Scale NFS server.

Cause: The kernel NFS service is active.

User Action: Check the NFS setup. The nfs.service should be either deactivated or masked to avoid conflicts with the IBM Storage Scale NFS server.

mountd_rpcinfo_ok STATE_CHANGE INFO no Message: The NFS mountd service is listed by rpcinfo.

Description: A required service is listed.

Cause: The mountd service is listed by rpcinfo.

User Action: N/A

mountd_rpcinfo_unknown STATE_CHANGE WARNING no Message: The NFS mount service is not listed by rpcinfo.

Description: A required service is not listed.

Cause: The mountd service is not listed by rpcinfo, but expected to run.

User Action: Check whether the issue is caused by an rpcbind crash or kernel NFS usage. Run the systemctl restart nfs-ganesha.service command.
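
A minimal verification sketch:

   # Check which RPC programs are currently registered with rpcbind
   rpcinfo -p | grep -e mountd -e nfs -e nlockmgr
   # Restart the CES NFS server so that it registers its services again
   systemctl restart nfs-ganesha.service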


nfs3_down INFO WARNING no Message: The NFS v3 NULL check failed.

Description: The NFS v3 NULL check failed when it was expected to be functioning. This check verifies whether the NFS server reacts to NFS v3
requests. The NFS v3 protocol must be enabled for this check. If this down
state is detected, then further checks are done to figure out whether the
NFS server is still working. If the NFS server seems not to be working, then
a failover is triggered. If NFS v3 and NFS v4 protocols are configured, then
only the v3 NULL test is performed.

Cause: The NFS server is either under high load or hung up, which restricts
the processing of the request.

User Action: Check the health state of the NFS server and restart, if
necessary.

nfs3_up INFO INFO no Message: NFS v3 NULL check is successful.

Description: The NFS v3 NULL check is successful.

Cause: N/A

User Action: N/A

nfs4_down INFO WARNING no Message: NFS v4 NULL check failed.

Description: The NFS v4 NULL check failed. This check verifies whether
the NFS server reacts to NFS v4 requests. The NFS v4 protocol must be
enabled for this check. If this down state is detected, then further checks
are done to figure out whether the NFS server is still working. If the NFS
server seems not to be working, then a failover is triggered.

Cause: The NFS server is either under high load or hung up, which restricts
the processing of the request.

User Action: Check the health state of the NFS server and restart, if
necessary.

nfs4_up INFO INFO no Message: The NFS v4 NULL check is successful.

Description: The NFS v4 NULL check is successful.

Cause: N/A

User Action: N/A

nfs_active STATE_CHANGE INFO no Message: The NFS service is now active.

Description: The NFS service must be up and running, and in a healthy state to provide the configured file exports.

Cause: The NFS server is detected as alive.

User Action: N/A

nfs_active_pod STATE_CHANGE INFO no Message: {id}: The NFS service is now active.

Description: The NFS service must be up and running, and in a healthy state to provide the configured file exports.

Cause: The NFS server is detected as alive.

User Action: N/A

nfs_dbus_error STATE_CHANGE WARNING no Message: NFS check by using the DBus failed.

Description: The NFS service must be registered on DBus to be fully working, which is currently not the case.

Cause: The NFS service is registered on DBus, but there was a problem
while accessing it.

User Action: Check the health state of the NFS service and restart the NFS
service. Check the log files for reported issues.


nfs_dbus_error_pod STATE_CHANGE WARNING no Message: {id}: NFS check by using the DBus failed.

Description: The NFS service must be registered on DBus to be fully working, which is currently not the case.

Cause: The NFS service is registered on DBus, but there was a problem
while accessing it.

User Action: Check the health state of the NFS service and restart the NFS
service. Check the log files for reported issues.

nfs_dbus_failed STATE_CHANGE WARNING no Message: NFS check by using the DBus did not return the expected
message.

Description: The NFS service configuration settings or log configuration settings are queried by using the DBus. The result is checked for expected keywords.
keywords.

Cause: The NFS service is registered on DBus, but the check by using the
DBus did not return the expected result.

User Action: Stop the NFS service and start it again. Check the log
configuration of the NFS service.

nfs_dbus_failed_pod STATE_CHANGE WARNING no Message: {id}: NFS check by using the DBus did not return the expected
message.

Description: The NFS service configuration settings or log configuration settings are queried by using the DBus. The result is checked for expected keywords.
keywords.

Cause: The NFS service is registered on DBus, but the check by using the
DBus did not return the expected result.

User Action: Stop the NFS service and start it again. Check the log
configuration of the NFS service.

nfs_dbus_ok STATE_CHANGE INFO no Message: The NFS check by using the DBus is successful.

Description: The check to verify whether the NFS service is registered on DBus and working is successful.

Cause: The NFS service is registered on the DBus and working.

User Action: N/A

nfs_dbus_ok_pod STATE_CHANGE INFO no Message: {id}: The NFS check by using the DBus is successful.

Description: The check to verify whether the NFS service is registered on DBus and working is successful.

Cause: The NFS service is registered on the DBus and working.

User Action: N/A

nfs_exported_fs_chk STATE_CHANGE_EXTERNAL INFO no Message: The Cluster State Manager (CSM) cleared the 'nfs_exported_fs_down' event.

Description: Declared NFS exported file systems are either available again on this node, or not available on any node.

Cause: The CSM cleared an 'nfs_exported_fs_down' event on this node to display the self-detected state of this node.

User Action: N/A

nfs_exported_fs_down STATE_CHANGE_EXTERNAL ERROR no Message: One or more declared NFS exported file systems are not available on this node.

Description: One or more declared NFS exported file systems are not
available on this node. Other nodes might have those file systems available.

Cause: One or more declared NFS exported file systems are not available
on this node.

User Action: Check NFS export-related local and remote file system states.


nfs_exports_clear_state STATE_CHANGE INFO no Message: Clear local NFS export down state temporarily.

Description: Clear local NFS export down state temporarily, because an 'all
nodes have the same problem' message is received.

Cause: Clear local NFS export down state temporarily.

User Action: N/A

nfs_exports_down STATE_CHANGE WARNING no Message: One or more declared file systems for NFS exports are not
available.

Description: One or more declared file systems for NFS exports are
unavailable.

Cause: One or more declared file systems for NFS exports are unavailable.

User Action: Check local and remote file system states.

nfs_exports_up STATE_CHANGE INFO no Message: All declared file systems for NFS exports are available.

Description: All declared file systems for NFS exports are available.

Cause: All declared file systems for NFS exports are available.

User Action: N/A

nfs_in_grace STATE_CHANGE WARNING no Message: NFS is in a grace mode.

Description: The monitor detected that CES NFS is in a grace mode. During
this time, the NFS state is shown as degraded.

Cause: The NFS service was started or restarted.

User Action: N/A

nfs_in_grace_pod STATE_CHANGE WARNING no Message: {id}: NFS is in a grace mode.

Description: The monitor detected that CES NFS is in a grace mode. During
this time, the NFS state is shown as degraded.

Cause: The NFS service was started or restarted.

User Action: N/A

nfs_not_active STATE_CHANGE ERROR FTDC upload Message: NFS service is inactive.

Description: A check showed that the CES NFS service, which is supposed to be running, is inactive.

Cause: Process might be hung.

User Action: Restart the CES NFS service.

nfs_not_active_pod STATE_CHANGE ERROR FTDC upload Message: {id}: NFS service is inactive.

Description: A check showed that the CES NFS service, which is supposed to be running, is inactive.

Cause: Process might be hung.

User Action: Restart the NFS service.

nfs_not_dbus STATE_CHANGE WARNING no Message: NFS service is unavailable as the DBus service.

Description: The NFS service is currently not registered on DBus. In this mode, the NFS service is not fully working. Exports cannot be added or removed, and not set in the grace mode, which is important for data consistency.

Cause: The NFS service might be started while the DBus was down.

User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.


nfs_not_dbus_pod STATE_CHANGE WARNING no Message: {id}: NFS service is unavailable as the DBus service.

Description: The NFS service is currently not registered on DBus. In this mode, the NFS service is not fully working. Exports cannot be added or removed, and not set in the grace mode, which is important for data consistency.

Cause: The NFS service might be started while the DBus was down.

User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.

nfs_openConnection INFO WARNING no Message: NFS has invalid open connection to CES IP {0}.

Description: The NFS server did not close a connection to a released socket.

Cause: The NFS server has an open connection to a nonexistent CES IP.

User Action: Check for unexpected misbehavior of clients and reconnect them.

nfs_rpcinfo_ok STATE_CHANGE INFO no Message: The NFS program is listed by rpcinfo.

Description: The required program is listed.

Cause: The NFS program is listed by rpcinfo.

User Action: N/A

nfs_rpcinfo_unknown STATE_CHANGE WARNING no Message: The NFS program is not listed by rpcinfo.

Description: The required program is not listed.

Cause: The NFS program is not listed by rpcinfo, but expected to run.

User Action: Check whether the issue is caused by a rpcbind crash.

nfs_sensors_active TIP INFO no Message: The NFS perfmon sensor {0} is active.

Description: The NFS perfmon sensors are active. This event's monitor is
running only once an hour.

Cause: The NFS perfmon sensors' period attribute is greater than 0.

User Action: N/A

nfs_sensors_inactive TIP TIP no Message: The NFS perfmon sensor {0} is inactive.

Description: The NFS perfmon sensors are inactive. This event's monitor is
running only once an hour.

Cause: The NFS perfmon sensors' period attribute is 0.

User Action: Set the period attribute of the NFS sensors to a value greater
than 0. For more information, use the mmperfmon config update
SensorName.period=N command where 'SensorName' is the name of
a specific NFS sensor and 'N' is a natural number greater than 0. Consider
that this TIP monitor is running only once per hour and it might take up to
one hour to detect the changes in the configuration.
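
For example, assuming the event reports the sensor NFSIO (substitute the sensor name from the event message):

   # Display the current sensor configuration
   mmperfmon config show
   # Activate the sensor by setting its period to a value greater than 0
   mmperfmon config update NFSIO.period=10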

nfs_sensors_not_configured TIP TIP no Message: The NFS perfmon sensor {0} is not configured.

Description: The NFS perfmon sensor does not exist in the mmperfmon
config show command.

Cause: The NFS perfmon sensor is not configured in the sensors' configuration file.

User Action: Include the sensors into the perfmon configuration by using the mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_nfs.cfg command. An example for the configuration file can be found in the mmperfmon command page in the Command Reference Guide.


nfs_unresponsive STATE_CHANGE WARNING no Message: The NFS service is unresponsive.

Description: A check showed that the CES NFS service, which is supposed
to be running, is unresponsive.

Cause: Process might be hung or busy.

User Action: Restart the CES NFS service when this state persists.

nfs_unresponsive_pod STATE_CHANGE WARNING no Message: {id}: The NFS service is unresponsive.

Description: A check showed that the CES NFS service, which is supposed
to be running, is unresponsive.

Cause: Process might be hung or busy.

User Action: Restart the CES NFS service when this state persists.

nfsd_down STATE_CHANGE ERROR FTDC upload Message: NFSD process is not running.

Description: Checks for an NFS service process.

Cause: The NFS server process was not detected.

User Action: Check the health state of the NFS server and restart, if
necessary. The process might get hung or be in a dysfunctional state.
Ensure that the kernel NFS server is not running.

nfsd_no_restart INFO WARNING no Message: NFSD process cannot be restarted. Reason: {0}.

Description: An expected NFS service process was not running, but cannot
be restarted.

Cause: The NFS server process was not detected and cannot be restarted.

User Action: Check the health state of the NFS server and restart, if
necessary. Check the issues, which lead to the unexpected failure. Ensure
that the kernel NFS server is not running.

nfsd_restart INFO WARNING FTDC upload Message: NFSD process restarted.

Description: An expected NFS service process was not running and was then restarted.

Cause: The NFS server process was not detected and restarted.

User Action: Check the health state of the NFS server and restart, if
necessary. Check the issues, which lead to the unexpected failure. Ensure
that the kernel NFS server is not running.

nfsd_up STATE_CHANGE INFO no Message: NFSD process is running.

Description: The NFS server process is detected.

Cause: N/A

User Action: N/A

nfsd_warn INFO WARNING no Message: NFSD process monitoring returned an unknown result.

Description: The NFS server process monitoring returned an unknown result.

Cause: The NFS server process state cannot be determined due to a problem.

User Action: Check the health state of the NFS server and restart, if
necessary.

nfsserver_found_pod INFO_ADD_ENTITY INFO no Message: The NFS server {id} was found.

Description: An NFS server was detected.

Cause: An NFS server was detected.

User Action: N/A


nfsserver_vanished_pod INFO_DELETE_ENTITY INFO no Message: The NFS Server {id} has vanished.

Description: A declared NFS server was not detected.

Cause: An NFS server is not in use for an IBM Storage Scale file system, which can be a valid situation.

User Action: N/A

nlockmgr_rpcinfo_ok STATE_CHANGE INFO no Message: The NFS nlockmgr service is listed by rpcinfo.

Description: A required service is listed.

Cause: The nlockmgr service is listed by rpcinfo.

User Action: N/A

nlockmgr_rpcinfo_unknown STATE_CHANGE WARNING no Message: The NFS nlockmgr service is not listed by rpcinfo.

Description: A required service is not listed.

Cause: The nlockmgr service is not listed by rpcinfo, but expected to run.

User Action: Check whether the issue is caused by an rpcbind crash or kernel NFS usage. Run the systemctl restart nfs-ganesha.service command.

portmapper_down STATE_CHANGE ERROR no Message: Portmapper port 111 is inactive.

Description: The portmapper is needed to provide the NFS services to clients.

Cause: The portmapper is not running on port 111.

User Action: Check whether the portmapper service is running, and if any
services are conflicting with the portmapper service on this system.

portmapper_down_pod STATE_CHANGE ERROR no Message: {id}: Portmapper port 111 is inactive.

Description: The portmapper is needed to provide the NFS services to clients.

Cause: The portmapper is not running on port 111.

User Action: Check whether the portmapper service is running, and if any
services are conflicting with the portmapper service on this system.

portmapper_up STATE_CHANGE INFO no Message: Portmapper port is now active.

Description: The portmapper is running on port 111.

Cause: N/A

User Action: N/A

portmapper_up_pod STATE_CHANGE INFO no Message: {id}: Portmapper port is now active.

Description: The portmapper is running on port 111.

Cause: N/A

User Action: N/A

portmapper_warn INFO WARNING no Message: Portmapper port monitoring (111) returned an unknown result.

Description: The portmapper process monitoring returned an unknown result.

Cause: The portmapper status cannot be determined due to a problem.

User Action: Restart the portmapper, if necessary.

portmapper_warn_pod INFO WARNING no Message: {id}: Portmapper port monitoring (111) returned an unknown
result.

Description: The portmapper process monitoring returned an unknown result.

Cause: The portmapper status cannot be determined due to a problem.

User Action: Restart the portmapper, if necessary.


postIpChange_info_n INFO_EXTERNAL INFO no Message: IP addresses are modified.

Description: Notification that IP addresses are moved around the cluster nodes.

Cause: CES IP addresses are moved or added to the node, and activated.

User Action: N/A

rpc_mountd_inv_user STATE_CHANGE WARNING no Message: The mount port {0} does not belong to IBM Storage Scale NFS.

Description: The required daemon process for NFS has an unexpected owner.

Cause: The mount daemon process does not belong to NFS Ganesha.

User Action: Check the NFS setup and ensure that the kernel NFS is not
activated.

rpc_mountd_ok STATE_CHANGE INFO no Message: The mountd service has the expected user.

Description: A required daemon process for NFS is functional.

Cause: The mount daemon process belongs to NFS Ganesha.

User Action: N/A

rpc_nfs_inv_user STATE_CHANGE WARNING no Message: The NFS port {0} does not belong to IBM Storage Scale NFS.

Description: The required program for NFS has an unexpected owner.

Cause: The NFS program does not belong to NFS Ganesha.

User Action: Check the NFS setup and ensure that the kernel NFS is not
activated.

rpc_nfs_ok STATE_CHANGE INFO no Message: The NFS service has the expected user.

Description: A required program for NFS is functional.

Cause: The NFS program process belongs to NFS Ganesha.

User Action: N/A

rpc_nlockmgr_inv_user STATE_CHANGE WARNING no Message: The nlockmgr port {0} does not belong to IBM Storage Scale NFS.

Description: The required daemon process for NFS has an unexpected owner.

Cause: The lock manager process does not belong to NFS Ganesha.

User Action: Check the NFS setup and ensure that the kernel NFS is not
activated.

rpc_nlockmgr_ok STATE_CHANGE INFO no Message: The nlockmgr service has the expected user.

Description: A required daemon process for NFS is functional.

Cause: The lock manager process belongs to NFS Ganesha.

User Action: N/A

rpc_rpcinfo_warn INFO WARNING no Message: The rpcinfo check returned an unknown result.

Description: The rpcinfo NFS program check returned an unknown result.

Cause: The rpcinfo status cannot be determined due to a problem.

User Action: Restart the portmapper, if necessary. Check the NFS configuration.

rpcbind_down STATE_CHANGE WARNING no Message: The rpcbind process is not running.

Description: The rpcbind process is used by NFS v3 and v4.

Cause: The rpcbind process is not running.

User Action: Start rpcbind manually by using the systemctl restart rpcbind command in case it does not start automatically. A restart of NFS might be needed to register its services to rpcbind.


rpcbind_down_pod STATE_CHANGE WARNING no Message: {id}: The rpcbind process is not running.

Description: The rpcbind process is used by NFS v3 and v4.

Cause: The rpcbind process is not running.

User Action: Start rpcbind manually by using the systemctl restart rpcbind command in case it does not start automatically. A restart of NFS might be needed to register its services to rpcbind.

rpcbind_restarted INFO INFO no Message: The rpcbind process is restarted.

Description: The rpcbind process is restarted.

Cause: The rpcbind process is restarted by the NFS monitor.

User Action: Check whether NFS processing works as expected. Otherwise, restart NFS to get registered to rpcbind.

rpcbind_unresponsive STATE_CHANGE ERROR no Message: The rpcbind process is unresponsive. Attempt to restart.

Description: The rpcbind process does not work. A restart can help.

Cause: The rpcbind process is unresponsive.

User Action: Start rpcbind manually by using the systemctl restart rpcbind command in case it does not start automatically.

rpcbind_unresponsive_pod STATE_CHANGE ERROR no Message: {id}: The rpcbind process is unresponsive. Attempt to restart.

Description: The rpcbind process does not work. A restart can help.

Cause: The rpcbind process is unresponsive.

User Action: Start rpcbind manually by using the systemctl restart rpcbind command in case it does not start automatically.

rpcbind_up STATE_CHANGE INFO no Message: The rpcbind process is running.

Description: The rpcbind process is running.

Cause: N/A

User Action: N/A

rpcbind_up_pod STATE_CHANGE INFO no Message: {id}: The rpcbind process is running.

Description: The rpcbind process is running.

Cause: N/A

User Action: N/A

rpcbind_warn STATE_CHANGE WARNING no Message: The rpcbind check failed with an issue.

Description: The rpcbind check failed with an issue, which might be temporary.

Cause: The rpcbind process is unresponsive.

User Action: N/A

rpcbind_warn_pod STATE_CHANGE WARNING no Message: {id}: The rpcbind check failed with an issue.

Description: The rpcbind check failed with an issue, which might be temporary.

Cause: The rpcbind process is unresponsive.

User Action: N/A

rquotad_down INFO INFO no Message: rpc.rquotad is not running.

Description: Currently, not in use.

Cause: N/A

User Action: N/A


rquotad_up INFO INFO no Message: rpc.rquotad is running.

Description: Currently, not in use.

Cause: N/A

User Action: N/A

start_nfs_service INFO_EXTERNAL INFO no Message: CES NFS service is started.

Description: Notification about an NFS service start.

Cause: The NFS service was started by issuing the mmces service
start nfs command.

User Action: N/A

start_nfs_service_pod INFO_EXTERNAL INFO no Message: {id}: NFS monitoring service is started.

Description: Notification about an NFS service start.

Cause: The NFS monitoring service was started.

User Action: N/A

statd_down STATE_CHANGE ERROR no Message: The rpc.statd process is not running.

Description: The statd process is used by NFS v3 to handle file locks.

Cause: The statd process is not running.

User Action: Stop and start the NFS service, which also attempts to start
the statd process.

statd_multiple STATE_CHANGE WARNING no Message: The rpc.statd process is running multiple times.

Description: The statd process is used by NFS v3 to handle file locks.

Cause: The statd process is running multiple times, which indicates either an issue with rpcbind or a manual start.

User Action: Stop and start the NFS service. This stops and restarts the
statd processes.

statd_up STATE_CHANGE INFO no Message: The rpc.statd process is running.

Description: The statd process is running, which is used by NFS v3 to handle file locks.

Cause: N/A

User Action: N/A

statd_wrong STATE_CHANGE WARNING no Message: The rpc.statd process is misconfigured.

Description: The statd process is used by NFS v3 to handle file locks.

Cause: The statd process was not started by NFS startup. The command
line parameter mmstatdcallout is missing or has an unexpected owner.

User Action: Stop and start the NFS service. This stops and restarts the
statd processes.

stop_nfs_service INFO_EXTERNAL INFO no Message: CES NFS service is stopped.

Description: Notification about an NFS service stop.

Cause: The NFS service was stopped by using the mmces service stop
nfs command.

User Action: N/A

stop_nfs_service_pod INFO_EXTERNAL INFO no Message: {id}: NFS monitoring service is stopped.

Description: Notification about an NFS service stop.

Cause: The NFS monitoring service was stopped.

User Action: N/A

NVMe events
The following table lists the events that are created for the NVMe component.
Table 94. Events for the NVMe component

Event | Event Type | Severity | Call Home | Details

nvme_found INFO_ADD_ENTITY INFO no Message: An NVMe controller {0} was found.

Description: An NVMe controller that was listed in the IBM Storage Scale
configuration was detected.

Cause: N/A

User Action: N/A

nvme_lbaformat_not_optimal STATE_CHANGE WARNING no Message: The NVMe device {0} does not show expected format.

Description: The LBA format of NVMe device is not formatted as expected.

Cause: The mmlsnvmestatus command reports that the LBA format is not optimal.

User Action: Check the NVMe device format for metadata size (expect ms: 0) and relative performance (expect rp: 0).
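
A sketch of how the format can be reviewed (/dev/nvme0n1 is a placeholder and the nvme-cli package is assumed to be installed):

   # Show the NVMe status as seen by IBM Storage Scale
   mmlsnvmestatus
   # List the LBA formats of the namespace; the format in use is flagged with '(in use)'
   nvme id-ns /dev/nvme0n1 | grep lbaf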

nvme_lbaformat_ok STATE_CHANGE INFO no Message: The NVMe device {0} shows expected format.

Description: The LBA format of NVMe device is formatted as expected.

Cause: N/A

User Action: N/A

nvme_linkstate_not_optimal STATE_CHANGE WARNING no Message: The NVMe device {0} reports a link state that does not match the
capabilities.

Description: The NVMe device does not have optimal link state.

Cause: The mmlsnvmestatus command reports that the link state is not optimal.

User Action: Check PCI link state of the NVMe device.

nvme_linkstate_ok STATE_CHANGE INFO no Message: The NVMe device {0} reports a link state that matches the
capabilities.

Description: The NVMe device reports an optimal link state.

Cause: N/A

User Action: N/A

nvme_needsservice STATE_CHANGE WARNING no Message: The NVMe controller {0} needs service.

Description: The NVMe controller needs service.

Cause: N/A

User Action: N/A

nvme_normal STATE_CHANGE INFO no Message: The NVMe controller {0} is OK.

Description: The NVMe controller state is NORMAL.

Cause: N/A

User Action: N/A

nvme_operationalmode_warn STATE_CHANGE WARNING no Message: The NVMe controller {0} encountered either internal errors or
supercap health issues.

Description: Internal errors or supercap health issues were encountered.

Cause: N/A

User Action: The user is expected to replace the card.


nvme_readonly_mode STATE_CHANGE WARNING no Message: NVMe controller {0} is moved to read-only mode.

Description: The device is moved to read-only mode.

Cause: The device is moved to read-only mode when the power source does not allow a backup, or when the flash spare block count reaches an unsupported threshold.

User Action: The user is expected to replace the card.

nvme_sparespace_low STATE_CHANGE WARNING no Message: The NVMe controller {0} either indicates program-erase cycles
greater than 90% or supercap end of lifetime is less than or equal to 2
months.

Description: The remaining vault backups until the end of life.

Cause: N/A

User Action: The user is expected to replace the card.

nvme_state_inconsistent STATE_CHANGE WARNING no Message: The NVMe controller {0} reports inconsistent state information.

Description: The NVMe controller reports that no service is needed, but the overall status has degraded.

Cause: N/A

User Action: N/A

nvme_temperature_warn STATE_CHANGE WARNING no Message: NVMe controller {0} reports that the CPU, System, or Supercap temperature is above or below a component's critical threshold.

Description: Temperature is greater than or less than a critical threshold.

Cause: N/A

User Action: Check the system cooling, such as air blocked or fan failed.

nvme_vanished INFO_DELETE_ENTITY INFO no Message: An NVMe controller {0} vanished.

Description: An NVMe controller, which was listed in the IBM Storage Scale
configuration, was not detected.

Cause: An NVMe controller, which was previously detected in the IBM Storage Scale configuration, was not found.

User Action: Run the 'nvme' command to verify that all expected NVMe
adapters exist.

nvmeof_raw_disk_absent STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is expected to be installed, but it is absent.

Description: An NVMeoF disk, which must not be exported to the GNR, is absent.

Cause: The hardware monitoring system fails to detect an NVMeoF disk, which must not be exported to the GNR.

User Action: Install or replace the reported NVMeoF disk.

nvmeof_raw_disk_enabled STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is installed, but not configured.

Description: An NVMeoF disk, which must not be exported to the GNR, is enabled.

Cause: The hardware monitoring system detects an unconfigured NVMeoF disk, which should not be exported to the GNR.

User Action: Configure the reported NVMeoF disk.

nvmeof_raw_disk_failed STATE_CHANGE WARNING no Message: The NVMeoF disk {id}, which is not exported to the GNR, reports
an unknown failure.

Description: An NVMeoF disk, which is not exported to the GNR, has failed.

Cause: The hardware monitoring system detected an unknown failure of an NVMeoF disk, which is not exported to the GNR.

User Action: Check whether the NVMeoF disk is correctly installed. For
more information, see the IBM Storage Scale: Problem Determination Guide
of the relevant system. Contact IBM support if you need more help.


nvmeof_raw_disk_found INFO_ADD_ENTITY INFO no Message: The NVMeoF disk {id}, which is not exported to the GNR, runs as
expected.

Description: NVMeoF disk, which is in the raw mode, is detected.

Cause: N/A

User Action: N/A

nvmeof_raw_disk_ok STATE_CHANGE INFO no Message: The NVMeoF disk {id}, which is not exported to the GNR, runs as
expected.

Description: NVMeoF disk, which is in the raw mode, runs as expected.

Cause: N/A

User Action: N/A

nvmeof_raw_disk_smart_failed STATE_CHANGE WARNING service ticket Message: The NVMeoF disk {id}, which is not exported to the GNR, should be replaced; otherwise a malfunction can occur.

Description: The smart assessment of an NVMeoF disk, which is not exported to the GNR, has failed.

Cause: An NVMeoF disk has a failed smart assessment. This disk is not
exported to the GNR.

User Action: Replace the disk. Contact IBM support if you need more help.
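
For example, the SMART data of the affected disk can be reviewed before it is replaced (the device name is a placeholder and the nvme-cli package is assumed to be installed):

   nvme smart-log /dev/nvme0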

nvmeof_raw_disk_smart_ok STATE_CHANGE INFO no Message: The smart assessment of an NVMeoF disk {id}, which is not
exported to the GNR, returns a healthy report.

Description: The smart assessment of an NVMeoF disk in the raw mode returns a healthy report.

Cause: N/A

User Action: N/A

nvmeof_raw_disk_smart_unknown STATE_CHANGE WARNING service ticket Message: The system is likely updating the status of an NVMeoF disk {id}. The process should be transient.

Description: No smart information is received from an NVMeoF disk, which is not exported to the GNR.

Cause: An NVMeoF disk does not report a smart assessment. This disk is not exported to the GNR.

User Action: If the hardware monitoring system continues to fail in getting the smart information of an NVMeoF disk for more than 15 minutes, then contact IBM support for further assistance.

nvmeof_raw_disk_standby_offline STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is set to an offline state by the user.
Description: An NVMeoF disk, which should not be exported to the GNR, is
set to an offline state.

Cause: The hardware monitoring system detects an NVMeoF disk that was
put to an offline state. This disk was not exported to the GNR before.

User Action: Activate the offline NVMeoF disk.

nvmeof_raw_disk_unavailable_offline STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is set offline for an unknown reason.

Description: An NVMeoF disk, which should not be exported to the GNR, is set offline without a known reason.

Cause: The hardware monitoring system detects an offline NVMeoF disk. This disk must not be exported to the GNR and is set offline without any known reason.

User Action: Check for possible problems like missing power, etc. For more
information, see the IBM Storage Scale: Problem Determination Guide of the
relevant system. Contact IBM support if you need more help.


nvmeof_raw_disk_unknown STATE_CHANGE WARNING no Message: The system is likely updating the status of an NVMeoF disk {id}.
The process should be transient.

Description: The status of an NVMeoF disk, which is not exported to the GNR, is unknown.

Cause: The hardware monitoring system fails to get the status of an NVMeoF disk, which is not exported to the GNR.

User Action: If the hardware monitoring continues to fail in getting the NVMeoF disk status for more than 15 minutes, then contact IBM support for more information.

nvmeof_raw_disk_vanished INFO_DELETE_ENTITY INFO no Message: An NVMeoF disk, which is in raw mode and was previously
reported, is not detected anymore.

Description: An NVMeoF disk in raw mode vanished.

Cause: An NVMeoF disk, which was previously detected in the IBM Storage
Scale configuration, was not found.

User Action: Verify that all expected NVMeoF disks in the raw mode exist in
the IBM Storage Scale configuration.

NVMeoF events
The following table lists the events that are created for the NVMeoF component.
Table 95. Events for the NVMeoF component

Event | Event Type | Severity | Call Home | Details

nvmeof_devices_missing STATE_CHANGE WARNING no Message: The following devices are configured, but do not exist: {0}.

Description: No NVMeoF devices exist.

Cause: No NVMeoF devices exist in the kernel configuration as a file.

User Action: Add the missing files or remove the NVMeoF definitions.

nvmeof_devices_not_configured STATE_CHANGE WARNING no Message: The following devices exist, but they are not configured: {0}.

Description: No NVMeoF device is configured in the CCR file nvmeofft_config.json.

Cause: An NVMeoF device exists in the kernel configuration, but is not configured in the CCR file nvmeofft_config.json.

User Action: Add the detected files to the NVMeoF configuration.

nvmeof_devices_ok STATE_CHANGE INFO no Message: No problems found for any NVMeoF devices.

Description: No problems found for any NVMeoF devices.

Cause: N/A

User Action: N/A

nvmeof_module_missing STATE_CHANGE ERROR no Message: NVMeoF kernel modules are missing: {0}.

Description: At least one NVMeoF kernel module is missing.

Cause: The lsmod command reported that at least one NVMeoF kernel
module is missing.

User Action: Install the missing module.

nvmeof_modules_installed STATE_CHANGE INFO no Message: NVMeoF modules are installed.

Description: All required NVMeoF modules are installed.

Cause: N/A

User Action: N/A


nvmeof_multipath_disabled STATE_CHANGE INFO no Message: Native multipath is disabled for NVMeoF.

Description: Native multipath is disabled for NVMeoF as required.

Cause: N/A

User Action: N/A

nvmeof_multipath_enabled STATE_CHANGE WARNING no Message: Native multipath is not disabled for NVMeoF, but disabled
multipath is required.

Description: Native multipath is not disabled for NVMeoF, but disabled multipath is required.

Cause: The /sys/module/nvme_core/parameters/multipath file contains the value Y.

User Action: Disable native multipath for NVMeoF.
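
A sketch of the check and one possible way to disable native NVMe multipath persistently (the exact procedure depends on the distribution; a reboot is required):

   # Y means native multipath is enabled, N means it is disabled
   cat /sys/module/nvme_core/parameters/multipath
   # Disable it through a module option and rebuild the initramfs
   echo "options nvme_core multipath=N" > /etc/modprobe.d/nvme-multipath.conf
   dracut -f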

nvmeof_packages_installed STATE_CHANGE INFO no Message: NVMeoF packages are installed.

Description: All required NVMeoF packages are installed.

Cause: N/A

User Action: N/A

nvmeof_packages_missing STATE_CHANGE ERROR no Message: NVMeoF related package is missing: {0}.

Description: At least one NVMeoF related package is missing.

Cause: The rpm -qa command reports that an NVMeoF related package is
missing.

User Action: Install the missing package.

nvmeof_target_device_caching_ok STATE_CHANGE INFO no Message: NVMeoF target device caching is write-through.

Description: Target device caching for NVMeoF is write-through.

Cause: N/A

User Action: N/A

nvmeof_target_device_caching_wrong STATE_CHANGE WARNING no Message: NVMeoF target device caching is not write-through for: {0}.

Description: Target device caching for NVMeoF is not write-through, which might lead to data loss or corruption.

Cause: Target device caching for NVMeoF is not write-through.

User Action: Set the target device caching for NVMeoF to write-through.

nvmeof_unknown_devices_configured STATE_CHANGE WARNING no Message: The following nonexistent devices are listed in the CCR file nvmeofft_config.json: {0}.

Description: At least one NVMeoF device is configured in the CCR file nvmeofft_config.json, but not registered in the kernel configuration.

Cause: An NVMeoF device is configured in the <nvmeof_ccr_config>, but not registered in the <nvmeof_kernel_config>.

User Action: Remove the detected files from CCR file nvmeofft_config.json.

Object events
The following table lists the events that are created for the Object component.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.

- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
Table 96. Events for the object component

Event | Event Type | Severity | Call Home | Details

account-auditor_failed STATE_CHANGE ERROR no Message: The account-auditor process should be {0}, but is {1}.

Description: The account-auditor process is not running.

Cause: The account-auditor process is not running.

User Action: Check the status of openstack-swift-account-auditor process.

account-auditor_ok STATE_CHANGE INFO no Message: The state of the account-auditor process, as expected, is {0}.

Description: The account-auditor process is running.

Cause: The account-auditor process is running.

User Action: N/A

account-auditor_warn INFO WARNING no Message: The account-auditor process monitoring returned an unknown
result.

Description: The account-auditor check returned an unknown result.

Cause: A status query for openstack-swift-account-auditor returned an unexpected error.

User Action: Check the service script and settings.

account-reaper_failed STATE_CHANGE ERROR no Message: The account-reaper process should be {0}, but is {1}.

Description: The account-reaper process is not running.

Cause: The account-reaper process is not running.

User Action: Check the status of openstack-swift-account-reaper process.

account-reaper_ok STATE_CHANGE INFO no Message: The state of account-reaper process, as expected, is {0}.

Description: The account-reaper process is running.

Cause: The account-reaper process is running.

User Action: N/A

account-reaper_warn INFO WARNING no Message: The account-reaper process monitoring returned an unknown
result.

Description: The account-reaper check returned an unknown result.

Cause: A status query for openstack-swift-account-reaper returned an unexpected error.

User Action: Check the service script and settings.

account-replicator_failed STATE_CHANGE ERROR no Message: The account-replicator process should be {0}, but is {1}.

Description: The account-replicator process is not running.

Cause: The account-replicator process is not running.

User Action: Check the status of openstack-swift-account-replicator process.

account-replicator_ok STATE_CHANGE INFO no Message: The state of account-replicator process, as expected, is {0}.

Description: The account-replicator process is running.

Cause: The account-replicator process is running.

User Action: N/A


account-replicator_warn INFO WARNING no Message: The account-replicator process monitoring returned an unknown
result.

Description: The account-replicator check returned an unknown result.

Cause: A status query for openstack-swift-account-replicator returned an unexpected error.

User Action: Check the service script and settings.

account-server_failed STATE_CHANGE ERROR no Message: The account process should be {0}, but is {1}.

Description: The account-server process is not running.

Cause: The account-server process is not running.

User Action: Check the status of openstack-swift-account process.

account-server_ok STATE_CHANGE INFO no Message: The state of account process, as expected, is {0}.

Description: The account-server process is running.

Cause: The account-server process is running.

User Action: N/A

account-server_warn INFO WARNING no Message: The account process monitoring returned an unknown result.

Description: The account-server check returned an unknown result.

Cause: A status query for openstack-swift-account returned an unexpected


error.

User Action: Check the service script and settings.

account_access_down STATE_CHANGE ERROR no Message: No access to account service ip {0} and port {1}. Check the
firewall.

Description: The access check of the account service port failed.

Cause: The port is probably blocked by a firewall rule.

User Action: Check whether the account service is running and the firewall
rules.

account_access_up STATE_CHANGE INFO no Message: Access to account service ip {0} and port {1} is OK.

Description: The access check of the account service port was successful.

Cause: N/A

User Action: N/A

account_access_warn INFO WARNING no Message: Account service access check ip {0} and port {1} failed. Check for
validity.

Description: The access check of the account service port returned an


unknown result.

Cause: The account service port access cannot be determined due to a


problem.

User Action: Find potential issues for this kind of failure in the logs.

container-auditor_failed STATE_CHANGE ERROR no Message: The container-auditor process should be {0}, but is {1}.

Description: The container-auditor process is not running.

Cause: The container-auditor process is not running.

User Action: Check the status of openstack-swift-container-auditor


process.

container-auditor_ok STATE_CHANGE INFO no Message: The state of container-auditor process, as expected, is {0}.

Description: The container-auditor process is running.

Cause: The container-auditor process is running.

User Action: N/A

container-auditor_warn INFO WARNING no Message: The container-auditor process monitoring returned an unknown
result.

Description: The container-auditor check returned an unknown result.

Cause: A status query for openstack-swift-container-auditor returned an


unexpected error.

User Action: Check the service script and settings.

container-replicator_failed STATE_CHANGE ERROR no Message: The container-replicator process should be {0}, but is {1}.

Description: The container-replicator process is not running.

Cause: The container-replicator process is not running.

User Action: Check the status of openstack-swift-container-replicator


process.

container-replicator_ok STATE_CHANGE INFO no Message: The state of container-replicator process, as expected, is {0}.

Description: The container-replicator process is running.

Cause: The container-replicator process is running.

User Action: N/A

container-replicator_warn INFO WARNING no Message: The container-replicator process monitoring returned an


unknown result.

Description: The container-replicator check returned an unknown result.

Cause: A status query for openstack-swift-container-replicator returned an


unexpected error.

User Action: Check the service script and settings.

container-server_failed STATE_CHANGE ERROR no Message: The container process should be {0}, but is {1}.

Description: The container-server process is not running.

Cause: The container-server process is not running.

User Action: Check the status of openstack-swift-container process.

container-server_ok STATE_CHANGE INFO no Message: The state of container process, as expected, is {0}.

Description: The container-server process is running.

Cause: The container-server process is running.

User Action: N/A

container-server_warn INFO WARNING no Message: The container process monitoring returned an unknown result.

Description: The container-server check returned an unknown result.

Cause: A status query for openstack-swift-container returned an


unexpected error.

User Action: Check the service script and settings.

container-updater_failed STATE_CHANGE ERROR no Message: The container-updater process should be {0}, but is {1}.

Description: The container-updater process is not running.

Cause: The container-updater process is not running.

User Action: Check the status of openstack-swift-container-updater


process.

container-updater_ok STATE_CHANGE INFO no Message: The state of container-updater process, as expected, is {0}.

Description: The container-updater process is running.

Cause: The container-updater process is running.

User Action: N/A

container-updater_warn INFO WARNING no Message: The container-updater process monitoring returned an unknown
result.

Description: The container-updater check returned an unknown result.

Cause: A status query for openstack-swift-container-updater returned an


unexpected error.

User Action: Check the service script and settings.

container_access_down STATE_CHANGE ERROR no Message: No access to container service ip {0} and port {1}. Check the
firewall.

Description: The access check to the container service port failed.

Cause: The port is probably blocked by a firewall rule.

User Action: Check whether the file system daemon is running and the
firewall rules.

container_access_up STATE_CHANGE INFO no Message: Access to container service ip {0} and port {1} is OK.

Description: The access check to the container service port was


successful.

Cause: N/A

User Action: N/A

container_access_warn INFO WARNING no Message: Container service access check ip {0} and port {1} failed. Check
for validity.

Description: The access check to the container service port returned an


unknown result.

Cause: The container service port access cannot be determined due to a


problem.

User Action: Find potential issues for this kind of failure in the logs.

disable_Address_database_node INFO_EXTERNAL INFO no Message: Disable the Address Database node.

Description: Event to signal that the database flag was removed from this
node.

Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.

User Action: N/A

disable_Address_singleton_node INFO_EXTERNAL INFO no Message: Disable the Address Singleton node.

Description: Event to signal that the singleton flag was removed from this
node.

Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.

User Action: N/A

enable_Address_database_node INFO_EXTERNAL INFO no Message: Enable the Address Database node.

Description: Event to signal that the database flag was moved to this node.

Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.

User Action: N/A

enable_Address_singleton_node INFO_EXTERNAL INFO no Message: Enable the Address Singleton node.

Description: Event to signal that the singleton flag was moved to this node.

Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.

User Action: N/A

ibmobjectizer_failed STATE_CHANGE ERROR no Message: The ibmobjectizer process should be {0}, but is {1}.

Description: The ibmobjectizer process is not in the expected state.

Cause: The ibmobjectizer process is expected to be running only on the


singleton node.

User Action: Check the status of ibmobjectizer process and object


singleton flag.

ibmobjectizer_ok STATE_CHANGE INFO no Message: The state of ibmobjectizer process, as expected, is {0}.

Description: The ibmobjectizer process is in the expected state.

Cause: The ibmobjectizer process is expected to be running only on the


singleton node.

User Action: N/A

ibmobjectizer_warn INFO WARNING no Message: The ibmobjectizer process monitoring returned an unknown
result.

Description: The ibmobjectizer check returned an unknown result.

Cause: A status query for ibmobjectizer returned an unexpected error.

User Action: Check the service script and settings.

memcached_failed STATE_CHANGE ERROR no Message: The memcached process should be {0}, but is {1}.

Description: The memcached process is not running.

Cause: The memcached process is not running.

User Action: Check the status of memcached process.

memcached_ok STATE_CHANGE INFO no Message: The state of memcached process, as expected, is {0}.

Description: The memcached process is running.

Cause: The memcached process is running.

User Action: N/A

memcached_warn INFO WARNING no Message: The memcached process monitoring returned an unknown
result.

Description: The memcached check returned an unknown result.

Cause: A status query for memcached returned an unexpected error.

User Action: Check the service script and settings.

obj_restart INFO WARNING no Message: The {0} service failed. Trying to recover.

Description: An object service was not in the expected state.

Cause: An object service might have stopped unexpectedly.

User Action: N/A

object-expirer_failed STATE_CHANGE ERROR no Message: The object-expirer process should be {0}, but is {1}.

Description: The object-expirer process is not in the expected state.

Cause: The object-expirer process is expected to be running only on the


singleton node.

User Action: Check the status of openstack-swift-object-expirer process


and object singleton flag.

object-expirer_ok STATE_CHANGE INFO no Message: The state of object-expirer process, as expected, is {0}.

Description: The object-expirer process is in the expected state.

Cause: The object-expirer process is expected to be running only on the


singleton node.

User Action: N/A

object-expirer_warn INFO WARNING no Message: The object-expirer process monitoring returned an unknown
result.

Description: The object-expirer check returned an unknown result.

Cause: A status query for openstack-swift-object-expirer returned an


unexpected error.

User Action: Check the service script and settings.

object-replicator_failed STATE_CHANGE ERROR no Message: The object-replicator process should be {0}, but is {1}.

Description: The object-replicator process is not running.

Cause: The object-replicator process is not running.

User Action: Check the status of openstack-swift-object-replicator


process.

object-replicator_ok STATE_CHANGE INFO no Message: The state of object-replicator process, as expected, is {0}.

Description: The object-replicator process is running.

Cause: The object-replicator process is running.

User Action: N/A

object-replicator_warn INFO WARNING no Message: The object-replicator process monitoring returned an unknown
result.

Description: The object-replicator check returned an unknown result.

Cause: A status query for openstack-swift-object-replicator returned an


unexpected error.

User Action: Check the service script and settings.

object-server_failed STATE_CHANGE ERROR no Message: The object process should be {0}, but is {1}.

Description: The object-server process is not running.

Cause: The object-server process is not running.

User Action: Check the status of openstack-swift-object process.

object-server_ok STATE_CHANGE INFO no Message: The state of object process, as expected, is {0}.

Description: The object-server process is running.

Cause: The object-server process is running.

User Action: N/A

object-server_warn INFO WARNING no Message: The object process monitoring returned an unknown result.

Description: The object-server check returned an unknown result.

Cause: A status query for openstack-swift-object-server returned an


unexpected error.

User Action: Check the service script and settings.

object-updater_failed STATE_CHANGE ERROR no Message: The object-updater process should be {0}, but is {1}.

Description: The object-updater process is not in the expected state.

Cause: The object-updater process is expected to be running only on the


singleton node.

User Action: Check the status of openstack-swift-object-updater process


and object singleton flag.

object-updater_ok STATE_CHANGE INFO no Message: The state of object-updater process, as expected, is {0}.

Description: The object-updater process is in the expected state.

Cause: The object-updater process is expected to be running only on the


singleton node.

User Action: N/A

object-updater_warn INFO WARNING no Message: The object-updater process monitoring returned an unknown
result.

Description: The object-updater check returned an unknown result.

Cause: A status query for openstack-swift-object-updater returned an


unexpected error.

User Action: Check the service script and settings.

object_access_down STATE_CHANGE ERROR no Message: No access to object store ip {0} and port {1}. Check the firewall.

Description: The access check to the object service port failed.

Cause: The port is probably blocked by a firewall rule.

User Action: Check whether the object service is running and the firewall
rules.

object_access_up STATE_CHANGE INFO no Message: Access to object store ip {0} and port {1} is OK.

Description: The access check to the object service port was successful.

Cause: N/A

User Action: N/A

object_access_warn INFO WARNING no Message: Object store access check ip {0} and port {1} failed.

Description: The access check to the object service port returned an


unknown result.

Cause: The object service port access cannot be determined due to a


problem.

User Action: Find potential issues for this kind of failure in the logs.

object_quarantined INFO_EXTERNAL WARNING no Message: The object "{0}", container "{1}", account "{2}" is quarantined.
Path of quarantined object: "{3}".

Description: The object, which was being accessed, is quarantined.

Cause: Mismatch in data and metadata is found.

User Action: N/A

object_sof_access_down STATE_CHANGE ERROR no Message: No access to unified object store ip {0} and port {1}. Check the
firewall.

Description: The access check to the unified object service port failed.

Cause: The port is probably blocked by a firewall rule.

User Action: Check whether the unified object service is running and the
firewall rules.

object_sof_access_up STATE_CHANGE INFO no Message: Access to unified object store ip {0} and port {1} is OK.

Description: The access check to the unified object service port was
successful.

Cause: N/A

User Action: N/A

object_sof_access_warn INFO WARNING no Message: Unified object store access check ip {0} and port {1} failed. Check
for validity.

Description: The access check to the object unified access service port
returned an unknown result.

Cause: The unified object service port access cannot be determined due to
a problem.

User Action: Find potential issues for this kind of failure in the logs.

openstack-object-sof_failed STATE_CHANGE ERROR no Message: The object-sof process should be {0}, but is {1}.

Description: The swift-on-file process is not in the expected state.

Cause: If the swift-on-file process is expected to be running, then the


capability is enabled, and stopped when disabled.

User Action: Check the status of openstack-swift-object-sof process and


capabilities flag in spectrum-scale-object.conf.

openstack-object-sof_ok STATE_CHANGE INFO no Message: The state of object-sof process, as expected, is {0}.

Description: The swift-on-file process is in the expected state.

Cause: If the swift-on-file process is expected to be running, then the


capability is enabled, and stopped when disabled.

User Action: N/A

openstack-object-sof_warn INFO INFO no Message: The object-sof process monitoring returned an unknown result.

Description: The openstack-swift-object-sof check returned an unknown


result.

Cause: A status query for openstack-swift-object-sof returned an


unexpected error.

User Action: Check the service script and settings.

openstack-swift-object-auditor_failed STATE_CHANGE ERROR no Message: The object-auditor process should be {0}, but is {1}.

Description: The object-auditor process is not in the expected state.

Cause: The openstack-swift-object-auditor process is expected to be


running only on the singleton node and when the capability multi-region
is enabled. It needs to be stopped in other cases.

User Action: Check the status of openstack-swift-object-auditor process


and capabilities flag in spectrum-scale-object.conf.

openstack-swift-object-auditor_ok STATE_CHANGE INFO no Message: The state of object-auditor process, as expected, is {0}.

Description: The object-auditor process is in the expected state.

Cause: The openstack-swift-object-auditor process is expected to be


running only on the singleton node and when the capability multi-region
is enabled. It needs to be stopped in other cases.

User Action: N/A

openstack-swift-object-auditor_warn INFO INFO no Message: The object-auditor process monitoring returned an unknown result.

Description: The openstack-swift-object-auditor check returned an


unknown result.

Cause: A status query for openstack-swift-object-auditor returned an


unexpected error.

User Action: Check the service script and settings.

postIpChange_info_o INFO_EXTERNAL INFO no Message: IP addresses are modified {0}.

Description: CES IP addresses are moved and activated.

Cause: N/A

User Action: N/A

proxy-httpd-server_failed STATE_CHANGE ERROR no Message: The proxy process should be {0}, but is {1}.

Description: The proxy-server process is not running.

Cause: The proxy-server process is not running.

User Action: Check the status of openstack-swift-proxy process.

proxy-httpd-server_ok STATE_CHANGE INFO no Message: The state of proxy process, as expected, is {0}.

Description: The proxy-server process is running.

Cause: The proxy-server process is running.

User Action: N/A

proxy-httpd-server_warn INFO WARNING no Message: The proxy process monitoring returned an unknown result.

Description: The monitoring process returned an unknown result.

Cause: A status query for HTTPd, which runs the proxy server, returned an
unexpected error.

User Action: Check the service script and settings.

proxy-server_failed STATE_CHANGE ERROR FTDC upload Message: The proxy process should be {0}, but is {1}.

Description: The proxy-server process is not running.

Cause: The proxy-server process is not running.

User Action: Check the status of openstack-swift-proxy process.

proxy-server_ok STATE_CHANGE INFO no Message: The state of proxy process, as expected, is {0}.

Description: The proxy-server process is running.

Cause: The proxy-server process is running.

User Action: N/A

proxy-server_warn INFO WARNING no Message: The proxy process monitoring returned an unknown result.

Description: The proxy-server process monitoring returned an unknown


result.

Cause: A status query for openstack-swift-proxy-server returned an


unexpected error.

User Action: Check the service script and settings.

proxy_access_down STATE_CHANGE ERROR no Message: No access to proxy service ip {0} and port {1}. Check the firewall.

Description: The access check to the proxy service port failed.

Cause: The port is probably blocked by a firewall rule.

User Action: Check whether the proxy service is running and the firewall
rules.

proxy_access_up STATE_CHANGE INFO no Message: Access to proxy service ip {0}, port {1} is OK.

Description: The access check of the proxy service port was successful.

Cause: N/A

User Action: N/A

proxy_access_warn INFO WARNING no Message: Proxy service access check ip {0} and port {1} failed. Check for
validity.

Description: The access check of the proxy service port returned an


unknown result.

Cause: The proxy service port access cannot be determined due to a


problem.

User Action: Find potential issues for this kind of failure in the logs.

ring_checksum_failed STATE_CHANGE ERROR FTDC upload Message: Checksum of ring file {0} does not match the one in CCR.

Description: Files for object rings were modified unexpectedly.

Cause: Checksum of ring file did not match the stored value.

User Action: Check the ring files.

ring_checksum_ok STATE_CHANGE INFO no Message: Checksum of ring file {0} is OK.

Description: Files for object rings were checked successfully.

Cause: Checksum of ring file is found unchanged.

User Action: N/A

ring_checksum_warn INFO WARNING no Message: Issue while checking checksum of ring file is {0}.

Description: Checksum generation process failed.

Cause: The ring_checksum check returned an unknown result.

User Action: Check whether the ring files and md5sum are executable.

start_obj_service INFO_EXTERNAL INFO no Message: OBJ service was started.

Description: Information about an OBJ service start.

Cause: The OBJECT service was started by using the mmces service
start obj command.

User Action: N/A

stop_obj_service INFO_EXTERNAL INFO no Message: OBJ service was stopped.

Description: Information about an OBJ service stop.

Cause: The OBJECT service was stopped by using the mmces service
stop obj command.

User Action: N/A
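Many of the user actions in Table 96 ask you to check the state of an individual Swift process or of the object protocol as a whole. The following commands are a minimal sketch of those checks on a CES protocol node; the systemd unit name is taken from the event text and is an assumption, because unit names can differ by release.

   # Summarize the object component health on this node
   mmhealth node show OBJECT

   # Check one Swift service that an event names, for example the account auditor
   systemctl status openstack-swift-account-auditor

   # Stop and restart the object protocol on this node, as referenced by the
   # start_obj_service and stop_obj_service events
   mmces service stop obj
   mmces service start obj

If an access event such as object_access_down is raised, also verify the firewall rules for the reported IP address and port before restarting the service.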

Performance events
The following table lists the events that are created for the Performance component.
Table 97. Events for the Performance component

Event | Event Type | Severity | Call Home | Details

pmcollector_down STATE_CHANGE ERROR no Message: The pmcollector service should be {0}, and is {1}.

Description: The performance monitor collector is down.

Cause: This node is in the global configuration of perfmon by using the mmperfmon config show command, but the pmcollector service is not running.

User Action: Start the service by using the service or systemctl command. Otherwise, remove the node from the global perfmon configuration by using the mmperfmon command.

pmcollector_port_down STATE_CHANGE ERROR no Message: Performance monitoring collector port {id} ({0}) is not
responding.

Description: One of the performance monitor collector ports is not responding. Check for the affected port in the event message. Queryport and Queryport2 are needed to run queries to get performance data. ipfixport is used to collect metrics and send them to the collector. Federationport is used by collectors to talk to each other.

Cause: Trying to establish a TCP connection to the local pmcollector port failed. This can be because pmcollector is down or does not respond to network requests.

User Action: Check the pmcollector process and logs and verify that it runs
correctly.

pmcollector_port_up STATE_CHANGE INFO no Message: Performance monitoring collector port {0} is responding.

Description: The performance monitor collector port is working.

Cause: The perfmon collector port is working as expected.

User Action: N/A

pmcollector_port_warn INFO INFO no Message: The pmcollector service port monitor returned an unknown
result.

Description: The monitoring service for performance monitor collector returned an unknown result.

Cause: Trying to establish a TCP connection to the local pmcollector port failed with an unexpected error.

User Action: Check the pmcollector process and logs and verify that it runs
correctly.

pmcollector_up STATE_CHANGE INFO no Message: The state of pmcollector service, as expected, is {0}.

Description: The performance monitor collector is running.

Cause: The perfmon collector service is running as expected.

User Action: N/A

pmcollector_warn INFO INFO no Message: The pmcollector service has returned an unknown result.

Description: The monitoring service for performance monitor collector returned an unknown result.

Cause: The service or systemctl command returned unknown results about the pmcollector service.

User Action: Check whether the pmcollector is in the expected status by using the service or systemctl command. If there is no pmcollector service, then irrespective of whether the node is in the global mmperfmon config, check the perfmon documentation.

pmsensors_down STATE_CHANGE ERROR no Message: The pmsensors service should be {0}, but is {1}.

Description: The performance monitor sensors are down.

Cause: The node has the perfmon designation in the mmlscluster command, but the pmsensors service is not running.

User Action: Start the service by using the service or systemctl command. Otherwise, remove the perfmon designation by using the mmchnode command.

pmsensors_up STATE_CHANGE INFO no Message: The state of pmsensors service, as expected, is {0}.

Description: The performance monitor sensors are running.

Cause: The perfmon sensors service is running as expected.

User Action: N/A

pmsensors_warn INFO INFO no Message: The pmsensors service returned an unknown result.

Description: The monitoring service for performance monitor sensors returned an unknown result.

Cause: The service or systemctl command returned unknown results about the pmsensors service.

User Action: Check whether the pmsensors is in the expected status by using the service or systemctl command. If there is no pmsensors service, then irrespective of whether the node has the perfmon designation in the mmlscluster command, check the perfmon documentation.
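The pmcollector and pmsensors user actions in Table 97 refer to the service or systemctl command and to the mmperfmon configuration. The following commands are a minimal sketch of those checks, assuming that systemd manages the services on the node.

   # Confirm that this node appears in the global performance monitoring configuration
   mmperfmon config show

   # Check the collector and sensor services that the events monitor
   systemctl status pmcollector
   systemctl status pmsensors

   # Restart a service that an event reports as down
   systemctl restart pmsensors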



Server RAID events
The following table lists the events that are created for the Server RAID component.
Table 98. Events for the Server RAID component

Event | Event Type | Severity | Call Home | Details

raid_adapter_clear STATE_CHANGE INFO no Message: No server RAID data was listed in the output of the test program.

Description: No server RAID data was listed in the output of the test
program /sbin/iprconfig.

Cause: No server RAID data was listed in the output of the test program.

User Action: N/A

raid_check_warn INFO WARNING no Message: The disk states of the mirrored root partition cannot be
determined.

Description: The server RAID test program failed or ran into a timeout.

Cause: The server RAID test program /sbin/iprconfig failed or ran into
a timeout.

User Action: N/A

raid_root_disk_bad STATE_CHANGE WARNING no Message: {id} Mirrored root partition disk failed.

Description: A disk of the mirrored (RAID 10) root partition failed.

Cause: A disk of the mirrored (RAID 10) root partition failed.

User Action: Replace faulty root disk.

raid_root_disk_ok STATE_CHANGE INFO no Message: {id} Mirrored root partition disk is OK.

Description: The disks of the mirrored (RAID 10) root partition are OK.

Cause: The disks of the mirrored (RAID 10) root partition are OK.

User Action: N/A

raid_sas_adapter_bad STATE_CHANGE WARNING no Message: IBM Power RAID adapter {0} is degraded, which impacts small
write performance.

Description: An IBM Power RAID adapter is degraded, which impacts small write performance.

Cause: An IBM Power RAID adapter is in a DEGRADED state. This was detected by the /sbin/iprconfig -c show-arrays command.

User Action: Check the RAID adapter card. For more information, execute
the /sbin/iprconfig -c show-arrays command.

raid_sas_adapter_ok STATE_CHANGE INFO no Message: IBM Power RAID adapter {0} is OK.

Description: An IBM Power RAID adapter is OK.

Cause: An IBM Power RAID adapter is OK.

User Action: N/A
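The Server RAID events are derived from the output of the /sbin/iprconfig test program that the table names. The following commands are a minimal sketch of how to inspect that output manually, assuming that the tool is installed on the node.

   # List the RAID arrays and their states; a DEGRADED array raises raid_sas_adapter_bad
   /sbin/iprconfig -c show-arrays

   # Review the node health summary for the corresponding events
   mmhealth node show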

SMB events
The following table lists the events that are created for the SMB component.
Table 99. Events for the SMB component

Event | Event Type | Severity | Call Home | Details

ctdb_down STATE_CHANGE ERROR FTDC upload Message: The CTDB process is not running.

Description: The CTDB process is not running.

Cause: N/A

User Action: Perform the troubleshooting.

ctdb_recovered STATE_CHANGE INFO no Message: The CTDB recovery is finished.

Description: CTDB completed the database recovery.

Cause: N/A

User Action: N/A

ctdb_recovery STATE_CHANGE WARNING no Message: The CTDB recovery is detected.

Description: CTDB is performing a database recovery.

Cause: N/A

User Action: N/A

ctdb_state_down STATE_CHANGE ERROR FTDC upload Message: The CTDB state is {0}.

Description: The CTDB state is unhealthy.

Cause: N/A

User Action: Perform the troubleshooting.

ctdb_state_up STATE_CHANGE INFO no Message: The CTDB state is healthy.

Description: The CTDB state is healthy.

Cause: N/A

User Action: N/A

ctdb_up STATE_CHANGE INFO no Message: The CTDB process is now running.

Description: The CTDB process is running.

Cause: N/A

User Action: N/A

ctdb_version_match STATE_CHANGE_EXTERNAL INFO no Message: CTDB passed the version check.

Description: The CTDB service successfully passed the version check.

Cause: The CTDB service successfully passed the version check on a node
and can join the running cluster. The node is given.

User Action: N/A

ctdb_version_mismatch STATE_CHANGE_EXTERNAL ERROR FTDC upload Message: Cannot start CTDB version {0} as {1} is already running in the cluster.

Description: CTDB cannot start on a node as it detected that a CTDB cluster is running at a different version on other CES nodes. This prevents the SMB service from becoming healthy.

Cause: CTDB cannot start on a node as it detected a conflicting version that is running in the cluster. The name of the failing node, the starting CTDB version, and the conflicting version that is running in the cluster are provided.

User Action: Get all gpfs-smb packages to the same version.

ctdb_warn INFO WARNING no Message: The CTDB monitoring returned an unknown result.

Description: The CTDB check returned an unknown result.

Cause: N/A

User Action: Perform the troubleshooting.

smb_exported_fs_chk STATE_CHANGE_EXTERNAL INFO no Message: The Cluster State Manager (CSM) cleared the smb_exported_fs_down event.

Description: Declared SMB exported file systems are either available again on this node or not available on any node.

Cause: The CSM cleared an smb_exported_fs_down event on this node to display the self-detected state of this node.

User Action: N/A

smb_exported_fs_down STATE_CHANGE_EXTERNAL ERROR no Message: One or more declared SMB exported file systems are not available on this node.

Description: One or more declared SMB exported file systems are not
available on this node. Other nodes might have those file systems available.

Cause: One or more declared SMB exported file systems are not available
on this node.

User Action: Check the SMB export-related local and remote file system
states.

smb_exports_clear_state STATE_CHANGE INFO no Message: Clear local SMB export down state temporarily.

Description: Clear the local SMB export down state temporarily because a message such as 'all nodes have the same problem' was received.

Cause: Clear local SMB export down state temporarily.

User Action: N/A

smb_exports_down STATE_CHANGE WARNING no Message: One or more declared file systems for SMB exports are not
available.

Description: One or more declared file systems for SMB exports are not
available.

Cause: One or more declared file systems for SMB exports are not
available.

User Action: Check local and remote file system states.

smb_exports_up STATE_CHANGE INFO no Message: All declared file systems for SMB exports are available.

Description: All declared file systems for SMB exports are available.

Cause: All declared file systems for SMB exports are available.

User Action: N/A

smb_restart INFO WARNING no Message: The SMB service failed. Trying to recover.

Description: Attempt to start the SMBD process.

Cause: The SMBD process was not running.

User Action: N/A

smb_sensors_active TIP INFO no Message: The SMB perfmon sensor {0} is active.

Description: The SMB perfmon sensors are active. This event's monitor is
running only once an hour.

Cause: The SMB perfmon sensors' period attribute is greater than 0.

User Action: N/A

smb_sensors_inactive TIP TIP no Message: The following SMB perfmon sensor {0} is inactive.

Description: The SMB perfmon sensors are inactive. This event's monitor is
running only once an hour.

Cause: The SMB perfmon sensors' period attribute is 0.

User Action: Set the period attribute of the SMB sensors greater than
0. Run the mmperfmon config update SensorName.period=N
command, where 'SensorName' is one of the SMB sensors' name and 'N'
is a natural number, that is greater than 0. Consider that this TIP monitor
is running only once per hour and might take up to 1 hour to detect the
changes in the configuration.

smb_sensors_not_configured TIP TIP no Message: The SMB perfmon sensor {0} is not configured.

Description: The SMB perfmon sensor does not exist in the mmperfmon
config show command.

Cause: The SMB perfmon sensor is not configured in the sensors configuration file.

User Action: Include the sensors into the perfmon configuration by using the mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_smb.cfg command. For more information on the configuration file, see the mmperfmon command section in the Command Reference Guide.

smbd_down STATE_CHANGE ERROR FTDC upload Message: SMBD process is not running.

Description: The SMBD process is not running.

Cause: N/A

User Action: Perform the troubleshooting.

smbd_up STATE_CHANGE INFO no Message: SMBD process is now running.

Description: The SMBD process is running.

Cause: N/A

User Action: N/A

smbd_warn INFO WARNING no Message: The SMBD process monitoring returned an unknown result.

Description: The SMBD process monitoring returned an unknown result.

Cause: N/A

User Action: Perform troubleshooting.

smbport_down STATE_CHANGE ERROR no Message: SMB port {0} is not active.

Description: SMBD is not listening on a TCP protocol port.

Cause: N/A

User Action: Perform troubleshooting.

smbport_up STATE_CHANGE INFO no Message: SMB port {0} is now active.

Description: An SMB port was activated.

Cause: N/A

User Action: N/A

smbport_warn INFO WARNING no Message: The SMB port monitoring {0} returned an unknown result.

Description: An internal error occurred while monitoring the SMB TCP


protocol ports.

Cause: N/A

User Action: Perform troubleshooting.

start_smb_service INFO_EXTERNAL INFO no Message: SMB service was started.

Description: Information about an SMB service start.

Cause: The SMB service was started by using the mmces service start
smb command.

User Action: N/A

stop_smb_service INFO_EXTERNAL INFO no Message: SMB service was stopped.

Description: Information about an SMB service stop.

Cause: The SMB service was stopped by using the mmces service stop
smb command.

User Action: N/A
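Several SMB user actions combine the mmperfmon sensor configuration with checks of the CTDB and SMBD services. The following commands are a minimal sketch of those steps; the sensor name SMBStats and the period value 10 are examples only and are not taken from the table.

   # Verify that the SMB perfmon sensors exist and have a period greater than 0
   mmperfmon config show

   # Add the SMB sensors if they are missing
   mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_smb.cfg

   # Activate one sensor (example sensor name and period)
   mmperfmon config update SMBStats.period=10

   # Check the SMB protocol health and restart it on this node if needed
   mmhealth node show SMB
   mmces service stop smb
   mmces service start smb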



Stretch cluster events
The following table lists the events that are created for the Stretch cluster component.
Table 100. Events for the Stretch cluster component

Event | Event Type | Severity | Call Home | Details

site_degraded_replication STATE_CHANGE WARNING no Message: Replication issues are reported at site {id}.

Description: Replication issues exist at the site.

Cause: Replication issues are reported at the site.

User Action: Check the health of the site recovery group and take any
corrective action, such as issuing the mmrestripefs command.

site_found INFO_ADD_ENTITY INFO no Message: Site {id} was found.

Description: Site is detected.

Cause: N/A

User Action: N/A

site_fs_desc_fail STATE_CHANGE ERROR no Message: Site {id} has no descriptor disks for all defined file systems.

Description: All file systems at the site have failure groups with no
descriptor disks.

Cause: No file systems contain descriptor disks at the site.

User Action: Check the health of the file system descriptor disks at the site
and ensure that they are working properly on all nodes.

site_fs_desc_ok STATE_CHANGE INFO no Message: Site {id} file system descriptor disk health is OK.

Description: Site file system descriptor disk health is OK.

Cause: N/A

User Action: N/A

site_fs_desc_warn STATE_CHANGE WARNING no Message: Site {id} file system {0} has no descriptor disks in failure groups
{1}.

Description: One or more file systems have descriptor disks that are
missing in the failure groups.

Cause: File system descriptor disks are missing at the site.

User Action: Check the health of the file system descriptor disks at the site
and ensure that they are working properly on all nodes.

site_fs_down STATE_CHANGE ERROR no Message: File system {0} is down or unavailable at site {id}.

Description: File system is unavailable on all nodes at the site.

Cause: File system is unavailable.

User Action: Check the health of the file system at the site and ensure that
it is properly mounted on all nodes.

site_fs_ok STATE_CHANGE INFO no Message: Site {id} file system health is OK.

Description: Site file system health is OK.

Cause: N/A

User Action: N/A

site_fs_quorum_fail STATE_CHANGE ERROR no Message: Site {id} file system {0} does not have enough healthy descriptor
disks for quorum.

Description: Not enough healthy descriptor disks are found.

Cause: The file system at the site does not have enough healthy descriptor
disks for quorum.

User Action: Check the health state of disks, which are declared as
descriptor disks for the file system, to prevent potential data loss. For more
information, see the Disk issues section in the IBM Storage Scale: Problem
Determination Guide.

site_fs_warn STATE_CHANGE WARNING no Message: Site {id} has {0} nodes that face file system issues with {1}.

Description: Many nodes face file system events at the site, which indicate
network, resource, or configuration issues.

Cause: Many nodes face file system events at the site.

User Action: Check the health of the file system at the site and ensure that
it is properly mounted on all nodes.

site_gpfs_down STATE_CHANGE ERROR no Message: GPFS is unavailable at the site {id}.

Description: GPFS is reported as unavailable at the site.

Cause: GPFS is reported as unavailable at the site.

User Action: Check the health of GPFS services at the site.

site_gpfs_ok STATE_CHANGE INFO no Message: Site {id} GPFS health is OK.

Description: Site GPFS health is OK.

Cause: N/A

User Action: N/A

site_gpfs_warn STATE_CHANGE WARNING no Message: Site {id} has {0} nodes that are facing GPFS unavailable health
events.

Description: Many nodes are facing GPFS unavailable events at the site,
which might indicate network, resource, or configuration issues.

Cause: Many nodes have reported GPFS unavailable events at the site.

User Action: Check the health of GPFS services at the site.

site_heartbeats_degraded STATE_CHANGE WARNING no Message: Site {id} has {0} nodes with missing heartbeat health events.

Description: Many nodes face missing heartbeat events at the site, which
might indicate network, resource, or configuration issues.

Cause: Many nodes face missing heartbeat events at the site.

User Action: Check the health of the site nodes.

site_heartbeats_ok STATE_CHANGE INFO no Message: Site {id} heartbeat is OK.

Description: Site heartbeats are healthy.

Cause: N/A

User Action: N/A

site_missing_heartbeats STATE_CHANGE ERROR no Message: Heartbeats are missing from site {id}.

Description: Heartbeats are missing from the site, which might indicate
network, resource, or configuration issues.

Cause: Heartbeats are missing from the site.

User Action: Check the health of the site.

site_ok STATE_CHANGE INFO no Message: Site is OK.

Description: Site is healthy.

Cause: N/A

User Action: N/A

site_quorum_down STATE_CHANGE ERROR no Message: Quorum unavailable is reported by site {id}.

Description: Quorum nodes cannot communicate with each other, which causes GPFS to lose quorum.

Cause: IBM Storage Scale quorum is unavailable.

User Action: Check the health of the GPFS quorum state by using the
mmgetstate command and take corrective actions.

site_quorum_error STATE_CHANGE ERROR no Message: Site {id} is experiencing quorum issues with site {0}.

Description: Site nodes are unable to contact the quorum nodes at another
site.

Cause: IBM Storage Scale quorum reports warnings.

User Action: Check the health of the GPFS quorum state by using the
mmgetstate command and take corrective actions.

site_quorum_ok STATE_CHANGE INFO no Message: Site {id} quorum health is OK.

Description: Site quorum health is OK.

Cause: N/A

User Action: N/A

site_replication_ok STATE_CHANGE INFO no Message: Site {id} replication health is OK.

Description: Site replication health is OK.

Cause: N/A

User Action: N/A

site_vanished INFO_DELETE_ENTITY INFO no Message: Site {id} is no longer configured as a stretch cluster site node.

Description: The site is no longer detected in the output of the


mmlsnodeclass command.

Cause: N/A

User Action: N/A
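The stretch cluster user actions ask you to verify quorum, node states, and file system descriptor disks. The following commands are a minimal sketch of those checks; gpfs0 is a placeholder file system name, and mmlsdisk -L is assumed here as the usual way to see which disks hold descriptor replicas.

   # Check the GPFS state and quorum information for all nodes
   mmgetstate -a

   # List the node classes that define the stretch cluster sites
   mmlsnodeclass

   # Show the disks of a file system, including which ones hold descriptor replicas
   mmlsdisk gpfs0 -L

   # Restore replication after a site repair, as suggested by site_degraded_replication
   mmrestripefs gpfs0 -r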

Transparent cloud tiering events


The following table lists the events that are created for the Transparent cloud tiering component.
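Many user actions in the following table point to the mmcloudgateway command and to the Transparent Cloud Tiering trace and error logs. The commands below are a minimal sketch of the most common recovery steps, using only subcommands that the events themselves name; each subcommand typically requires additional options that are omitted here.

   # Start the cloud gateway service when an account is configured but the service is down
   mmcloudgateway service start

   # Update the stored cloud provider account credentials after they change or expire
   mmcloudgateway account update

   # Rebuild the Transparent Cloud Tiering database if it is reported as corrupted
   mmcloudgateway files rebuildDB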
Table 101. Events for the Transparent cloud tiering component

Event | Event Type | Severity | Call Home | Details

tct_account_active STATE_CHANGE INFO no Message: Cloud provider account, which is configured with Transparent
Cloud Tiering service, is active.

Description: Cloud provider account, which is configured with Transparent


Cloud Tiering service, is active.

Cause: N/A

User Action: N/A

tct_account_bad_req STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a bad request error.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of a bad request error.

Cause: A bad request is encountered.

User Action: For more information, check the trace messages and error
logs.

tct_account_certinvalidpath STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because it was unable to find a valid certification path.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because it was unable to find a valid certification path.

Cause: Unable to find a valid certificate path.

User Action: For more information, check the trace messages and error
logs.

tct_account_configerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering refuses to connect to the cloud
provider.

Description: Transparent Cloud Tiering refuses to connect to the cloud


provider.

Cause: Some of the cloud provider-dependent services are down.

User Action: Check whether the cloud provider-dependent services are up


and running.

tct_account_configured STATE_CHANGE WARNING no Message: Cloud provider account is configured with Transparent Cloud
Tiering, but the service is down.

Description: Cloud provider account is configured with Transparent Cloud


Tiering, but the service is down.

Cause: The Transparent Cloud Tiering service is down.

User Action: Run the mmcloudgateway service start command to


resume the cloud gateway service.

tct_account_connecterror STATE_CHANGE ERROR no Message: An error occurred while attempting to connect a socket to cloud
provider URL.

Description: The connection was refused remotely by cloud provider.

Cause: No process is listening on the cloud provider address.

User Action: Check whether the cloud provider hostname and port
numbers are valid.

tct_account_containecreatererror STATE_CHANGE ERROR no Message: The cloud provider container creation failed.

Description: The cloud provider container creation failed.

Cause: The cloud provider account might not be authorized to create a


container.

User Action: For more information, check the trace messages and
error logs. Also, check whether the account creates issues. For more
information, see the 'Transparent Cloud Tiering issues' section in the
Problem Determination Guide.

tct_account_dbcorrupt STATE_CHANGE ERROR no Message: The database of Transparent Cloud Tiering service is corrupted.

Description: The database of Transparent Cloud Tiering service is


corrupted.

Cause: Database is corrupted.

User Action: For more information, check the trace messages and error
logs. Use the mmcloudgateway files rebuildDB command to repair
it.

tct_account_direrror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because one of its internal
directories is not found.

Description: Transparent Cloud Tiering failed because one of its internal


directories is not found.

Cause: Transparent Cloud Tiering service internal directory is missing.

User Action: For more information, check the trace messages and error
logs.

tct_account_invalidcredentials STATE_CHANGE ERROR no Message: The cloud provider account credentials are invalid.

Description: The Transparent Cloud Tiering service failed to connect to
cloud provider because the authentication failed.

Cause: Cloud provider account credentials that are either changed or


expired.

User Action: Run the mmcloudgateway account update command to


change the cloud provider account password.

tct_account_invalidurl STATE_CHANGE ERROR no Message: Cloud provider account URL is invalid.

Description: The reason can be because of 'HTTP 404 Not Found' error.

Cause: The reason can be because of 'HTTP 404 Not Found' error.

User Action: Check whether the cloud provider URL is valid.

tct_account_lkm_down STATE_CHANGE ERROR no Message: The local key manager, which is configured for Transparent Cloud
Tiering is not found or corrupted.

Description: The local key manager, which is configured for Transparent


Cloud Tiering is not found or corrupted.

Cause: Local key manager is not found or corrupted.

User Action: For more information, check the trace messages and error
logs.

tct_account_malformedurl STATE_CHANGE ERROR no Message: Cloud provider account URL is malformed.

Description: Cloud provider account URL is malformed.

Cause: The cloud provider URL is malformed.

User Action: Check whether the cloud provider URL is valid.

tct_account_manyretries INFO WARNING no Message: Transparent Cloud Tiering service faced too many internal
retries.

Description: Transparent Cloud Tiering service faced too many internal


retries.

Cause: The Transparent Cloud Tiering service faced connectivity issues


with the cloud provider.

User Action: For more information, check the trace messages and error
logs.

tct_account_network_down STATE_CHANGE ERROR no Message: The network connection to the Transparent Cloud Tiering node is
down.

Description: The network connection to the Transparent Cloud Tiering


node is down.

Cause: A network connection problem was encountered.

User Action: For more information, check the trace messages and error
logs. Also, check whether the network connection is valid.

tct_account_noroute STATE_CHANGE ERROR no Message: The response from cloud provider is invalid.

Description: The response from cloud provider is invalid.

Cause: The cloud provider URL returns a response code as '-1'.

User Action: Check whether the cloud provider URL is accessible.

tct_account_notconfigured STATE_CHANGE WARNING no Message: Transparent Cloud Tiering is not configured with the cloud
provider account.

Description: The Transparent Cloud Tiering is not configured with the cloud
provider account.

Cause: The Transparent Cloud Tiering installed, but account is not


configured or account is deleted.

User Action: Run the mmcloudgateway account create command to


create the cloud provider account.

tct_account_preconderror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a precondition failed error.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of a precondition failed error.

Cause: Cloud provider URL returned an 'HTTP 412 Precondition Failed'


error message.

User Action: For more information, check the trace messages and error
logs.

tct_account_rkm_down STATE_CHANGE ERROR no Message: The remote key manager, which is configured for Transparent
Cloud Tiering, is inaccessible.

Description: The remote key manager, which is configured for Transparent


Cloud Tiering, is inaccessible.

Cause: The Transparent Cloud Tiering fails to connect to the IBM Security
Key Lifecycle Manager.

User Action: For more information, check the trace messages and error
logs.

tct_account_servererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a cloud provider service encounters an unavailability error.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of a cloud provider server error or when the container
size reached the maximum storage limit.

Cause: Cloud provider returns an 'HTTP 503 Server' error message.

User Action: For more information, check the trace messages and error
logs.

tct_account_sockettimeout STATE_CHANGE ERROR no Message: Timeout occurred on a socket while connecting to the cloud
provider.

Description: Timeout occurred on a socket while connecting to the cloud


provider.

Cause: A network connection problem was encountered.

User Action: For more information, check the trace messages and error
log. Also, check whether the network connection is valid.

tct_account_sslbadcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a bad SSL certificate.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of a bad SSL certificate.

Cause: A bad SSL certificate is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslcerterror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud provider
because of an untrusted server certificate chain.

Description: Transparent Cloud Tiering failed to connect to the cloud


provider because of an untrusted server certificate chain.

Cause: An untrusted server certificate chain error was encountered.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of an error, which is found in the SSL subsystem.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of an error, which is found in the SSL subsystem.

Cause: An error is found in the SSL subsystem.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslhandshakeerror STATE_CHANGE ERROR no Message: The cloud account status failed due to an unknown SSL handshake error.

Description: The cloud account status failed due to an unknown SSL


handshake error.

Cause: TCT and cloud provider cannot negotiate the desired level of
security.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslhandshakefailed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider because they cannot negotiate the desired level of security.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because they cannot negotiate the desired level of security.

Cause: Transparent Cloud Tiering and cloud provider server cannot


negotiate the desired level of security.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslinvalidalgo STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud provider
because of invalid SSL algorithm parameters.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of invalid or inappropriate SSL algorithm parameters.

Cause: Invalid or inappropriate SSL algorithm parameters are found.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslinvalidpadding STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud provider because of invalid SSL padding.

Description: Transparent Cloud Tiering failed to connect to the cloud


provider because of invalid SSL padding.

Cause: Invalid SSL padding is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslkeyerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a bad SSL key or misconfiguration.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of a bad SSL key or misconfiguration.

Cause: A bad SSL key or misconfiguration is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslnocert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because no certificate is available.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because no certificate is available.

Cause: No certificate is available.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslnottrustedcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider because of an untrusted SSL server certificate.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of an untrusted SSL server certificate.

Cause: The cloud provider server SSL certificate is untrusted.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslpeererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because its identity cannot be verified.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because its identity cannot be verified.

Cause: Cloud provider identity cannot be verified.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslprotocolerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because an error is found during the SSL protocol operation.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because an error is found during the SSL protocol operation.

Cause: An SSL protocol implementation error is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslscoketclosed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because a remote host closed the connection during a handshake.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because a remote host closed the connection during a handshake.

Cause: Remote host had closed the connection during a handshake.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslunknowncert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of an unknown SSL certificate.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of an unknown SSL certificate.

Cause: An unknown SSL certificate is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_sslunrecognizedmsg STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider because of an unrecognized SSL message.

Description: Transparent Cloud Tiering fails to connect to the cloud


provider because of an unrecognized SSL message.

Cause: An unrecognized SSL message is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_timeskewerror STATE_CHANGE ERROR no Message: The time, which is observed on the Transparent Cloud Tiering
service node, is not in sync with the time on the target cloud provider.

Description: The time, which is observed on the Transparent Cloud Tiering


service node, is not in sync with the time on the target cloud provider.

Cause: Transparent Cloud Tiering service node current timestamp is not in


sync with the target cloud provider.

User Action: Change the Transparent Cloud Tiering service node


timestamp to be in sync with NTP server, and rerun the operation.

tct_account_unknownerror STATE_CHANGE ERROR no Message: The cloud provider account is inaccessible due to an unknown
error.

Description: The cloud provider account is inaccessible due to an unknown


error.

Cause: An unknown runtime exception is found.

User Action: For more information, check the trace messages and error
logs.

tct_account_unreachable STATE_CHANGE ERROR no Message: Cloud provider account URL is unreachable.

Description: The cloud provider end-point URL is unreachable because


either it is down or has network issues.

Cause: The cloud provider URL is unreachable.

User Action: For more information, check trace messages, error log, and
DNS settings.


tct_container_alreadyexists STATE_CHANGE ERROR no Message: The cloud provider container creation failed as it already exists.
CSAP/Container pair set: {id}.

Description: The cloud provider container creation failed as it already


exists.

Cause: The cloud provider container already exists.

User Action: For more information, check the trace messages and error
log.

tct_container_creatererror STATE_CHANGE ERROR no Message: The cloud provider container creation failed. CSAP/Container pair
set: {id}.

Description: The cloud provider container creation failed.

Cause: The cloud provider account might not be authorized to create a


container.

User Action: For more information, check the trace messages and error
log.

tct_container_limitexceeded STATE_CHANGE ERROR no Message: The cloud provider container creation failed as it exceeded the
maximum limit. CSAP/Container pair set: {id}.

Description: The cloud provider container creation failed as it exceeded


the maximum limit.

Cause: The cloud provider containers exceeded the maximum limit.

User Action: For more information, check the trace messages and error
log.

tct_container_notexists STATE_CHANGE ERROR no Message: The cloud provider container does not exist. CSAP/Container pair
set: {id}.

Description: The cloud provider container does not exist.

Cause: The cloud provider container does not exist.

User Action: Check the cloud provider to verify whether the container
exists.

tct_cs_disabled STATE_CHANGE INFO no Message: Cloud service {id} is disabled.

Description: Cloud service is disabled.

Cause: A Cloud service is disabled by the administrator.

User Action: N/A

tct_cs_enabled STATE_CHANGE INFO no Message: Cloud service {id} is enabled.

Description: Cloud service is enabled for cloud operations.

Cause: A Cloud service is enabled by the administrator.

User Action: N/A

tct_cs_found INFO_ADD_ENTITY INFO no Message: Cloud service {0} is found.

Description: A new cloud service is found.

Cause: A new cloud service is listed by the mmcloudgateway service


status command.

User Action: N/A

tct_cs_vanished INFO_DELETE_ENTITY INFO no Message: Cloud service {0} is unavailable.

Description: One of the cloud services cannot be detected anymore.

Cause: One of the previously monitored cloud services is not listed by


the mmcloudgateway service status command anymore, possibly
because Transparent Cloud Tiering service is in the suspended state.

User Action: N/A


tct_csap_access_denied STATE_CHANGE ERROR no Message: Cloud storage access point failed due to an authorization error.
CSAP/Container pair set: {id}.

Description: Access is denied due to an authorization error.

Cause: Access is denied due to an authorization error.

User Action: Check the authorization configurations on the cloud provider.

tct_csap_bad_req STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a bad request error. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of a bad request error.

Cause: Bad request is encountered.

User Action: For more information, check the trace messages and error
log.

tct_csap_base_found INFO_ADD_ENTITY INFO no Message: CSAP {0} was found.

Description: A new CSAP was found.

Cause: A new CSAP is listed by the mmcloudgateway service status


command.

User Action: N/A

tct_csap_base_removed INFO_DELETE_ENTITY INFO no Message: CSAP {0} was deleted or converted to a CSAP or container pair.

Description: A CSAP was deleted.

Cause: One of the previously monitored CSAPs is not listed by the


mmcloudgateway service status command anymore.

User Action: N/A

tct_csap_certinvalidpath STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect cloud storage access
point because it could not find a valid certification path. CSAP/Container
pair set: {id}.

Description: Transparent Cloud Tiering failed to connect cloud storage


access point because it could not find a valid certification path.

Cause: Unable to find a valid certificate path.

User Action: For more information, check the trace messages and error
log.

tct_csap_configerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering refused to connect to the cloud
storage access point. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering refused to connect to the cloud


storage access point.

Cause: Some of the cloud provider-dependent services are down.

User Action: Check whether the cloud provider-dependent services are up


and running.

tct_csap_connecterror STATE_CHANGE ERROR no Message: An error occurred while attempting to connect a socket to the
cloud storage access point URL. CSAP/Container pair set: {id}.

Description: The connection was refused remotely by the cloud storage


access point.

Cause: No process is listening to the cloud storage access point address.

User Action: Check whether the cloud storage access point hostname and
port numbers are valid.

tct_csap_dbcorrupt STATE_CHANGE ERROR no Message: The database of Transparent Cloud Tiering service is corrupted.
CSAP/Container pair set: {id}.

Description: The database of Transparent Cloud Tiering service is


corrupted.

Cause: Database is corrupted.

User Action: Run the mmcloudgateway files rebuildDB command to


rebuild the database.


tct_csap_forbidden STATE_CHANGE ERROR no Message: Cloud storage access point failed with an authorization error.
CSAP/Container pair set: {id}.

Description: The 'HTTP 403 Forbidden' error message is encountered.

Cause: The 'HTTP 403 Forbidden' error message is encountered.

User Action: Check the authorization configurations on the cloud provider.

tct_csap_found INFO_ADD_ENTITY INFO no Message: CSAP or container pair {0} was found.

Description: A new CSAP or container pair was found.

Cause: A new CSAP or container pair is listed by the mmcloudgateway


service status command.

User Action: N/A

tct_csap_invalidcredentials STATE_CHANGE ERROR no Message: The cloud storage access point account {0} credentials are
invalid. CSAP/Container pair set: {id}.

Description: The Transparent Cloud Tiering service fails to connect to the


cloud storage access point because the authentication failed.

Cause: Cloud storage access point account credentials are either changed
or expired.

User Action: Run the mmcloudgateway account update command to


change the cloud provider account password.

tct_csap_invalidurl STATE_CHANGE ERROR no Message: Cloud storage access point URL is invalid. CSAP/Container pair
set: {id}.

Description: An 'HTTP 404 Not Found' error message is encountered.

Cause: An 'HTTP 404 Not Found' error message is encountered.

User Action: Check whether the cloud provider URL is valid.

tct_csap_lkm_down STATE_CHANGE ERROR no Message: The local key manager, which is configured for Transparent Cloud
Tiering, is not found or corrupted. CSAP/Container pair set: {id}.

Description: The local key manager, which is configured for Transparent


Cloud Tiering, is not found or corrupted.

Cause: Local key manager is not found or corrupted.

User Action: For more information, check the trace messages and error
log.

tct_csap_malformedurl STATE_CHANGE ERROR no Message: Cloud storage access point URL is malformed. CSAP/Container
pair set: {id}.

Description: Cloud storage access point URL is malformed.

Cause: The cloud provider URL is malformed.

User Action: Check whether the cloud provider URL is valid.

tct_csap_noroute STATE_CHANGE ERROR no Message: The response from cloud storage access point is invalid. CSAP/
Container pair set: {id}.

Description: The response from cloud storage access point is invalid.

Cause: The cloud storage access point URL returns a response code '-1'.

User Action: Check whether the cloud storage access point URL is
accessible.

tct_csap_online STATE_CHANGE INFO no Message: Cloud storage access point, which is configured with Transparent
Cloud Tiering service, is active. CSAP/Container pair set: {id}.

Description: Cloud storage access point, which is configured with


Transparent Cloud Tiering service, is active.

Cause: N/A

User Action: N/A


tct_csap_preconderror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a precondition failed error. CSAP/Container pair
set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of a precondition failed error.

Cause: Cloud storage access point URL returns an 'HTTP 412 Precondition
Failed' error message.

User Action: For more information, check the trace messages and error
log.

tct_csap_removed INFO_DELETE_ENTITY INFO no Message: CSAP or container pair {0} is unavailable.

Description: A CSAP or container pair cannot be detected anymore.

Cause: One of the previously monitored CSAPs or container pair is not


listed by the mmcloudgateway service status command anymore,
possibly because Transparent Cloud Tiering service is in the suspended
state.

User Action: N/A

tct_csap_rkm_down STATE_CHANGE ERROR no Message: The remote key manager, which is configured for Transparent
Cloud Tiering, is inaccessible. CSAP/Container pair set: {id}.

Description: The remote key manager, which is configured for Transparent


Cloud Tiering, is inaccessible.

Cause: The Transparent Cloud Tiering failed to connect to the IBM Security
Key Lifecycle Manager.

User Action: For more information, check the trace messages and error
log.

tct_csap_servererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage access point because the cloud storage access point service encounters an unavailability error. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud


storage access point because of cloud storage access point server error
or container size reached the maximum storage limit.

Cause: Cloud storage access point returned an 'HTTP 503 Server' error
message.

User Action: For more information, check the trace messages and error
log.

tct_csap_sockettimeout STATE_CHANGE ERROR no Message: Timeout occurred on a socket while connecting to the cloud
storage access point URL. CSAP/Container pair set: {id}.

Description: Timeout occurred on a socket while connecting to the cloud


storage access point URL.

Cause: A network connection problem is encountered.

User Action: For more information, check the trace messages and error
log. Also, check whether the network connection is valid.

tct_csap_sslbadcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a bad SSL certificate. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of a bad SSL certificate.

Cause: A bad SSL certificate is encountered.

User Action: For more information, check the trace messages and error
log.


tct_csap_sslcerterror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an untrusted server certificate chain. CSAP/
Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of an untrusted server certificate chain.

Cause: An untrusted server certificate chain error is encountered.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an error that is found in the SSL subsystem. CSAP/
Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of an error that is found in the SSL subsystem.

Cause: An error is found in the SSL subsystem.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslhandshakeerror STATE_CHANGE ERROR no Message: The cloud storage access point status failed due to an unknown
SSL handshake error. CSAP/Container pair set: {id}.

Description: The cloud storage access point status failed due to an


unknown SSL handshake error.

Cause: Transparent Cloud Tiering and cloud storage access point cannot
negotiate the desired level of security.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslhandshakefailed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because they cannot negotiate the desired level of security.
CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because they cannot negotiate the desired level of security.

Cause: Transparent Cloud Tiering and cloud storage access point cannot
negotiate the desired level of security.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslinvalidalgo STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of invalid SSL algorithm parameters. CSAP/Container
pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of invalid or inappropriate SSL algorithm parameters.

Cause: Invalid or inappropriate SSL algorithm parameters are found.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslinvalidpadding STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of invalid SSL padding. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of invalid SSL padding.

Cause: Invalid SSL padding is found.

User Action: For more information, check the trace messages and error
log.


tct_csap_sslkeyerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a bad SSL key or misconfiguration. CSAP/Container
pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of a bad SSL key or misconfiguration.

Cause: A bad SSL key or misconfiguration is encountered.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslnocert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because no certificate is available. CSAP/Container pair set:
{id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because no certificate is available.

Cause: No certificate is available.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslnottrustedcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an untrusted server certificate. CSAP/Container
pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of an untrusted SSL server certificate.

Cause: The cloud storage access point server SSL certificate is untrusted.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslpeererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because its identity cannot be verified. CSAP/Container pair
set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because its identity cannot be verified.

Cause: Cloud provider identity cannot be verified.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslprotocolerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an error that is found in the SSL protocol operation.
CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of an error that is found in the SSL protocol operation.

Cause: An SSL protocol implementation error is found.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslscoketclosed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because the remote host closed the connection during a
handshake. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because the remote host closed the connection during a
handshake.

Cause: Remote host closed the connection during a handshake.

User Action: For more information, check the trace messages and error
log.


tct_csap_sslunknowncert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an unknown SSL certificate. CSAP/Container pair
set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of an unknown SSL certificate.

Cause: An unknown SSL certificate is found.

User Action: For more information, check the trace messages and error
log.

tct_csap_sslunrecognizedmsg STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an unrecognized SSL message. CSAP/Container
pair set: {id}.

Description: Transparent Cloud Tiering fails to connect to the cloud storage


access point because of an unrecognized SSL message.

Cause: An unrecognized SSL message is found.

User Action: For more information, check the trace messages and error
log.

tct_csap_timeskewerror STATE_CHANGE ERROR no Message: The time, which is observed on the Transparent Cloud Tiering
service node, is not in sync with the time on target cloud storage access
point. CSAP/Container pair set: {id}.

Description: The time, which is observed on the Transparent Cloud Tiering


service node, is not in sync with the time on target cloud storage access
point.

Cause: Transparent Cloud Tiering service node current timestamp is not in


sync with the target cloud storage access point.

User Action: Change Transparent Cloud Tiering service node timestamp to


be in sync with NTP server and rerun the operation.

tct_csap_toomanyretries INFO WARNING no Message: Transparent Cloud Tiering service experienced too many internal
retries. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering service experienced too many


internal retries.

Cause: The Transparent Cloud Tiering service might experience


connectivity issues with the cloud service provider.

User Action: For more information, check the trace messages and error
log.

tct_csap_unknownerror STATE_CHANGE ERROR no Message: The cloud storage access point account is inaccessible due to an
unknown error. CSAP/Container pair set: {id}.

Description: The cloud storage access point account is inaccessible due to


an unknown error.

Cause: An unknown runtime exception is found.

User Action: For more information, check the trace messages and error
log.

tct_csap_unreachable STATE_CHANGE ERROR no Message: Cloud storage access point URL is unreachable. CSAP/Container
pair set: {id}.

Description: The cloud storage access point URL is unreachable because it


is down or due to network issues.

Cause: The cloud storage access point URL is unreachable.

User Action: For more information, check the trace messages, error logs,
and DNS settings.


tct_dir_corrupted STATE_CHANGE ERROR no Message: The directory of Transparent Cloud Tiering service is corrupted.
CSAP/Container pair set: {id}.

Description: The directory of Transparent Cloud Tiering service is


corrupted.

Cause: Directory is corrupted.

User Action: For more information, check the trace messages and error
log.

tct_fs_configured STATE_CHANGE INFO no Message: The Transparent Cloud Tiering is configured with the file system.

Description: The Transparent Cloud Tiering is configured with the file


system.

Cause: N/A

User Action: N/A

tct_fs_corrupted STATE_CHANGE ERROR no Message: The file system {0} of Transparent Cloud Tiering service is
corrupted. CSAP/Container pair set: {id}.

Description: The file system of Transparent Cloud Tiering service is


corrupted.

Cause: File system is corrupted.

User Action: For more information, check the trace messages and error
log.

tct_fs_notconfigured STATE_CHANGE WARNING no Message: The Transparent Cloud Tiering is not configured with the file
system.

Description: The Transparent Cloud Tiering is not configured with the file
system.

Cause: The Transparent Cloud Tiering is installed, but the file system is not
configured or deleted.

User Action: Run the mmcloudgateway filesystem create command


to configure the file system.

tct_fs_running_out_space INFO WARNING no Message: Available disk space is {0}.

Description: File system, where Transparent Cloud Tiering got installed, is


running out of space.

Cause: File system, where Transparent Cloud Tiering got installed, is


running out of space.

User Action: Free up disk space on the file system where Transparent
Cloud Tiering is installed.

tct_internal_direrror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because one of its internal
directories is not found. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering failed because one of its internal


directories is not found.

Cause: Transparent Cloud Tiering service internal directory is missing.

User Action: For more information, check the trace messages and error
log.

tct_km_error STATE_CHANGE ERROR no Message: The key manager, which is configured for Transparent Cloud
Tiering, is not found or corrupted. CSAP/Container pair set: {id}.

Description: The key manager, which is configured for Transparent Cloud


Tiering, is not found or corrupted.

Cause: Key manager is not found or corrupted.

User Action: For more information, check the trace messages and error
log.


tct_network_interface_down STATE_CHANGE ERROR no Message: The network of Transparent Cloud Tiering node is down. CSAP/
Container pair set: {id}.

Description: The network of Transparent Cloud Tiering node is down.

Cause: A network connection problem is encountered.

User Action: For more information, check the trace messages and error
log. Also, check whether the network connection is valid.

tct_only_ensure STATE_CHANGE INFO no Message: Transparent Cloud Tiering container is available on cloud, but
does not guarantee that migrate operations would work. Container pair set:
{id}.

Description: Transparent Cloud Tiering container is available on cloud, but


does not guarantee that migrate operations would work.

Cause: Transparent Cloud Tiering container is available on cloud, but does


not guarantee that migrate operations would work.

User Action: For more information, check the trace messages and error
log.

tct_resourcefile_notfound STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because resource address file is
not found. CSAP/Container pair set: {id}.

Description: Transparent Cloud Tiering failed because resource address


file is not found.

Cause: Transparent Cloud Tiering failed because resource address file is


not found.

User Action: For more information, check the trace messages and error
log.

tct_rootdir_notfound STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because its container pair root
directory is not found. Container pair set: {id}.

Description: Transparent Cloud Tiering failed because its container pair


root directory is not found.

Cause: Transparent Cloud Tiering failed because its container pair root
directory is not found.

User Action: For more information, check the trace messages and error
log.

tct_service_down STATE_CHANGE ERROR no Message: Cloud gateway service is down.

Description: The cloud gateway service is down and cannot be started.

Cause: The mmcloudgateway service status command returned that


the cloud gateway service is stopped.

User Action: Run the mmcloudgateway service start command to


start the cloud gateway service.

tct_service_notconfigured STATE_CHANGE WARNING no Message: Transparent Cloud Tiering is not configured.

Description: The Transparent Cloud Tiering is not configured or its service


is not started.

Cause: The Transparent Cloud Tiering is not configured or its service is not
started.

User Action: Set up the Transparent Cloud Tiering and start its service.

tct_service_restart INFO WARNING no Message: The cloud gateway service failed. Trying to recover.

Description: Attempt to restart the cloud gateway process.

Cause: A problem with the cloud gateway process is detected.

User Action: N/A


tct_service_suspended STATE_CHANGE WARNING no Message: Cloud gateway service is suspended.

Description: The cloud gateway service is manually suspended.

Cause: The mmcloudgateway service status command returns


suspended.

User Action: Run the mmcloudgateway service start command to


resume the cloud gateway service.

tct_service_up STATE_CHANGE INFO no Message: Cloud gateway service is up and running.

Description: The cloud gateway service is up and running.

Cause: N/A

User Action: N/A

tct_service_warn INFO WARNING no Message: The cloud gateway monitoring returned an unknown result.

Description: The cloud gateway check returned an unknown result.

Cause: N/A

User Action: Perform troubleshooting. For more information, see the


Problem Determination Guide.
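
Many of the user actions in the preceding table come down to the same few checks: query the node health state, check the Cloud services with mmcloudgateway, and verify SSL connectivity to the cloud provider. The following sequence is only a minimal troubleshooting sketch built from the commands named above; the component name CLOUDGATEWAY, the endpoint cloud.example.com:443, and the certificate path are illustrative assumptions, so substitute the values from your own environment.

# Health state and recent Transparent Cloud Tiering events on this node
mmhealth node show CLOUDGATEWAY -v
mmhealth node eventlog | grep tct_

# State of the Cloud services, and restart if they are stopped or suspended
# (as recommended by tct_service_down and tct_service_suspended)
mmcloudgateway service status
mmcloudgateway service start

# For the tct_account_ssl* and tct_csap_ssl* events, confirm that a TLS
# session can be negotiated with the cloud provider endpoint at all
openssl s_client -connect cloud.example.com:443

# Inspect a server certificate (for example, before adding it to the
# truststore); the file name is a placeholder
openssl x509 -in /tmp/provider-cert.pem -text -noout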

Threshold events
The following table lists the events that are created for the threshold component.
Table 102. Events for the threshold component

Event | Event Type | Severity | Call Home | Details

activate_afm_inqueue_rule INFO INFO no Message: Detected AFM Gateway node {id}. Enabled AFM In Queue
Memory rule for threshold monitoring.

Description: Detected AFM Gateway node. The AFM InQueue Memory


threshold rule is included in the active monitoring process.

Cause: Detected AFM Gateway node to enable AFM InQueue memory


monitoring.

User Action: N/A

activate_smb_default_rules INFO INFO no Message: Detected new SMB exports. The SMBGlobalStats sensor on node
{id} is configured. Enable SMBConnPerNode_Rule and SMBConnTotal_Rule
for threshold monitoring.

Description: Detected new SMB exports. The default threshold rules,


SMBConnPerNode_Rule, and SMBConnTotal_Rule, are included in the
active monitoring process when they are not running.

Cause: Detected new SMB exports, and check the default rules that are
required to be enabled.

User Action: N/A

reset_threshold INFO INFO no Message: Requesting the current threshold states.

Description: Sysmon restart is detected, which requests the current


threshold states.

Cause: Node restarted.

User Action: N/A

thresh_monitor_del_active INFO_DELETE_ENTITY INFO no Message: The threshold monitoring process is no longer running in the ACTIVE state on the local node.

Description: Removed the ACTIVE threshold monitoring role from the


pmcollector that is running on the local node.

Cause: A threshold monitoring role was removed.

User Action: N/A


thresh_monitor_lost_active INFO INFO no Message: The pmcollector on node {id} has lost the active role of the
threshold monitoring.

Description: A pmcollector node lost the active role of the threshold


monitoring.

Cause: A pmcollector node lost the active role of the threshold monitoring.

User Action: N/A

thresh_monitor_set_active INFO_ADD_ENTITY INFO no Message: The threshold monitoring process is running in ACTIVE state on
the local node.

Description: Added the ACTIVE thresholds monitoring role to the


pmcollector that is running on the local node.

Cause: A threshold monitoring role was added.

User Action: N/A

thresholds_del_rule INFO_DELETE_ENTITY INFO no Message: Rule {0} was removed.

Description: A threshold rule was removed.

Cause: A threshold rule was removed.

User Action: N/A

thresholds_error STATE_CHANGE ERROR no Message: The value of {1} for the component(s) {id} exceeded the
threshold error level {0} defined in {2}.

Description: The threshold value reached an error level.

Cause: The threshold value reached an error level.

User Action: Run the mmhealth thresholds list -v command and


review the user action recommendations for the corresponding threshold
rule.

thresholds_new_rule INFO_ADD_ENTITY INFO no Message: Rule {0} was added.

Description: A threshold rule was added.

Cause: A threshold rule was added.

User Action: N/A

thresholds_no_data STATE_CHANGE INFO no Message: The value of {1} for the component(s) {id}, which is defined in {2},
returns no data.

Description: The thresholds value cannot be determined.

Cause: The thresholds value cannot be determined.

User Action: N/A

thresholds_no_rules STATE_CHANGE INFO no Message: No thresholds are defined.

Description: No thresholds are defined.

Cause: No thresholds are defined.

User Action: N/A

thresholds_normal STATE_CHANGE INFO no Message: The value of {1} defined in {2} for component {id} reached a
normal level.

Description: The threshold value reached a normal level.

Cause: The threshold value reached a normal level.

User Action: N/A

thresholds_removed STATE_CHANGE INFO no Message: The value of {1} for the component(s) {id}, which was defined in
{2}, was removed.

Description: The thresholds value cannot be determined.

Cause: No component usage data in performance monitoring.

User Action: N/A


thresholds_warn STATE_CHANGE WARNING no Message: The value of {1} for the component(s) {id} exceeded the
threshold warning level {0} defined in {2}.

Description: The threshold value reached a warning level.

Cause: The threshold value reached a warning level.

User Action: Run the mmhealth thresholds list -v command and


review the user action recommendations for the corresponding threshold
rule.
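
The user actions for thresholds_warn and thresholds_error above point to the mmhealth thresholds interface. The short sketch below shows how those checks are typically combined on the command line; it is illustrative only, and the component name THRESHOLD passed to mmhealth node show is an assumption about how the threshold events are grouped, so verify it against the output of mmhealth node show on your system.

# List the defined threshold rules with their warning and error levels,
# as recommended in the user actions above
mmhealth thresholds list -v

# Show the current state of the threshold component on this node
mmhealth node show THRESHOLD -v

# Review recent threshold state changes in the node event log
mmhealth node eventlog | grep thresholds_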

Watchfolder events
The following table lists the events that are created for the Watchfolder component.
Table 103. Events for the Watchfolder component

Event | Event Type | Severity | Call Home | Details

watchc_service_failed STATE_CHANGE ERROR no Message: Watchfolder consumer {1} for file system {0} is not running.

Description: Watchfolder consumer service is not running.

Cause: N/A

User Action: For more information, run the systemctl status


<consumername> and check '/var/adm/ras/mmwatch.log'.

watchc_service_ok STATE_CHANGE INFO no Message: Watchfolder consumer service for file system {0} is running.

Description: Watchfolder consumer service is running.

Cause: N/A

User Action: N/A

watchc_warn STATE_CHANGE WARNING no Message: Warning is encountered in watchfolder consumer for file system
{0}.

Description: Warning is encountered in the watchfolder consumer.

Cause: N/A

User Action: For more information, check '/var/adm/ras/mmwatch.log'.

watchconduit_err STATE_CHANGE ERROR no Message: {0} error: {1} is encountered in GPFS Watch Conduit for watch id
{id}.

Description: Error is encountered in GPFS Watch Conduit.

Cause: N/A

User Action: For more information, check '/var/adm/ras/mmwatch.log'.

watchconduit_found INFO_ADD_ENTITY INFO no Message: Watch conduit for watch id {id} was found.

Description: A watch conduit that is listed in the IBM Storage Scale


configuration was detected.

Cause: N/A

User Action: N/A

watchconduit_ok STATE_CHANGE INFO no Message: GPFS Watch Conduit for watch id {id} is running.

Description: GPFS Watch Conduit is running.

Cause: N/A

User Action: N/A

watchconduit_replay_done STATE_CHANGE INFO no Message: Conduit has finished replaying {0} events for watch id {id}.

Description: Conduit replay has finished replaying the events.

Cause: N/A

User Action: N/A


watchconduit_resume STATE_CHANGE INFO no Message: Conduit has finished producing {0} events to secondary sink for
watch id {id}.

Description: Conduit is resuming from a suspended state.

Cause: N/A

User Action: N/A

watchconduit_suspended STATE_CHANGE INFO no Message: GPFS Watch Conduit for watch id {id} is suspended.

Description: GPFS Watch Conduit is suspended.

Cause: N/A

User Action: N/A

watchconduit_vanished INFO_DELETE_ENTITY INFO no Message: GPFS Watch Conduit for watch id {id} has vanished.

Description: GPFS Watch Conduit, which is listed in the IBM Storage Scale
configuration, has been removed.

Cause: N/A

User Action: N/A

watchconduit_warn STATE_CHANGE WARNING no Message: {0} warning: {1} is encountered for watch id {id}.

Description: Warning is encountered in GPFS watch conduit.

Cause: N/A

User Action: For more information, check '/var/adm/ras/mmwatch.log'.

watchfolder_service_err STATE_CHANGE ERROR no Message: Error loading the librdkafka library for watchfolder producers.

Description: librdkafka is not installed on the current node.

Cause: N/A

User Action: Verify that 'gpfs.librdkafka' package is correctly installed.

watchfolder_service_ok STATE_CHANGE INFO no Message: Watchfolder service is running.

Description: Watchfolder service is running.

Cause: N/A

User Action: N/A

watchfolderp_auth_err STATE_CHANGE ERROR no Message: Error obtaining authentication credentials for Kafka
authentication. Error message: {2}.

Description: Authentication error encountered in event producer.

Cause: N/A

User Action: Verify that Watchfolder is properly configured.

watchfolderp_auth_info TIP TIP no Message: Authentication information for Kafka is not present or outdated.
Request to update credentials has been started and new credentials are
used on next event. Message: {2}.

Description: Event producer has no or outdated authentication


information.

Cause: N/A

User Action: N/A

watchfolderp_auth_warn STATE_CHANGE WARNING no Message: Authentication credentials for Kafka could not be obtained.
Attempts to update the credentials are made later. Message: {2}.

Description: Event producer failed to obtain authentication information.

Cause: N/A

User Action: N/A


watchfolderp_create_err STATE_CHANGE ERROR no Message: Error is encountered while creating, loading, or configuring a new event producer. Error message: {2}.

Description: Error is encountered when creating a new event producer.

Cause: N/A

User Action: Verify that correct gpfs.librdkafka is installed. For more


information, check '/var/adm/ras/mmfs.log.latest'.

watchfolderp_found INFO_ADD_ENTITY INFO no Message: New event producer for {id} was configured.

Description: A new event producer was configured.

Cause: N/A

User Action: N/A

watchfolderp_log_err STATE_CHANGE ERROR no Message: Error opening or writing to event producer log file.

Description: Log error encountered in event producer.

Cause: N/A

User Action: Verify that log directory and file /var/adm/ras/mmwf.log


is present and writeable. For more information, check the '/var/adm/ras/
mmfs.log.latest'.

watchfolderp_msg_send_err STATE_CHANGE ERROR no Message: Failed to send Kafka message for file system {2}. Error message:
{3}.

Description: Error while sending messages is encountered in the event


producer.

Cause: N/A

User Action: Check the connectivity to Kafka broker and topic, and
whether a broker can accept new messages for the given topic. For
more information, check '/var/adm/ras/mmfs.log.latest' and '/var/adm/ras/
mmmsgqueue.log'.

watchfolderp_msg_send_stop STATE_CHANGE ERROR no Message: Failed to send more than {2} Kafka messages. Producer is now shut down and no more messages are sent.

Description: Producer has reached an error threshold. The producer no


longer attempts to send messages.

Cause: N/A

User Action: To re-enable events, run the mmaudit all


producerRestart -N <Node> command. If that does not succeed,
then disable and re-enable the watchfolder. Use the mmwatch <device>
disable/enable command. However, if watchfolder fails again, then
disable and re-enable the message queue if you are running the message
queue. Run the mmmsgqueue enable/disable command, and then
enable the watchfolder. If the watchfolder continues to fail, then run the
mmmsgqueue config --remove command, then enable the message
queue, followed by enabling the watchfolder. If you are running with no
message queue, then ensure that the external sink is up and reachable
from this IBM Storage Scale node.

watchfolderp_msgq_unsupported STATE_CHANGE ERROR no Message: Message queue is no longer supported and no clustered watch folder or file audit logging commands can run until the message queue is removed.

Description: Message queue is no longer supported in IBM Storage Scale


and must be removed.

Cause: N/A

User Action: For more information, see the mmmsgqueue config --


remove-msgqueue command page in the Command Reference Guide.

watchfolderp_ok STATE_CHANGE INFO no Message: Event producer for file system {2} is OK.

Description: Event producer is OK.

Cause: N/A

User Action: N/A


watchfolderp_vanished INFO_DELETE_ENTITY INFO no Message: An event producer for {id} has been removed.

Description: An event producer has been removed.

Cause: N/A

User Action: N/A
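
Several of the Watchfolder user actions above repeat the same steps: check the consumer service, read /var/adm/ras/mmwatch.log, and, when a producer stops, restart it. The sketch below strings those documented steps together; the component name WATCHFOLDER is an assumption, and <consumername>, <Node>, and <device> are placeholders taken from the event text rather than fixed names.

# Health state of the watch folder component on this node
mmhealth node show WATCHFOLDER -v

# Consumer service named in watchc_service_failed
systemctl status <consumername>

# Log file referenced by most of the user actions above
tail -n 100 /var/adm/ras/mmwatch.log

# Recovery steps described for watchfolderp_msg_send_stop
mmaudit all producerRestart -N <Node>
mmwatch <device> disable
mmwatch <device> enable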

Transparent cloud tiering status description


The following table describes the various statuses, and their descriptions, associated with the health of Cloud services running on each node in the cluster.

Table 104. Cloud services status description


Entity: TCT Account Status

Not configured: The Transparent cloud tiering is installed, but the account is not configured or deleted.
Comments: Run the mmcloudgateway account create command to create a cloud provider account.

Active: The cloud provider account that is configured with Transparent cloud tiering service is active.
Comments: N/A

Configured: The cloud provider account is configured with Transparent cloud tiering, but the service is down.
Comments: Run the mmcloudgateway service start command to resume the cloud gateway service.

Unreachable: The cloud provider access point URL is unreachable because of network issues or when the cloud provider access point URL is down.
Comments: Ensure that the cloud provider is online. Check whether the network is reachable between the cloud provider and Transparent cloud tiering. Also, check the DNS settings. For more information, check the trace messages and error log.

invalid_csp_endpoint_URL: The reason might be because of an HTTP 404 Not Found error.
Comments: Check whether the configured cloud provider URL is valid.

malformed_URL: The cloud provider access point URL is malformed.
Comments: Check whether the following conditions are met:
• Cloud provider URL is valid.
• Cloud provider URL has a proper legal protocol, such as HTTP or HTTPS.
• Cloud provider URL does not have spaces or any special characters that cannot be parsed.

no_route_to_csp: The response from the cloud storage access point is invalid.
Comments: Check whether the following conditions are met:
• DNS and firewall settings are configured.
• Network is reachable between Transparent cloud tiering and the cloud provider.

connect_exception: The connection is refused remotely by the CSAP.
Comments: Check whether the following conditions are met:
• Network is reachable between Transparent cloud tiering and the cloud provider.
• Cloud provider URL is valid.
• Cloud provider receives connections.

socket_timeout: Timeout occurred on a socket while it was connecting to the cloud provider.
Comments: Check whether the network is reachable between Transparent cloud tiering and the cloud provider.

invalid_cloud_config: Transparent cloud tiering refuses to connect to the CSAP.
Comments: Check whether the cloud object store is configured correctly. For a Swift cloud provider, check whether both the Keystone and Swift provider configurations are proper. Also, check whether Swift is reachable over Keystone.

credentials_invalid: The Transparent cloud tiering service fails to connect to the CSAP because of failed authentication.
Comments: Check whether the access key and secret key are valid. Also, check whether the username and password are correct.

mcstore_node_network_down: The network of the Transparent cloud tiering node is down.
Comments: Check whether the network interface on the Transparent cloud tiering node is proper and is able to communicate with public and private networks.

ssl_handshake_exception: The CSAP fails due to an unknown SSL handshake error.
Comments: Check whether the following conditions are met:
• Cloud provider supports secured communication and is properly configured with a certificate chain.
• The provided cloud provider URL is secure (HTTPS).
• A secure connection to the cloud provider is established by running the openssl s_client -connect <cloud provider ipaddress>:<secured_port> command on a Transparent cloud tiering node.

SSL handshake certificate exception: Transparent cloud tiering failed to connect to the CSAP because of an untrusted server certificate chain or because the certificate is invalid.
Comments: Check whether the following conditions are met:
• The server certificate is not expired, still valid, and the DER is properly encoded.
• A self-signed or internal CA signed certificate is added to the Transparent cloud tiering truststore. Use the --server-cert-path option to add a self-signed certificate.
• The server certificate is valid by using the openssl x509 -in <server_cert> -text -noout command. Check the Validity section and ensure that the certificate is not expired and still valid.

SSL handshake sock closed exception: Transparent cloud tiering fails to connect to the CSAP because the remote host closed the connection during the handshake.
Comments: Check whether the following conditions are met:
• A secure network connection is established and that secure port is reachable.
• A secured connection is established to the cloud provider by running the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on the Transparent cloud tiering node.

SSL handshake bad certificate exception: Transparent cloud tiering fails to connect to the CSAP because the server certificate does not exist in the truststore.
Comments: Ensure that a self-signed or internal CA-signed certificate is properly added to the Transparent cloud tiering truststore. Use the --server-cert-path option to add a self-signed certificate.

SSL handshake invalid path certificate exception: Transparent cloud tiering fails to connect to the CSAP because it is unable to find a valid certification path.
Comments: Check whether the following conditions are met:
• A proper self-signed or internal CA-signed certificate is added to the Transparent cloud tiering truststore. Use the --server-cert-path option to add a self-signed certificate.
• The client (Transparent cloud tiering) and server (cloud provider) certificates are signed with the same certificate authority or configured with the same self-signed certificate.
• Proper cloud provider certificates are returned when a load balancer or firewall is present.
• A valid certificate chain is returned from the cloud provider by running the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on the Transparent cloud tiering node.

SSL handshake failure exception: Transparent cloud tiering fails to connect to the CSAP because it might not negotiate the required level of security.
Comments: Ensure that the cloud provider supports the TLSv1.2 protocol and TLSv1.2 enabled cipher suites.

SSL handshake unknown certificate exception: Transparent cloud tiering fails to connect to the CSAP because of an unknown certificate.
Comments: Ensure that a proper self-signed or internal CA-signed certificate is added to the Transparent cloud tiering truststore. Use the --server-cert-path option to add a self-signed certificate.

SSL key exception: Transparent cloud tiering fails to connect to the CSAP because of a bad SSL key or misconfiguration.
Comments: Check whether the following conditions are met:
• The SSL configuration on the cloud provider is proper.
• The Transparent cloud tiering truststore, /var/MCStore/.mcstore.jceks, is not corrupted. If the /var/MCStore/.mcstore.jceks is corrupted, then remove it and restart the server. This action replaces the /var/MCStore/.mcstore.jceks from the CCR file.

SSL peer unverified exception: Transparent cloud tiering fails to connect to the CSAP because its identity is not verified.
Comments: Check whether the following conditions are met:
• The cloud provider has a valid SSL configuration in place.
• The cloud provider supports the TLSv1.2 protocol and TLSv1.2 enabled cipher suites.
• A secure connection is in place to the cloud provider by using the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on the Transparent cloud tiering node.

SSL protocol exception: Transparent cloud tiering failed to connect to the cloud provider because of an error in the operation of the SSL protocol.
Comments: For more information, check trace messages and error logs.

SSL exception: Transparent cloud tiering failed to connect to the cloud provider because of an error in the SSL subsystem.
Comments: For more information, check trace messages and error logs.

SSL no certificate exception: Transparent cloud tiering failed to connect to the cloud provider because a certificate was not available.
Comments: For more information, check trace messages and error logs.

SSL not trusted certificate exception: Transparent cloud tiering failed to connect to the cloud provider because it might not locate a trusted server certificate.
Comments: For more information, check trace messages and error logs.

SSL invalid algorithm exception: Transparent cloud tiering failed to connect to the cloud provider because of invalid or inappropriate SSL algorithm parameters.
Comments: For more information, check trace messages and error logs.

SSL invalid padding exception: Transparent cloud tiering failed to connect to the cloud provider because of invalid SSL padding.
Comments: For more information, check trace messages and error logs.

SSL unrecognized message: Transparent cloud tiering failed to connect to the cloud provider because of an unrecognized SSL message.
Comments: For more information, check trace messages and error logs.

Bad request: Transparent cloud tiering failed to connect to the cloud provider because of a request error.
Comments: For more information, check trace messages and error logs.

Precondition failed: Transparent cloud tiering failed to connect to the cloud provider because of a precondition failed error.
Comments: For more information, check trace messages and error logs.

Default exception: The cloud provider account is not accessible due to an unknown error.
Comments: For more information, check trace messages and error logs.

Container creation failed: The cloud provider container creation failed. The cloud provider account might not be authorized to create the container.
Comments: For more information, check trace messages and error logs. Also, check the account creation problems under Transparent cloud tiering issues in the IBM Storage Scale: Problem Determination Guide.

Time skew: The time that is observed on the Transparent cloud tiering service node is not in sync with the time on the target cloud provider.
Comments: Change the Transparent cloud tiering service node timestamp to be in sync with the NTP server and rerun the operation.

Server error: Transparent cloud tiering failed to connect to the cloud provider because of a cloud provider server error (HTTP 503) or because the container size reached the maximum storage limit.
Comments: For more information, check trace messages and error logs.

Internal directory not found: Transparent cloud tiering failed because one of its internal directories is not found.
Comments: For more information, check trace messages and error logs.

Database corrupted: The database of the Transparent cloud tiering service is corrupted.
Comments: For more information, check trace messages and error logs. Run the mmcloudgateway files rebuildDB command to repair the database when any issues are found.

Entity: TCT File System Status

Not configured: Transparent cloud tiering is installed, but the file system is not configured or it was deleted.
Comments: Run the mmcloudgateway filesystem create command to configure the file system.

Configured: The Transparent cloud tiering is configured with a file system.
Comments: N/A

Entity: TCT Server Status

Stopped: The Transparent cloud tiering service is stopped by a CLI command or stopped itself due to some error.
Comments: Run the mmcloudgateway service start command to start the cloud gateway service.

Suspended: The cloud service was suspended manually.
Comments: Run the mmcloudgateway service start command to resume the cloud gateway service.

Started: The cloud gateway service is running.
Comments: N/A

Not configured: Transparent cloud tiering was either not configured or its services were never started.
Comments: Set up the Transparent cloud tiering and start the service.

Entity: Security

rkm down: The remote key manager that is configured for Transparent cloud tiering is not accessible.
Comments: For more information, check trace messages and error logs.

lkm down: The local key manager that is configured for Transparent cloud tiering is either not found or corrupted.
Comments: For more information, check trace messages and error logs.

Entity: Cloud Storage Access Point

ONLINE: The cloud storage access point that is configured with the Transparent cloud tiering service is active.
Comments: The state when the cloud storage access point is reachable, and credentials are valid.

OFFLINE: The cloud storage access point URL is unreachable because it is down or has network issues.
Comments: unreachable. The cloud provider end-point URL is unreachable.

OFFLINE: The reason might be because of an HTTP 404 Not Found error.
Comments: invalid_csp_endpoint_url. The specified end-point URL is invalid.

OFFLINE: The cloud storage access point URL is malformed.
Comments: malformed_url. The URL is malformed.

OFFLINE: The response from the cloud storage access point is invalid.
Comments: no_route_to_csp. There is no route to the CSP, which indicates a possible firewall issue.

OFFLINE: The connection is refused remotely by the cloud storage access point.
Comments: connect_exception. There is a connection exception when you connect.

OFFLINE: Timeout occurs on a socket while it connects to the cloud storage access point URL.
Comments: SOCKET_TIMEOUT. The socket times out.

OFFLINE: Transparent cloud tiering refuses to connect to the cloud storage access point.
Comments: invalid_cloud_config. There is an invalid cloud configuration, check the cloud.

OFFLINE: The Transparent cloud tiering service fails to connect to the cloud storage access point because the authentication fails.
Comments: credentials_invalid. The credentials that are provided during addCloud are no longer valid (including if the password is expired).

OFFLINE: The network of the Transparent cloud tiering node is down.
Comments: mcstore_node_network_down. The network is down on the mcstore node.

OFFLINE: The cloud storage access point status fails due to an unknown SSL handshake error.
Comments: ssl_handshake_exception. There is an SSL handshake exception.

OFFLINE: Transparent cloud tiering fails to connect to the cloud storage access point because of an untrusted server certificate chain.
Comments: ssl_handshake_cert_exception. There is an SSL handshake certificate exception.



  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because the remote host closed the connection during the handshake.
    Comments: ssl_handshake_sock_closed_exception. There is an SSL handshake socket closed exception.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of a bad certificate.
    Comments: ssl_handshake_bad_cert_exception. There is an SSL handshake bad certificate exception.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because it is unable to find a valid certification path.
    Comments: ssl_handshake_invalid_path_cert_exception. There is an SSL handshake invalid path certificate exception.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because it might not negotiate the required level of security.
    Comments: ssl_handshake_failure_exception. There is an SSL handshake failure exception.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of an unknown certificate.
    Comments: ssl_handshake_unknown_cert_exception. There is an SSL handshake unknown certificate exception.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of a bad SSL key or misconfiguration.
    Comments: ssl_key_exception. There is an SSL key exception.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because its identity is not verified.
    Comments: ssl_peer_unverified_exception. Check whether the following conditions are met:
      • The Transparent cloud tiering has a valid SSL configuration in place.
      • The cloud provider supports the TLSv1.2 protocol and TLSv1.2 enabled cipher suites.
      • A secured connection to the cloud provider is in place by using the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on the Transparent cloud tiering node.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of an SSL protocol operation error.
    Comments: ssl_protocol_exception. Ensure that the cloud provider SSL configuration supports the TLSv1.2 protocol.



  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of an SSL subsystem error.
    Comments: ssl_exception. Check whether a secured connection to the cloud provider is in place by using the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on the Transparent cloud tiering node. For more information, check the error log and traces to find the cause and take necessary actions, as this is a generic error.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because there is no certificate available.
    Comments: ssl_no_cert_exception. Check whether the following conditions are met:
      • A self-signed or internal root CA certificate is properly added to the Transparent cloud tiering truststore.
      • Add a self-signed certificate or internal root CA certificate by using the –server-cert-path option.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because the server certificate is not trusted.
    Comments: ssl_not_trusted_cert_exception.
      • If the cloud provider is using a self-signed certificate, make sure it is properly added to the Transparent cloud tiering truststore.
      • Use the –server-cert-path option to add a self-signed certificate.
      • If the cloud provider is using a third-party CA signed certificate, make sure that the certificate chain is properly established on the cloud provider.
      • Run this command on the Transparent cloud tiering node: openssl s_client -connect <cloud_provider_ipaddress>:<secured_port>



  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of invalid or inappropriate SSL algorithm parameters.
    Comments: ssl_invalid_algo_exception. Check whether the JRE security file (java.security) installed under /opt/ibm/MCStore/jre/lib/security got updated to an unsupported SSL algorithm or an invalid key length.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of invalid SSL padding.
    Comments: ssl_invalid_padding_exception. Make sure that the server SSL configuration uses TLSv1.2 enabled cipher suites.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of an unrecognized SSL message.
    Comments: ssl_unrecognized_msg. Check the following:
      • Check whether the cloud provider URL is secured (https).
      • Check whether the cloud provider supports SSL communication.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of a request error.
    Comments: bad_request. Check the following:
      • Check whether the cloud provider URL is valid.
      • Check whether the cloud provider is working well through other available tools.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of a precondition failed error.
    Comments: precondition_failed. Check whether the cloud provider is responding to requests through other available tools and make sure that the resource is accessible. Retry the operation once the server is able to serve the requests.
  Status: OFFLINE
    Description: The cloud storage access point account is not accessible due to an unknown error.
    Comments: default. Some unchecked exception is caught.
  Status: OFFLINE
    Description: The cloud provider container creation fails.
    Comments: CONTAINER_CREATE_FAILED. The container creation failed.
  Status: OFFLINE
    Description: The cloud provider container creation fails because the container already exists.
    Comments: CONTAINER_ALREADY_EXISTS. The container creation failed.



  Status: OFFLINE
    Description: The cloud provider container creation fails because it exceeds the maximum limit.
    Comments: BUCKET_LIMIT_EXCEEDED. The container creation failed.
  Status: OFFLINE
    Description: The cloud provider container does not exist.
    Comments: CONTAINER_DOES_NOT_EXIST. The cloud operations failed.
  Status: OFFLINE
    Description: The time observed on the Transparent cloud tiering service node is not in sync with the time observed on the target cloud storage access point.
    Comments: TIME_SKEW.
  Status: OFFLINE
    Description: Transparent cloud tiering fails to connect to the cloud storage access point because of a cloud storage access point server error, or because the container size reaches the maximum storage limit.
    Comments: SERVER_ERROR.
  Status: OFFLINE
    Description: Transparent cloud tiering fails because one of its internal directories is not found.
    Comments: The internal directory is not found and shows the internal_dir_notfound message.
  Status: OFFLINE
    Description: Transparent cloud tiering failed because the resource address file is not found.
    Comments: The resource address file is not found and shows the RESOURCE_ADDRESSES_FILE_NOT_FOUND message.
  Status: OFFLINE
    Description: The database of the Transparent cloud tiering service is corrupted.
    Comments: The Transparent cloud tiering database is corrupted and shows the db_corrupted message.
  Status: OFFLINE
    Description: The remote key manager that is configured for Transparent cloud tiering is not accessible.
    Comments: The SKLM server is not accessible. This condition is valid only when ISKLM is configured.
  Status: OFFLINE
    Description: The local key manager that is configured for Transparent cloud tiering is either not found or corrupted.
    Comments: The local .jks file is not found or corrupted. This condition is valid only when a local key manager is configured.
  Status: OFFLINE
    Description: The reason might be because of an HTTP 403 Forbidden error.
    Comments: The action is forbidden.
  Status: OFFLINE
    Description: Access is denied due to an authorization error.
    Comments: Access is denied.



  Status: OFFLINE
    Description: The file system of the Transparent cloud tiering service is corrupted.
    Comments: The file system is corrupted.
  Status: OFFLINE
    Description: The directory of the Transparent cloud tiering service is corrupted.
    Comments: There is a directory error.
  Status: OFFLINE
    Description: The key manager that is configured for Transparent cloud tiering is either not found or corrupted.
    Comments: There is a key manager error.
  Status: OFFLINE
    Description: Transparent cloud tiering failed because its container pair root directory is not found.
    Comments: The container pair root directory is not found.
    Description: The Transparent cloud tiering service has too many retries internally.
    Comments: There are too many retries.

Entity: Cloud Service
  Status: ENABLED
    Description: The cloud service is enabled for cloud operations.
    Comments: The Cloud service is enabled.
  Status: DISABLED
    Description: The cloud service is disabled.
    Comments: The Cloud service is disabled.

Entity: TCT Service
  Status: STOPPED
    Description: The cloud gateway service is down and might not be started.
    Comments: The server is stopped abruptly, for example, by a JVM crash.
  Status: SUSPENDED
    Description: The cloud gateway service is suspended manually.
    Comments: The TCT service is stopped intentionally.
  Status: STARTED
    Description: The cloud gateway service is up and running.
    Comments: The TCT service is started up and running.
    Description: The cloud gateway check returns an unknown result.
    Comments: The TCT service status has an unknown value.
    Description: Attempt to restart the cloud gateway process.
    Comments: The TCT service is restarted.
  Status: NOT_CONFIGURED
    Description: The Transparent cloud tiering was not configured or its service was never started.
    Comments: The TCT service is recently installed.



Entity: Container Pair Status
  Status: ONLINE
    Description: The container pair set is configured with an active Transparent cloud tiering service.
    Comments: The state when the container pair set is reachable.
  Status: OFFLINE
    Description: Transparent cloud tiering failed to connect to the container pair set because of a directory error.
    Comments: Connection failure due to directory_error or a TCT directory error.
    Description: Transparent cloud tiering failed to connect to the container pair set because the container does not exist.
    Comments: container_does_not_exist. The cloud container cannot be found but the cloud connection is active.
    Description: Transparent cloud tiering failed to connect to the cloud provider because of a precondition failed error.
    Comments: For more information, check trace messages and error logs.
    Description: Transparent cloud tiering failed to connect because the resource address file is not found.
    Comments: For more information, check trace messages and error logs.
    Description: Transparent cloud tiering failed to connect because one of its internal directories is not found.
    Comments: For more information, check trace messages and error logs.
    Description: Transparent cloud tiering failed because one of its internal directories is not found.
    Comments: For more information, check trace messages and error logs.
    Description: Transparent cloud tiering failed to connect to the cloud storage access point because of a cloud storage access point service unavailability error.
    Comments: For more information, check trace messages and error logs.
    Description: Transparent cloud tiering failed to connect to the cloud storage access point because of a bad certificate.
    Comments: For more information, check trace messages and error logs.
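Many of the SSL-related comments in the preceding table come down to one check: can the Transparent cloud tiering node negotiate a TLS 1.2 connection to the cloud provider endpoint with a certificate chain that it trusts? The openssl s_client command shown in the table is the documented way to verify this; the following is only a minimal sketch of the same check using Python's standard ssl module, and the host name and port are placeholders that you would replace with your cloud storage access point endpoint.

import socket
import ssl

# Placeholder values: substitute the endpoint that the cloud storage access point uses.
CSP_HOST = "cloud.example.com"
CSP_PORT = 443

def probe_tls12(host: str, port: int, timeout: float = 10.0) -> None:
    """Open a TLS 1.2 connection and print the negotiated protocol, cipher, and peer subject."""
    context = ssl.create_default_context()          # verifies the certificate chain and host name
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print("TLS version:", tls.version())
            print("Cipher suite:", tls.cipher())
            subject = dict(pair[0] for pair in tls.getpeercert()["subject"])
            print("Peer subject:", subject)

if __name__ == "__main__":
    probe_tls12(CSP_HOST, CSP_PORT)

If the handshake fails here with a certificate verification error, that corresponds to the ssl_no_cert_exception and ssl_not_trusted_cert_exception rows above: the provider certificate, or its issuing CA, is not in a trusted store.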

Cloud services audit events


Every operation that is performed using Cloud services is audited and recorded to meet the regulatory
requirements.
The audit details are saved in the file /var/MCStore/ras/audit/audit_events.json. You can parse this JSON file with standard JSON tools and use it for troubleshooting purposes. For the complete list of events, see Table 105 on page 727. The following is an example of the audit entry that is added to the JSON file after you successfully run the mmcloudgateway account create command.

{
  "typeURI": "http://schemas.dmtf.org/cloud/audit/1.0/event",
  "eventType": "activity",
  "id": "b4e9a5a9-0bf7-45ee-9e93-b6f825781328",
  "eventTime": "2017-08-21T18:46:10.439 UTC",
  "action": "create/create_cloudaccount",
  "outcome": "success",
  "initiator": {
    "id": "b22ec254-d645-43c4-a402-3e15757d8463",
    "typeURI": "data/security/account/admin",
    "name": "root",
    "host": {"address": "192.0.2.0"}
  },
  "target": {
    "id": "58347894-6a10-4218-a66d-357e4a3f4aaf",
    "typeURI": "service/storage/object/account",
    "name": "tct.cloudstorageaccesspoint"
  },
  "observer": {"id": "target"},
  "attachments": [
    {
      "content": "account-name=swift-account, cloud-type=openstack-swift, username=admin, tenant=admin, src-keystore-path=null, src-alias-name=null, src-keystore-type=null",
      "name": "swift-account",
      "contentType": "text"
    }
  ]
}
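Because the audit trail is plain JSON, a short script can summarize it, for example to count how many operations of each type succeeded or failed. The sketch below is illustrative only; it assumes the default file location shown above, and because the exact on-disk layout (a single JSON document versus one event per line) is not specified here, it accepts either form.

import json
from collections import Counter

AUDIT_FILE = "/var/MCStore/ras/audit/audit_events.json"   # default location shown above

def load_events(path: str):
    """Yield audit event dictionaries from the audit file (whole document or one object per line)."""
    with open(path) as f:
        text = f.read()
    try:
        data = json.loads(text)
        yield from (data if isinstance(data, list) else [data])
    except json.JSONDecodeError:
        for line in text.splitlines():
            line = line.strip()
            if line:
                yield json.loads(line)

def summarize(path: str = AUDIT_FILE) -> None:
    """Print a count of audit events per (action, outcome) pair."""
    counts = Counter(
        (event.get("action", "?"), event.get("outcome", "?"))
        for event in load_events(path)
    )
    for (action, outcome), n in counts.most_common():
        print(f"{n:6d}  {action}  ->  {outcome}")

if __name__ == "__main__":
    summarize()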



Table 105. Audit events
S.No Events
1 Add a cloud account – Success
2 Add a cloud account – Failure
3 Create a cloud storage access point – Success
4 Create a cloud storage access point – Failure
5 Create a cloud service – Success
6 Create a cloud service – Failure
7 Create a container pair – Success
8 Create a container pair – Failure
9 Create a key manager – Success
10 Create a key manager – Failure
11 Update a cloud account – Success
12 Update a cloud account - Failure
13 Update cloud storage access point – Success
14 Update cloud storage access point – Failure
15 Update cloud service – Success
16 Update cloud service – Failure
17 Update container pair – Success
18 Update container pair – Failure
19 Update key manager – Success
20 Update key manager – Failure
21 Delete a cloud account – Success
22 Delete a cloud account - Failure
23 Delete cloud storage access point – Success
24 Delete cloud storage access point – Failure
25 Delete cloud service – Success
26 Delete cloud service – Failure
27 Delete container pair – Success
28 Delete container pair – Failure
29 Rotate key manager – Success
30 Rotate key manager – Failure
31 Clean up orphan objects from the orphan table
32 Cloud destroy – Success
33 Cloud destroy – Failure
34 Export files – Success



35 Export files – Failure
36 Import files – Success
37 Import files – Failure
38 Migrate files – Success
39 Migrate files – Failure
40 Recall files – Success
41 Recall files – Failure
42 Remove cloud objects – Success
43 Remove cloud objects – Failure
44 Reconcile files – Success
45 Reconcile files – Failure
46 Rebuild DB – Success
47 Rebuild DB - Failure
48 Restore files – Success
49 Restore files – Failure
50 Run policy (lwe destroy) – Success
51 Run policy (lwe destroy) – Failure
52 Config set – Success
53 Config set - Failure

Messages
This topic contains explanations for GPFS error messages.
Messages for IBM Storage Scale RAID in the ranges 6027-1850 – 6027-1899 and 6027-3000 –
6027-3099 are documented in IBM Storage Scale RAID: Administration.

Message severity tags


GPFS has adopted a message severity tagging convention. This convention applies to some newer
messages and to some messages that are being updated and adapted to be more usable by scripts
or semi-automated management programs.
A severity tag is a one-character alphabetic code (A through Z), optionally followed by a colon (:) and a
number, and surrounded by an opening and closing bracket ([ ]). For example:

[E] or [E:nnn]

If more than one substring within a message matches this pattern (for example, [A] or [A:nnn]), the
severity tag is the first such matching string.
When the severity tag includes a numeric code (nnn), this is an error code associated with the message. If
this were the only problem encountered by the command, the command return code would be nnn.
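For scripts that post-process GPFS messages, the severity tag and the optional error code can be extracted with a simple regular expression that mirrors the format described above (the first bracketed match wins). This is an illustrative sketch only, not part of GPFS, and the second example string is made up.

import re

# A severity tag is a single letter A-Z, optionally followed by ":nnn", in square brackets.
SEVERITY_TAG = re.compile(r"\[([A-Z])(?::(\d+))?\]")

def parse_severity(message: str):
    """Return (severity_letter, error_code) for a message, or (None, None) if untagged."""
    match = SEVERITY_TAG.search(message)
    if match is None:
        return None, None
    code = int(match.group(2)) if match.group(2) else None
    return match.group(1), code

print(parse_severity("6027-306 [E] Could not initialize inter-node communication."))  # ('E', None)
print(parse_severity("hypothetical command output ... [E:93] ..."))                   # ('E', 93)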



If a message does not have a severity tag, the message does not conform to this specification. You can
determine the message severity by examining the text or any supplemental information provided in the
message catalog, or by contacting the IBM Support Center.
Each message severity tag has an assigned priority that can be used to filter the messages that are sent
to the error log on Linux. Filtering is controlled with the mmchconfig attribute systemLogLevel. The
default for systemLogLevel is error, which means GPFS will send all error [E], critical [X], and alert
[A] messages to the error log. The values allowed for systemLogLevel are: alert, critical, error,
warning, notice, configuration, informational, detail, or debug. Additionally, the value none
can be specified so no messages are sent to the error log.
Alert [A] messages have the highest priority, and debug [B] messages have the lowest priority. If the
systemLogLevel default of error is changed, only messages with the specified severity and all those
with a higher priority are sent to the error log. The following table lists the message severity tags in order
of priority:

Table 106. Message severity tags ordered by priority


Severity tag    Type of message (systemLogLevel attribute)    Meaning
A alert Indicates a problem where action must be taken immediately. Notify
the appropriate person to correct the problem.
X critical Indicates a critical condition that should be corrected immediately.
The system discovered an internal inconsistency of some kind.
Command execution might be halted or the system might attempt
to continue despite the inconsistency. Report these errors to the IBM
Support Center.
E error Indicates an error condition. Command execution might or might
not continue, but this error was likely caused by a persistent
condition and will remain until corrected by some other program or
administrative action. For example, a command operating on a single
file or other GPFS object might terminate upon encountering any
condition of severity E. As another example, a command operating on
a list of files, finding that one of the files has permission bits set that
disallow the operation, might continue to operate on all other files
within the specified list of files.
W warning Indicates a problem, but command execution continues. The problem
can be a transient inconsistency. It can be that the command
has skipped some operations on some objects, or is reporting an
irregularity that could be of interest. For example, if a multipass
command operating on many files discovers during its second pass
that a file that was present during the first pass is no longer present,
the file might have been removed by another command or program.
N notice Indicates a normal but significant condition. These events are
unusual but not error conditions, and might be summarized in an
email to developers or administrators for spotting potential problems.
No immediate action is required.
C configuration Indicates a configuration change; such as, creating a file system or
removing a node from the cluster.
I informational Indicates normal operation. This message by itself indicates that
nothing is wrong; no action is required.
D detail Indicates verbose operational messages; no action is required.



B debug Indicates debug-level messages that are useful to application
developers for debugging purposes. This information is not useful
during operations.
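A log-reading script can apply the same kind of priority filtering that the systemLogLevel attribute applies inside GPFS by using the order given in Table 106. The sketch below is illustrative only; it is not how GPFS itself filters messages, the function names are arbitrary, and the commented-out log path is an example.

import re

# Severity tags from Table 106, ordered from highest priority (alert) to lowest (debug).
SEVERITY_ORDER = ["A", "X", "E", "W", "N", "C", "I", "D", "B"]

# A severity tag is a single letter, optionally followed by ":nnn", in square brackets.
SEVERITY_TAG = re.compile(r"\[([A-Z])(?::\d+)?\]")

def keep(line: str, minimum: str = "E") -> bool:
    """Return True if the line carries a severity tag at or above the 'minimum' priority."""
    match = SEVERITY_TAG.search(line)
    if not match or match.group(1) not in SEVERITY_ORDER:
        return False  # untagged lines are left to other handling
    return SEVERITY_ORDER.index(match.group(1)) <= SEVERITY_ORDER.index(minimum)

# Example: print only warning-or-worse lines from a GPFS log file.
# with open("/var/adm/ras/mmfs.log.latest") as log:
#     for line in log:
#         if keep(line, minimum="W"):
#             print(line, end="")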

6027-000 Attention: A disk being removed reduces the number of failure groups to nFailureGroups, which is below the number required for replication: nReplicas.
Explanation: Replication cannot protect data against disk failures when there are insufficient failure groups.
User response: Add more disks in new failure groups to the file system or accept the risk of data loss.

6027-300 [N] mmfsd ready
Explanation: The mmfsd server is up and running.
User response: None. Informational message only.

6027-301 File fileName could not be run with err errno.
Explanation: The named shell script could not be executed. This message is followed by the error string that is returned by the exec.
User response: Check file existence and access permissions.

6027-302 [E] Could not execute script
Explanation: The verifyGpfsReady=yes configuration attribute is set, but the /var/mmfs/etc/gpfsready script could not be executed.
User response: Make sure /var/mmfs/etc/gpfsready exists and is executable, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.

6027-303 [N] script killed by signal signal
Explanation: The verifyGpfsReady=yes configuration attribute is set and /var/mmfs/etc/gpfsready script did not complete successfully.
User response: Make sure /var/mmfs/etc/gpfsready completes and returns a zero exit status, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.

6027-304 [W] script ended abnormally
Explanation: The verifyGpfsReady=yes configuration attribute is set and /var/mmfs/etc/gpfsready script did not complete successfully.
User response: Make sure /var/mmfs/etc/gpfsready completes and returns a zero exit status, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.

6027-305 [N] script failed with exit code code
Explanation: The verifyGpfsReady=yes configuration attribute is set and /var/mmfs/etc/gpfsready script did not complete successfully.
User response: Make sure /var/mmfs/etc/gpfsready completes and returns a zero exit status, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.

6027-306 [E] Could not initialize inter-node communication.
Explanation: The GPFS daemon was unable to initialize the communications required to proceed.
User response: User action depends on the return code shown in the accompanying message (/usr/include/errno.h). The communications failure that caused the failure must be corrected. One possibility is an rc value of 67, indicating that the required port is unavailable. This may mean that a previous version of the mmfs daemon is still running. Killing that daemon may resolve the problem.



6027-307 [E] All tries for command thread allocation failed for msgCommand minor commandMinorNumber
Explanation: The GPFS daemon exhausted all tries and was unable to allocate a thread to process an incoming command message. This might impact the cluster.
User response: Evaluate thread usage using mmfsadm dump threads. Consider tuning the workerThreads parameter using the mmchconfig and then retry the command.

6027-310 [I] command initializing. {Version versionName: Built date time}
Explanation: The mmfsd server has started execution.
User response: None. Informational message only.

6027-311 [N] programName is shutting down.
Explanation: The stated program is about to terminate.
User response: None. Informational message only.

6027-312 [E] Unknown trace class 'traceClass'.
Explanation: The trace class is not recognized.
User response: Specify a valid trace class.

6027-313 [X] Cannot open configuration file fileName.
Explanation: The configuration file could not be opened.
User response: The configuration file is /var/mmfs/gen/mmfs.cfg. Verify that this file and /var/mmfs/gen/mmsdrfs exist in your system.

6027-314 [E] command requires SuperuserName authority to execute.
Explanation: The mmfsd server was started by a user without superuser authority.
User response: Log on as a superuser and reissue the command.

6027-315 [E] Bad config file entry in fileName, line number.
Explanation: The configuration file has an incorrect entry.
User response: Fix the syntax error in the configuration file. Verify that you are not using a configuration file that was created on a release of GPFS subsequent to the one that you are currently running.

6027-316 [E] Unknown config parameter "parameter" in fileName, line number.
Explanation: There is an unknown parameter in the configuration file.
User response: Fix the syntax error in the configuration file. Verify that you are not using a configuration file that was created on a release of GPFS subsequent to the one you are currently running.

6027-317 [A] Old server with PID pid still running.
Explanation: An old copy of mmfsd is still running.
User response: This message would occur only if the user bypasses the SRC. The normal message in this case would be an SRC message stating that multiple instances are not allowed. If it occurs, stop the previous instance and use the SRC commands to restart the daemon.

6027-318 [E] Watchdog: Some process appears stuck; stopped the daemon process.
Explanation: A high priority process got into a loop.
User response: Stop the old instance of the mmfs server, then restart it.

6027-319 Could not create shared segment
Explanation: The shared segment could not be created.
User response: This is an error from the AIX operating system. Check the accompanying error indications from AIX.

6027-320 Could not map shared segment
Explanation: The shared segment could not be attached.
User response: This is an error from the AIX operating system. Check the accompanying error indications from AIX.

6027-321 Shared segment mapped at wrong address (is value, should be value).
Explanation: The shared segment did not get mapped to the expected address.
User response: Contact the IBM Support Center.

Chapter 42. References 731


6027-322 Could not map shared segment in kernel extension
Explanation: The shared segment could not be mapped in the kernel.
User response: If an EINVAL error message is displayed, the kernel extension could not use the shared segment because it did not have the correct GPFS version number. Unload the kernel extension and restart the GPFS daemon.

6027-323 [A] Error unmapping shared segment.
Explanation: The shared segment could not be detached.
User response: Check reason given by error message.

6027-324 Could not create message queue for main process
Explanation: The message queue for the main process could not be created. This is probably an operating system error.
User response: Contact the IBM Support Center.

6027-328 [W] Value 'value' for 'parameter' is out of range in fileName. Valid values are value through value. value used.
Explanation: An error was found in the /var/mmfs/gen/mmfs.cfg file.
User response: Check the /var/mmfs/gen/mmfs.cfg file.

6027-329 Cannot pin the main shared segment: name
Explanation: Trying to pin the shared segment during initialization.
User response: Check the mmfs.cfg file. The pagepool size may be too large. It cannot be more than 80% of real memory. If a previous mmfsd crashed, check for processes that begin with the name mmfs that may be holding on to an old pinned shared segment. Issue mmchconfig command to change the pagepool size.

6027-334 [E] Error initializing internal communications.
Explanation: The mailbox system used by the daemon for communication with the kernel cannot be initialized.
User response: Increase the size of available memory using the mmchconfig command.

6027-335 [E] Configuration error: check fileName.
Explanation: A configuration error is found.
User response: Check the mmfs.cfg file and other error messages.

6027-336 [E] Value 'value' for configuration parameter 'parameter' is not valid. Check fileName.
Explanation: A configuration error was found.
User response: Check the mmfs.cfg file.

6027-337 [N] Waiting for resources to be reclaimed before exiting.
Explanation: The mmfsd daemon is attempting to terminate, but cannot because data structures in the daemon shared segment may still be referenced by kernel code. This message may be accompanied by other messages that show which disks still have I/O in progress.
User response: None. Informational message only.

6027-338 [N] Waiting for number user(s) of shared segment to release it.
Explanation: The mmfsd daemon is attempting to terminate, but cannot because some process is holding the shared segment while in a system call. The message will repeat every 30 seconds until the count drops to zero.
User response: Find the process that is not responding, and find a way to get it out of its system call.

6027-339 [E] Nonnumeric trace value 'value' after class 'class'.
Explanation: The specified trace value is not recognized.
User response: Specify a valid trace integer value.
User response: file exists and has appropriate permissions assigned.

732 IBM Storage Scale 5.1.9: Problem Determination Guide


6027-340 Child process file failed to start due to error rc: errStr.
Explanation: A failure occurred when GPFS attempted to start a program.
User response: If the program was a user exit script, verify the script file exists and has appropriate permissions assigned. If the program was not a user exit script, then this is an internal GPFS error or the GPFS installation was altered.

6027-341 [D] Node nodeName is incompatible because its maximum compatible version (number) is less than the version of this node (number). [value/value]
Explanation: The GPFS daemon tried to make a connection with another GPFS daemon. However, the other daemon is not compatible. Its maximum compatible version is less than the version of the daemon running on this node. The numbers in square brackets are for use by the IBM Support Center.
User response: Verify your GPFS daemon version.

6027-342 [E] Node nodeName is incompatible because its minimum compatible version is greater than the version of this node (number). [value/value]
Explanation: The GPFS daemon tried to make a connection with another GPFS daemon. However, the other daemon is not compatible. Its minimum compatible version is greater than the version of the daemon running on this node. The numbers in square brackets are for use by the IBM Support Center.
User response: Verify your GPFS daemon version.

6027-343 [E] Node nodeName is incompatible because its version (number) is less than the minimum compatible version of this node (number). [value/value]
Explanation: The GPFS daemon tried to make a connection with another GPFS daemon. However, the other daemon is not compatible. Its version is less than the minimum compatible version of the daemon running on this node. The numbers in square brackets are for use by the IBM Support Center.
User response: Verify your GPFS daemon version.

6027-344 [E] Node nodeName is incompatible because its version is greater than the maximum compatible version of this node (number). [value/value]
Explanation: The GPFS daemon tried to make a connection with another GPFS daemon. However, the other daemon is not compatible. Its version is greater than the maximum compatible version of the daemon running on this node. The numbers in square brackets are for use by the IBM Support Center.
User response: Verify your GPFS daemon version.

6027-345 Network error on ipAddress, check connectivity.
Explanation: A TCP error has caused GPFS to exit due to a bad return code from an error. Exiting allows recovery to proceed on another node and resources are not tied up on this node.
User response: Follow network problem determination procedures.

6027-346 [E] Incompatible daemon version. My version = number, repl.my_version = number
Explanation: The GPFS daemon tried to make a connection with another GPFS daemon. However, the other GPFS daemon is not the same version and it sent a reply indicating its version number is incompatible.
User response: Verify your GPFS daemon version.

6027-347 [E] Remote host ipAddress refused connection because IP address ipAddress was not in the node list file
Explanation: The GPFS daemon tried to make a connection with another GPFS daemon. However, the other GPFS daemon sent a reply indicating it did not recognize the IP address of the connector.
User response: Add the IP address of the local host to the node list file on the remote host.

6027-348 [E] Bad "subnets" configuration: invalid subnet "ipAddress".
Explanation: A subnet specified by the subnets configuration parameter could not be parsed.
User response: Run the mmlsconfig command and check the value of the subnets parameter. Each subnet must be specified as a dotted-decimal IP address. Run the mmchconfig subnets command to correct the value.

Chapter 42. References 733


6027-349 [E] Bad "subnets" configuration: invalid cluster name pattern "clusterNamePattern".
Explanation: A cluster name pattern specified by the subnets configuration parameter could not be parsed.
User response: Run the mmlsconfig command and check the value of the subnets parameter. The optional cluster name pattern following subnet address must be a shell-style pattern allowing '*', '/' and '[...]' as wild cards. Run the mmchconfig subnets command to correct the value.

6027-350 [E] Bad "subnets" configuration: primary IP address ipAddress is on a private subnet. Use a public IP address instead.
Explanation: GPFS is configured to allow multiple IP addresses per node (subnets configuration parameter), but the primary IP address of the node (the one specified when the cluster was created or when the node was added to the cluster) was found to be on a private subnet. If multiple IP addresses are used, the primary address must be a public IP address.
User response: Remove the node from the cluster; then add it back using a public IP address.

6027-358 Communication with mmspsecserver through socket name failed, err value: errorString, msgType messageType.
Explanation: Communication failed between spsecClient (the daemon) and spsecServer.
User response: Verify both the communication socket and the mmspsecserver process.

6027-359 The mmspsecserver process is shutting down. Reason: explanation.
Explanation: The mmspsecserver process received a signal from the mmfsd daemon or encountered an error on execution.
User response: Verify the reason for shutdown.

6027-360 Disk name must be removed from the /etc/filesystems stanza before it can be deleted.
Explanation: A disk being deleted is found listed in the disks= list for a file system.
User response: Remove the disk from the list.

6027-361 [E] Local access to disk failed with EIO, switching to access the disk remotely.
Explanation: Local access to the disk failed. To avoid unmounting of the file system, the disk will now be accessed remotely.
User response: Wait until work continuing on the local node completes. Then determine why local access to the disk failed, correct the problem and restart the daemon. This will cause GPFS to begin accessing the disk locally again.

6027-362 Attention: No disks were deleted, but some data was migrated. The file system may no longer be properly balanced.
Explanation: The mmdeldisk command did not complete migrating data off the disks being deleted. The disks were restored to normal ready status, but the migration has left the file system unbalanced. This may be caused by having too many disks unavailable or insufficient space to migrate all of the data to other disks.
User response: Check disk availability and space requirements. Determine the reason that caused the command to end before successfully completing the migration and disk deletion. Reissue the mmdeldisk command.

6027-363 I/O error writing disk descriptor for disk name.
Explanation: An I/O error occurred when the mmadddisk command was writing a disk descriptor on a disk. This could have been caused by either a configuration error or an error in the path to the disk.
User response: Determine the reason the disk is inaccessible for writing and reissue the mmadddisk command.

6027-364 Error processing disks.
Explanation: An error occurred when the mmadddisk command was reading disks in the file system.
User response: Determine the reason why the disks are inaccessible for reading, then reissue the mmadddisk command.

734 IBM Storage Scale 5.1.9: Problem Determination Guide


6027-365 [I] Rediscovered local access to disk.
Explanation: Rediscovered local access to disk, which failed earlier with EIO. For good performance, the disk will now be accessed locally.
User response: Wait until work continuing on the local node completes. This will cause GPFS to begin accessing the disk locally again.

6027-369 I/O error writing file system descriptor for disk name.
Explanation: mmadddisk detected an I/O error while writing a file system descriptor on a disk.
User response: Determine the reason the disk is inaccessible for writing and reissue the mmadddisk command.

6027-370 mmdeldisk completed.
Explanation: The mmdeldisk command has completed.
User response: None. Informational message only.

6027-371 Cannot delete all disks in the file system
Explanation: An attempt was made to delete all the disks in a file system.
User response: Either reduce the number of disks to be deleted or use the mmdelfs command to delete the file system.

6027-372 Replacement disk must be in the same failure group as the disk being replaced.
Explanation: An improper failure group was specified for mmrpldisk.
User response: Specify a failure group in the disk descriptor for the replacement disk that is the same as the failure group of the disk being replaced.

6027-373 Disk diskName is being replaced, so status of disk diskName must be replacement.
Explanation: The mmrpldisk command failed when retrying a replace operation because the new disk does not have the correct status.
User response: Issue the mmlsdisk command to display disk status. Then either issue the mmchdisk command to change the status of the disk to replacement or specify a new disk that has a status of replacement.

6027-374 Disk name may not be replaced.
Explanation: A disk being replaced with mmrpldisk does not have a status of ready or suspended.
User response: Use the mmlsdisk command to display disk status. Issue the mmchdisk command to change the status of the disk to be replaced to either ready or suspended.

6027-375 Disk name diskName already in file system.
Explanation: The replacement disk name specified in the mmrpldisk command already exists in the file system.
User response: Specify a different disk as the replacement disk.

6027-376 Previous replace command must be completed before starting a new one.
Explanation: The mmrpldisk command failed because the status of other disks shows that a replace command did not complete.
User response: Issue the mmlsdisk command to display disk status. Retry the failed mmrpldisk command or issue the mmchdisk command to change the status of the disks that have a status of replacing or replacement.

6027-377 Cannot replace a disk that is in use.
Explanation: Attempting to replace a disk in place, but the disk specified in the mmrpldisk command is still available for use.
User response: Use the mmchdisk command to stop GPFS's use of the disk.

Chapter 42. References 735


6027-378 [I] I/O still in progress near sector number on disk diskName.
Explanation: The mmfsd daemon is attempting to terminate, but cannot because data structures in the daemon shared segment may still be referenced by kernel code. In particular, the daemon has started an I/O that has not yet completed. It is unsafe for the daemon to terminate until the I/O completes, because of asynchronous activity in the device driver that will access data structures belonging to the daemon.
User response: Either wait for the I/O operation to time out, or issue a device-dependent command to terminate the I/O.

6027-379 Could not invalidate disk(s).
Explanation: Trying to delete a disk and it could not be written to in order to invalidate its contents.
User response: No action needed if removing that disk permanently. However, if the disk is ever to be used again, the -v flag must be specified with a value of no when using either the mmcrfs or mmadddisk command.

6027-380 Disk name missing from disk descriptor list entry name.
Explanation: When parsing disk lists, no disks were named.
User response: Check the argument list of the command.

6027-382 Value value for the 'sector size' option for disk disk is not a multiple of value.
Explanation: When parsing disk lists, the sector size given is not a multiple of the default sector size.
User response: Specify a correct sector size.

6027-383 Disk name name appears more than once.
Explanation: When parsing disk lists, a duplicate name is found.
User response: Remove the duplicate name.

6027-384 Disk name name already in file system.
Explanation: When parsing disk lists, a disk name already exists in the file system.
User response: Rename or remove the duplicate disk.

6027-385 Value value for the 'sector size' option for disk name is out of range. Valid values are number through number.
Explanation: When parsing disk lists, the sector size given is not valid.
User response: Specify a correct sector size.

6027-386 Value value for the 'sector size' option for disk name is invalid.
Explanation: When parsing disk lists, the sector size given is not valid.
User response: Specify a correct sector size.

6027-387 Value value for the 'failure group' option for disk name is out of range. Valid values are number through number.
Explanation: When parsing disk lists, the failure group given is not valid.
User response: Specify a correct failure group.

6027-388 Value value for the 'failure group' option for disk name is invalid.
Explanation: When parsing disk lists, the failure group given is not valid.
User response: Specify a correct failure group.

6027-389 Value value for the 'has metadata' option for disk name is out of range. Valid values are number through number.
Explanation: When parsing disk lists, the 'has metadata' value given is not valid.
User response: Specify a correct 'has metadata' value.

6027-390 Value value for the 'has metadata' option for disk name is invalid.
Explanation: When parsing disk lists, the 'has metadata' value given is not valid.
User response: Specify a correct 'has metadata' value.

6027-391 Value value for the 'has data' option for disk name is out of range. Valid values are number through number.
Explanation: When parsing disk lists, the 'has data' value given is not valid.
User response: Specify a correct 'has data' value.

736 IBM Storage Scale 5.1.9: Problem Determination Guide


6027-392 Value value for the 'has data' option for disk name is invalid.
Explanation: When parsing disk lists, the 'has data' value given is not valid.
User response: Specify a correct 'has data' value.

6027-393 Either the 'has data' option or the 'has metadata' option must be '1' for disk diskName.
Explanation: When parsing disk lists the 'has data' or 'has metadata' value given is not valid.
User response: Specify a correct 'has data' or 'has metadata' value.

6027-394 Too many disks specified for file system. Maximum = number.
Explanation: Too many disk names were passed in the disk descriptor list.
User response: Check the disk descriptor list or the file containing the list.

6027-399 Not enough items in disk descriptor list entry, need fields.
Explanation: When parsing a disk descriptor, not enough fields were specified for one disk.
User response: Correct the disk descriptor to use the correct disk descriptor syntax.

6027-416 Incompatible file system descriptor version or not formatted.
Explanation: Possible reasons for the error are:
1. A file system descriptor version that is not valid was encountered.
2. No file system descriptor can be found.
3. Disks are not correctly defined on all active nodes.
4. Disks, logical volumes, network shared disks, or virtual shared disks were incorrectly re-configured after creating a file system.
User response: Verify:
1. The disks are correctly defined on all nodes.
2. The paths to the disks are correctly defined and operational.

6027-417 Bad file system descriptor.
Explanation: A file system descriptor that is not valid was encountered.
User response: Verify:
1. The disks are correctly defined on all nodes.
2. The paths to the disks are correctly defined and operational.

6027-418 Inconsistent file system quorum. readQuorum=value writeQuorum=value quorumSize=value.
Explanation: A file system descriptor that is not valid was encountered.
User response: Start any disks that have been stopped by the mmchdisk command or by hardware failures. If the problem persists, run offline mmfsck.

6027-419 Failed to read a file system descriptor.
Explanation: Not enough valid replicas of the file system descriptor could be read from the file system.
User response: Start any disks that have been stopped by the mmchdisk command or by hardware failures. Verify that paths to all disks are correctly defined and operational.

6027-420 Inode size must be greater than zero.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-421 Inode size must be a multiple of logical sector size.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

Chapter 42. References 737


6027-422 Inode size must be at least as large as the logical sector size.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-423 Minimum fragment size must be a multiple of logical sector size.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-424 Minimum fragment size must be greater than zero.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-425 File system block size of blockSize is larger than maxblocksize parameter.
Explanation: An attempt is being made to mount a file system whose block size is larger than the maxblocksize parameter as set by mmchconfig.
User response: Use the mmchconfig maxblocksize=xxx command to increase the maximum allowable block size.

6027-426 Warning: mount detected unavailable disks. Use mmlsdisk fileSystem to see details.
Explanation: The mount command detected that some disks needed for the file system are unavailable.
User response: Without file system replication enabled, the mount will fail. If it has replication, the mount may succeed depending on which disks are unavailable. Use mmlsdisk to see details of the disk status.

6027-427 Indirect block size must be at least as large as the minimum fragment size.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-428 Indirect block size must be a multiple of the minimum fragment size.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-429 Indirect block size must be less than full data block size.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-430 Default metadata replicas must be less than or equal to default maximum number of metadata replicas.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-431 Default data replicas must be less than or equal to default maximum number of data replicas.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-432 Default maximum metadata replicas must be less than or equal to value.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

738 IBM Storage Scale 5.1.9: Problem Determination Guide


6027-433 Default maximum data replicas must be less than or equal to value.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-434 Indirect blocks must be at least as big as inodes.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-435 [N] The file system descriptor quorum has been overridden.
Explanation: The mmfsctl exclude command was previously issued to override the file system descriptor quorum after a disaster.
User response: None. Informational message only.

6027-438 Duplicate disk name name.
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-439 Disk name sector size value does not match sector size value of other disk(s).
Explanation: An internal consistency check has found a problem with file system parameters.
User response: Record the above information. Contact the IBM Support Center.

6027-441 Unable to open disk 'name' on node nodeName.
Explanation: A disk name that is not valid was specified in a GPFS disk command.
User response: Correct the parameters of the executing GPFS disk command.

6027-445 Value for option '-m' cannot exceed the number of metadata failure groups.
Explanation: The current number of replicas of metadata cannot be larger than the number of failure groups that are enabled to hold metadata.
User response: Use a smaller value for -m on the mmchfs command, or increase the number of failure groups by adding disks to the file system.

6027-446 Value for option '-r' cannot exceed the number of data failure groups.
Explanation: The current number of replicas of data cannot be larger than the number of failure groups that are enabled to hold data.
User response: Use a smaller value for -r on the mmchfs command, or increase the number of failure groups by adding disks to the file system.

6027-451 No disks= list found in mount options.
Explanation: No 'disks=' clause found in the mount options list when opening a file system.
User response: Check the operating system's file system database and local mmsdrfs file for this file system.

6027-452 No disks found in disks= list.
Explanation: No disks listed when opening a file system.
User response: Check the operating system's file system database and local mmsdrfs file for this file system.

6027-453 No disk name found in a clause of the list.
Explanation: No disk name found in a clause of the disks= list.
User response: Check the operating system's file system database and local mmsdrfs file for this file system.

Chapter 42. References 739


6027-461 Unable to find name device.
Explanation: Self explanatory.
User response: There must be a /dev/sgname special device defined. Check the error code. This could indicate a configuration error in the specification of disks, logical volumes, network shared disks, or virtual shared disks.

6027-462 name must be a char or block special device.
Explanation: Opening a file system.
User response: There must be a /dev/sgname special device defined. This could indicate a configuration error in the specification of disks, logical volumes, network shared disks, or virtual shared disks.

6027-463 SubblocksPerFullBlock was not 32.
Explanation: The value of the SubblocksPerFullBlock variable was not 32. This situation should never exist, and indicates an internal error.
User response: Record the above information and contact the IBM Support Center.

6027-465 The average file size must be at least as large as the minimum fragment size.
Explanation: When parsing the command line of tscrfs, it was discovered that the average file size is smaller than the minimum fragment size.
User response: Correct the indicated command parameters.

6027-468 Disk name listed in fileName or local mmsdrfs file, not found in device name. Run: mmcommon recoverfs name.
Explanation: Tried to access a file system but the disks listed in the operating system's file system database or the local mmsdrfs file for the device do not exist in the file system.
User response: Check the configuration and availability of disks. Run the mmcommon recoverfs device command. If this does not resolve the problem, configuration data in the SDR may be incorrect. If no user modifications have been made to the SDR, contact the IBM Support Center. If user modifications have been made, correct these modifications.

6027-469 File system name does not match descriptor.
Explanation: The file system name found in the descriptor on disk does not match the corresponding device name in /etc/filesystems.
User response: Check the operating system's file system database.

6027-470 Disk name may still belong to file system filesystem. Created on IPandTime.
Explanation: The disk being added by the mmcrfs, mmadddisk, or mmrpldisk command appears to still belong to some file system.
User response: Verify that the disks you are adding do not belong to an active file system, and use the -v no option to bypass this check. Use this option only if you are sure that no other file system has this disk configured because you may cause data corruption in both file systems if this is not the case.

6027-471 Disk diskName: Incompatible file system descriptor version or not formatted.
Explanation: Possible reasons for the error are:
1. A file system descriptor version that is not valid was encountered.
2. No file system descriptor can be found.
3. Disks are not correctly defined on all active nodes.
4. Disks, logical volumes, network shared disks, or virtual shared disks were incorrectly reconfigured after creating a file system.
User response: Verify:
1. The disks are correctly defined on all nodes.
2. The paths to the disks are correctly defined and operative.

6027-472 [E] File system format version versionString is not supported.
Explanation: The current file system format version is not supported.
User response: Verify:
1. The disks are correctly defined on all nodes.
2. The paths to the disks are correctly defined and operative.


6027-473 [X] File System fileSystem unmounted by the system with return code value reason code value
Explanation: Console log entry caused by a forced unmount due to disk or communication failure.
User response: Correct the underlying problem and remount the file system.

6027-474 [X] Recovery Log I/O failed, unmounting file system fileSystem
Explanation: I/O to the recovery log failed.
User response: Check the paths to all disks making up the file system. Run the mmlsdisk command to determine if GPFS has declared any disks unavailable. Repair any paths to disks that have failed. Remount the file system.

6027-475 The option '--inode-limit' cannot be enabled. Use option '-V' to enable most recent features.
Explanation: mmchfs --inode-limit cannot be enabled under the current file system format version.
User response: Run mmchfs -V, this will change the file system format to the latest format supported.

6027-476 Restricted mount using only available file system descriptor.
Explanation: Fewer than the necessary number of file system descriptors were successfully read. Using the best available descriptor to allow the restricted mount to continue.
User response: Informational message only.

6027-477 The option -z cannot be enabled. Use the -V option to enable most recent features.
Explanation: The file system format version does not support the -z option on the mmchfs command.
User response: Change the file system format version by issuing mmchfs -V.

6027-478 The option -z could not be changed. fileSystem is still in use.
Explanation: The file system is still mounted or another GPFS administration command (mm…) is running against the file system.
User response: Unmount the file system if it is mounted, and wait for any command that is running to complete before reissuing the mmchfs -z command.

6027-479 [N] Mount of fsName was blocked by fileName
Explanation: The internal or external mount of the file system was blocked by the existence of the specified file.
User response: If the file system needs to be mounted, remove the specified file.

6027-480 Cannot enable DMAPI in a file system with existing snapshots.
Explanation: The user is not allowed to enable DMAPI for a file system with existing snapshots.
User response: Delete all existing snapshots in the file system and repeat the mmchfs command.

6027-481 [E] Remount failed for mountid id: errnoDescription
Explanation: mmfsd restarted and tried to remount any file systems that the VFS layer thinks are still mounted.
User response: Check the errors displayed and the errno description.

6027-482 [E] Remount failed for device name: errnoDescription
Explanation: mmfsd restarted and tried to remount any file systems that the VFS layer thinks are still mounted.
User response: Check the errors displayed and the errno description.

6027-483 [N] Remounted name
Explanation: mmfsd restarted and remounted the specified file system because it was in the kernel's list of previously mounted file systems.
User response: Informational message only.

Chapter 42. References 741
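The disk checks called for in messages 6027-474 and 6027-484 can be run from any node. A minimal sketch, assuming a file system device named gpfs1 (a placeholder name, not one taken from this guide):
  # Show the state of every disk in the file system and look for disks that are down
  mmlsdisk gpfs1
  # After the failed paths are repaired, remount the file system on all nodes
  mmmount gpfs1 -a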


6027-485 Perform mmchdisk for any disk failures and re-mount.
Explanation: Occurs in conjunction with 6027-484.
User response: Follow the User response for 6027-484.
6027-486 No local device specified for fileSystemName in clusterName.
Explanation: While attempting to mount a remote file system from another cluster, GPFS was unable to determine the local device name for this file system.
User response: There must be a /dev/sgname special device defined. Check the error code. This is probably a configuration error in the specification of a remote file system. Run mmremotefs show to check that the remote file system is properly configured.
6027-487 Failed to write the file system descriptor to disk diskName.
Explanation: An error occurred when mmfsctl include was writing a copy of the file system descriptor to one of the disks specified on the command line. This could have been caused by a failure of the corresponding disk device, or an error in the path to the disk.
User response: Verify that the disks are correctly defined on all nodes. Verify that paths to all disks are correctly defined and operational.
6027-488 Error opening the exclusion disk file fileName.
Explanation: Unable to retrieve the list of excluded disks from an internal configuration file.
User response: Ensure that GPFS executable files have been properly installed on all nodes. Perform required configuration steps prior to starting GPFS.
6027-489 Attention: The desired replication factor exceeds the number of available dataOrMetadata failure groups. This is allowed, but the files will not be replicated and will therefore be at risk.
Explanation: You specified a number of replicas that exceeds the number of failure groups available.
User response: Reissue the command with a smaller replication factor, or increase the number of failure groups.
6027-490 [N] The descriptor replica on disk diskName has been excluded.
Explanation: The file system descriptor quorum has been overridden and, as a result, the specified disk was excluded from all operations on the file system descriptor quorum.
User response: None. Informational message only.
6027-491 Incompatible file system format. Only file systems formatted with GPFS 3.2 or later can be mounted on this platform.
Explanation: A user running GPFS on Microsoft Windows tried to mount a file system that was formatted with a version of GPFS that did not have Windows support.
User response: Create a new file system using current GPFS code.
6027-492 The file system is already at file system version number
Explanation: The user tried to upgrade the file system format using mmchfs -V --version=v, but the specified version is smaller than the current version of the file system.
User response: Specify a different value for the --version option.
6027-493 File system version number is not supported on nodeName nodes in the cluster.
Explanation: The user tried to upgrade the file system format using mmchfs -V, but some nodes in the local cluster are still running an older GPFS release that does not support the new format version.
User response: Install a newer version of GPFS on those nodes.
6027-494 File system version number is not supported on the following nodeName remote nodes mounting the file system:
Explanation: The user tried to upgrade the file system format using mmchfs -V, but the file system is still mounted on some nodes in remote clusters that do not support the new format version.
User response: Unmount the file system on the nodes that do not support the new format version.
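Before upgrading the file system format as described for messages 6027-493 and 6027-494, it can help to confirm the current format version and to see which nodes, including remote ones, still have the file system mounted. A hedged sketch; gpfs1 is a placeholder device name and the flags should be checked against your release:
  # Display the current file system format version
  mmlsfs gpfs1 -V
  # List every local and remote node that currently has the file system mounted
  mmlsmount gpfs1 -L
  # Upgrade the format only after the unsupported nodes have unmounted or been upgraded
  mmchfs gpfs1 -V full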


6027-495 You have requested that the file system be upgraded to version number. This will enable new functionality but will prevent you from using the file system with earlier releases of GPFS. Do you want to continue?
Explanation: Verification request in response to the mmchfs -V full command. This is a request to upgrade the file system and activate functions that are incompatible with a previous release of GPFS.
User response: Enter yes if you want the conversion to take place.
6027-496 You have requested that the file system version for local access be upgraded to version number. This will enable some new functionality but will prevent local nodes from using the file system with earlier releases of GPFS. Remote nodes are not affected by this change. Do you want to continue?
Explanation: Verification request in response to the mmchfs -V command. This is a request to upgrade the file system and activate functions that are incompatible with a previous release of GPFS.
User response: Enter yes if you want the conversion to take place.
6027-497 The file system has already been upgraded to number using -V full. It is not possible to revert back.
Explanation: The user tried to upgrade the file system format using mmchfs -V compat, but the file system has already been fully upgraded.
User response: Informational message only.
6027-498 Incompatible file system format. Only file systems formatted with GPFS 3.2.1.5 or later can be mounted on this platform.
Explanation: A user running GPFS on Microsoft Windows tried to mount a file system that was formatted with a version of GPFS that did not have Windows support.
User response: Create a new file system using current GPFS code.
6027-499 [X] An unexpected Device Mapper path dmDevice (nsdId) has been detected. The new path does not have a Persistent Reserve set up. File system fileSystem will be internally unmounted.
Explanation: A new device mapper path is detected or a previously failed path is activated after the local device discovery has finished. This path lacks a Persistent Reserve, and cannot be used. All device paths must be active at mount time.
User response: Check the paths to all disks making up the file system. Repair any paths to disks which have failed. Remount the file system.
6027-500 name loaded and configured.
Explanation: The kernel extension was loaded and configured.
User response: None. Informational message only.
6027-501 name:module moduleName unloaded.
Explanation: The kernel extension was unloaded.
User response: None. Informational message only.
6027-502 Incorrect parameter: name.
Explanation: mmfsmnthelp was called with an incorrect parameter.
User response: Contact the IBM Support Center.
6027-504 Not enough memory to allocate internal data structure.
Explanation: Self explanatory.
User response: Increase ulimit or paging space.
6027-505 Internal error, aborting.
Explanation: Self explanatory.
User response: Contact the IBM Support Center.


6027-506 program: loadFile is already loaded at address.
Explanation: The program was already loaded at the address displayed.
User response: None. Informational message only.
6027-507 program: loadFile is not loaded.
Explanation: The program could not be loaded.
User response: None. Informational message only.
6027-510 Cannot mount fileSystem on mountPoint: errorString
Explanation: There was an error mounting the GPFS file system.
User response: Determine action indicated by the error messages and error log entries. Errors in the disk path often cause this problem.
6027-511 Cannot unmount fileSystem: errorDescription
Explanation: There was an error unmounting the GPFS file system.
User response: Take the action indicated by errno description.
6027-512 name not listed in /etc/vfs
Explanation: Error occurred while installing the GPFS kernel extension, or when trying to mount a file system.
User response: Check for the mmfs entry in /etc/vfs
6027-514 Cannot mount fileSystem on mountPoint: Already mounted.
Explanation: An attempt has been made to mount a file system that is already mounted.
User response: None. Informational message only.
6027-515 Cannot mount fileSystem on mountPoint
Explanation: There was an error mounting the named GPFS file system. Errors in the disk path usually cause this problem.
User response: Take the action indicated by other error messages and error log entries.
6027-516 Cannot mount fileSystem
Explanation: There was an error mounting the named GPFS file system. Errors in the disk path usually cause this problem.
User response: Take the action indicated by other error messages and error log entries.
6027-517 Cannot mount fileSystem: errorString
Explanation: There was an error mounting the named GPFS file system. Errors in the disk path usually cause this problem.
User response: Take the action indicated by other error messages and error log entries.
6027-518 Cannot mount fileSystem: Already mounted.
Explanation: An attempt has been made to mount a file system that is already mounted.
User response: None. Informational message only.
6027-519 Cannot mount fileSystem on mountPoint: File system table full.
Explanation: An attempt has been made to mount a file system when the file system table is full.
User response: None. Informational message only.
6027-520 Cannot mount fileSystem: File system table full.
Explanation: An attempt has been made to mount a file system when the file system table is full.
User response: None. Informational message only.
6027-530 Mount of name failed: cannot mount restorable file system for read/write.
Explanation: A file system marked as enabled for restore cannot be mounted read/write.
User response: None. Informational message only.


6027-531 The following disks of name will be formatted on node nodeName: list.
Explanation: Output showing which disks will be formatted by the mmcrfs command.
User response: None. Informational message only.
6027-532 [E] The quota record recordNumber in file fileName is not valid.
Explanation: A quota entry contained a checksum that is not valid.
User response: Remount the file system with quotas disabled. Restore the quota file from back up, and run mmcheckquota.
6027-533 [W] Inode space inodeSpace in file system fileSystem is approaching the limit for the maximum number of inodes.
Explanation: The number of files created is approaching the file system limit.
User response: Use the mmchfileset command to increase the maximum number of files to avoid reaching the inode limit and possible performance degradation.
6027-534 Cannot create a snapshot in a DMAPI-enabled file system, rc=returnCode.
Explanation: You cannot create a snapshot in a DMAPI-enabled file system.
User response: Use the mmchfs command to disable DMAPI, and reissue the command.
6027-535 Disks up to size size can be added to storage pool pool.
Explanation: Based on the parameters given to mmcrfs and the size and number of disks being formatted, GPFS has formatted its allocation maps to allow disks up to the given size to be added to this storage pool by the mmadddisk command.
User response: None. Informational message only. If the reported maximum disk size is smaller than necessary, delete the file system with mmdelfs and rerun mmcrfs with either larger disks or a larger value for the -n parameter.
6027-536 Insufficient system memory to run GPFS daemon. Reduce page pool memory size with the mmchconfig command or add additional RAM to system.
Explanation: Insufficient memory for GPFS internal data structures with current system and GPFS configuration.
User response: Reduce page pool usage with the mmchconfig command, or add additional RAM to system.
6027-537 Disks up to size size can be added to this file system.
Explanation: Based on the parameters given to the mmcrfs command and the size and number of disks being formatted, GPFS has formatted its allocation maps to allow disks up to the given size to be added to this file system by the mmadddisk command.
User response: None, informational message only. If the reported maximum disk size is smaller than necessary, delete the file system with mmdelfs and reissue the mmcrfs command with larger disks or a larger value for the -n parameter.
6027-538 Error accessing disks.
Explanation: The mmcrfs command encountered an error accessing one or more of the disks.
User response: Verify that the disk descriptors are coded correctly and that all named disks exist and are online.
6027-539 Unable to clear descriptor areas for fileSystem.
Explanation: The mmdelfs command encountered an error while invalidating the file system control structures on one or more disks in the file system being deleted.
User response: If the problem persists, specify the -p option on the mmdelfs command.
6027-540 Formatting file system.
Explanation: The mmcrfs command began to write file system data structures onto the new disks.
User response: None. Informational message only.
6027-541 Error formatting file system.
Explanation: mmcrfs command encountered an error while formatting a new file system. This is often an I/O error.
User response: Check the subsystems in the path to the disk. Follow the instructions from other messages that appear with this one.
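The page pool shortage behind message 6027-536 is a cluster configuration setting rather than a file system attribute. The following is only a sketch, assuming the attribute can be queried by name on your release; the 1G value is an arbitrary example, not a recommendation from this guide:
  # Show the currently configured page pool size
  mmlsconfig pagepool
  # Lower (or raise) the page pool; the new value is normally picked up when the GPFS daemon is restarted
  mmchconfig pagepool=1G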


6027-542 [N] Fileset in file system fileSystem:filesetName (id filesetId) has been incompletely deleted.
Explanation: A fileset delete operation was interrupted, leaving this fileset in an incomplete state.
User response: Reissue the fileset delete command.
6027-543 Error writing file system descriptor for fileSystem.
Explanation: The mmcrfs command could not successfully write the file system descriptor in a particular file system. Check the subsystems in the path to the disk. This is often an I/O error.
User response: Check system error log, rerun mmcrfs.
6027-544 Could not invalidate disk of fileSystem.
Explanation: A disk could not be written to invalidate its contents. Check the subsystems in the path to the disk. This is often an I/O error.
User response: Ensure the indicated logical volume is writable.
6027-545 Error processing fileset metadata file.
Explanation: There is no I/O path to critical metadata or metadata has been corrupted.
User response: Verify that the I/O paths to all disks are valid and that all disks are either in the 'recovering' or 'up' availability states. If all disks are available and the problem persists, issue the mmfsck command to repair damaged metadata.
6027-546 Error processing allocation map for storage pool poolName.
Explanation: There is no I/O path to critical metadata, or metadata has been corrupted.
User response: Verify that the I/O paths to all disks are valid, and that all disks are either in the 'recovering' or 'up' availability state. Issue the mmlsdisk command.
6027-547 Fileset filesetName was unlinked.
Explanation: Fileset was already unlinked.
User response: None. Informational message only.
6027-548 Fileset filesetName unlinked from filesetName.
Explanation: A fileset being deleted contains junctions to other filesets. The cited filesets were unlinked.
User response: None. Informational message only.
6027-549 [E] Failed to open name.
Explanation: The mount command was unable to access a file system. Check the subsystems in the path to the disk. This is often an I/O error.
User response: Follow the suggested actions for the other messages that occur with this one.
6027-550 [X] Allocation manager for fileSystem failed to revoke ownership from node nodeName.
Explanation: An irrecoverable error occurred trying to revoke ownership of an allocation region. The allocation manager has panicked the file system to prevent corruption of on-disk data.
User response: Remount the file system.
6027-551 fileSystem is still in use.
Explanation: The mmdelfs or mmcrfs command found that the named file system is still mounted or that another GPFS command is running against the file system.
User response: Unmount the file system if it is mounted, or wait for GPFS commands in progress to terminate before retrying the command.
6027-552 Scan completed successfully.
Explanation: The scan function has completed without error.
User response: None. Informational message only.
6027-553 Scan failed on number user or system files.
Explanation: Data may be lost as a result of pointers that are not valid or unavailable disks.
User response: Some files may have to be restored from backup copies. Issue the mmlsdisk command to check the availability of all the disks that make up the file system.
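Messages 6027-545, 6027-546, and 6027-553 all point at the same sequence: confirm disk availability first, and fall back to an offline check of the file system if the metadata errors persist. A minimal sketch with gpfs1 as a placeholder device:
  # Confirm that every disk is in the up or recovering state
  mmlsdisk gpfs1
  # If the errors persist, unmount the file system everywhere and check it offline
  mmumount gpfs1 -a
  mmfsck gpfs1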


6027-554 Scan failed on number out of number user or system files.
Explanation: Data may be lost as a result of pointers that are not valid or unavailable disks.
User response: Some files may have to be restored from backup copies. Issue the mmlsdisk command to check the availability of all the disks that make up the file system.
6027-555 The desired replication factor exceeds the number of available failure groups.
Explanation: You have specified a number of replicas that exceeds the number of failure groups available.
User response: Reissue the command with a smaller replication factor or increase the number of failure groups.
6027-556 Not enough space for the desired number of replicas.
Explanation: In attempting to restore the correct replication, GPFS ran out of space in the file system. The operation can continue but some data is not fully replicated.
User response: Make additional space available and reissue the command.
6027-557 Not enough space or available disks to properly balance the file.
Explanation: In attempting to stripe data within the file system, data was placed on a disk other than the desired one. This is normally not a problem.
User response: Run mmrestripefs to rebalance all files.
6027-558 Some data are unavailable.
Explanation: An I/O error has occurred or some disks are in the stopped state.
User response: Check the availability of all disks by issuing the mmlsdisk command and check the path to all disks. Reissue the command.
6027-559 Some data could not be read or written.
Explanation: An I/O error has occurred or some disks are in the stopped state.
User response: Check the availability of all disks and the path to all disks, and reissue the command.
6027-560 File system is already suspended.
Explanation: The tsfsctl command was asked to suspend a suspended file system.
User response: None. Informational message only.
6027-561 Error migrating log.
Explanation: There are insufficient available disks to continue operation.
User response: Restore the unavailable disks and reissue the command.
6027-562 Error processing inodes.
Explanation: There is no I/O path to critical metadata or metadata has been corrupted.
User response: Verify that the I/O paths to all disks are valid and that all disks are either in the recovering or up availability state. Issue the mmlsdisk command.
6027-563 File system is already running.
Explanation: The tsfsctl command was asked to resume a file system that is already running.
User response: None. Informational message only.
6027-564 Error processing inode allocation map.
Explanation: There is no I/O path to critical metadata or metadata has been corrupted.
User response: Verify that the I/O paths to all disks are valid and that all disks are either in the recovering or up availability state. Issue the mmlsdisk command.
6027-565 Scanning user file metadata ...
Explanation: Progress information.
User response: None. Informational message only.
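Messages 6027-556 and 6027-557, and the replication warnings that follow (6027-577, 6027-578), are cleared by restriping once space or disks are back. A sketch; gpfs1 is a placeholder, and the -r and -b options (restore replication, rebalance) should be confirmed against the mmrestripefs documentation for your release:
  # Restore correct replication for files that were left under-replicated
  mmrestripefs gpfs1 -r
  # Or rebalance all files across the currently available disks
  mmrestripefs gpfs1 -b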


6027-566 Error processing user file metadata.
Explanation: Error encountered while processing user file metadata.
User response: None. Informational message only.
6027-567 Waiting for pending file system scan to finish ...
Explanation: Progress information.
User response: None. Informational message only.
6027-568 Waiting for number pending file system scans to finish ...
Explanation: Progress information.
User response: None. Informational message only.
6027-569 Incompatible parameters. Unable to allocate space for file system metadata. Change one or more of the following as suggested and try again:
Explanation: Incompatible file system parameters were detected.
User response: Refer to the details given and correct the file system parameters.
6027-570 Incompatible parameters. Unable to create file system. Change one or more of the following as suggested and try again:
Explanation: Incompatible file system parameters were detected.
User response: Refer to the details given and correct the file system parameters.
6027-571 Logical sector size value must be the same as disk sector size.
Explanation: This message is produced by the mmcrfs command if the sector size given by the -l option is not the same as the sector size given for disks in the -d option.
User response: Correct the options and reissue the command.
6027-572 Completed creation of file system fileSystem.
Explanation: The mmcrfs command has successfully completed.
User response: None. Informational message only.
6027-573 All data on the following disks of fileSystem will be destroyed:
Explanation: Produced by the mmdelfs command to list the disks in the file system that is about to be destroyed. Data stored on the disks will be lost.
User response: None. Informational message only.
6027-574 Completed deletion of file system fileSystem.
Explanation: The mmdelfs command has successfully completed.
User response: None. Informational message only.
6027-575 Unable to complete low level format for fileSystem. Failed with error errorCode
Explanation: The mmcrfs command was unable to create the low level file structures for the file system.
User response: Check other error messages and the error log. This is usually an error accessing disks.
6027-576 Storage pools have not been enabled for file system fileSystem.
Explanation: User invoked a command with a storage pool option (-p or -P) before storage pools were enabled.
User response: Enable storage pools with the mmchfs -V command, or correct the command invocation and reissue the command.
6027-577 Attention: number user or system files are not properly replicated.
Explanation: GPFS has detected files that are not replicated correctly due to a previous failure.
User response: Issue the mmrestripefs command at the first opportunity.


6027-578 Attention: number out of number user or system files are not properly replicated:
Explanation: GPFS has detected files that are not replicated correctly.
6027-579 Some unreplicated file system metadata has been lost. File system usable only in restricted mode.
Explanation: A disk was deleted that contained vital file system metadata that was not replicated.
User response: Mount the file system in restricted mode (-o rs) and copy any user data that may be left on the file system. Then delete the file system.
6027-580 Unable to access vital system metadata. Too many disks are unavailable.
Explanation: Metadata is unavailable because the disks on which the data reside are stopped, or an attempt was made to delete them.
User response: Either start the stopped disks, try to delete the disks again, or recreate the file system.
6027-581 Unable to access vital system metadata, file system corrupted.
Explanation: When trying to access the file system, the metadata was unavailable due to a disk being deleted.
User response: Determine why a disk is unavailable.
6027-582 Some data has been lost.
Explanation: An I/O error has occurred or some disks are in the stopped state.
User response: Check the availability of all disks by issuing the mmlsdisk command and check the path to all disks. Reissue the command.
6027-584 Incompatible parameters. Unable to allocate space for root directory. Change one or more of the following as suggested and try again:
Explanation: Inconsistent parameters have been passed to the mmcrfs command, which would result in the creation of an inconsistent file system. Suggested parameter changes are given.
User response: Reissue the mmcrfs command with the suggested parameter changes.
6027-585 Incompatible parameters. Unable to allocate space for ACL data. Change one or more of the following as suggested and try again:
Explanation: Inconsistent parameters have been passed to the mmcrfs command, which would result in the creation of an inconsistent file system. The parameters entered require more space than is available. Suggested parameter changes are given.
User response: Reissue the mmcrfs command with the suggested parameter changes.
6027-586 Quota server initialization failed.
Explanation: Quota server initialization has failed. This message may appear as part of the detail data in the quota error log.
User response: Check status and availability of the disks. If quota files have been corrupted, restore them from the last available backup. Finally, reissue the command.
6027-587 Unable to initialize quota client because there is no quota server. Please check error log on the file system manager node. The mmcheckquota command must be run with the file system unmounted before retrying the command.
Explanation: startQuotaClient failed.
User response: If the quota file could not be read (check the error log on the file system manager; issue the mmlsmgr command to determine which node is the file system manager), then the mmcheckquota command must be run with the file system unmounted.
6027-588 No more than number nodes can mount a file system.
Explanation: The limit of the number of nodes that can mount a file system was exceeded.
User response: Observe the stated limit for how many nodes can mount a file system.
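When message 6027-579 leaves a file system usable only in restricted mode, the salvage sequence it describes can look like the following sketch. The device name gpfs1, the mount point /gpfs/gpfs1, and the target directory are all placeholders:
  # Mount the damaged file system in restricted mode
  mmmount gpfs1 -o rs
  # Copy whatever user data is still readable to safe storage before deleting the file system
  cp -a /gpfs/gpfs1/projects /safe/location/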


6027-589 Scanning file system metadata, phase number ...
Explanation: Progress information.
User response: None. Informational message only.
6027-590 [W] GPFS is experiencing a shortage of pagepool. This message will not be repeated for at least one hour.
Explanation: Pool starvation occurs; buffers have to be continually stolen at high aggressiveness levels.
User response: Issue the mmchconfig command to increase the size of pagepool.
6027-591 Unable to allocate sufficient inodes for file system metadata. Increase the value for option and try again.
Explanation: Too few inodes have been specified on the -N option of the mmcrfs command.
User response: Increase the size of the -N option and reissue the mmcrfs command.
6027-592 Mount of fileSystem is waiting for the mount disposition to be set by some data management application.
Explanation: Data management utilizing DMAPI is enabled for the file system, but no data management application has set a disposition for the mount event.
User response: Start the data management application and verify that the application sets the mount disposition.
6027-593 [E] The root quota entry is not found in its assigned record
Explanation: On mount, the root entry is not found in the first record of the quota file.
User response: Issue the mmcheckquota command to verify that the use of root has not been lost.
6027-594 Disk diskName cannot be added to storage pool poolName. Allocation map cannot accommodate disks larger than size MB.
Explanation: The specified disk is too large compared to the disks that were initially used to create the storage pool.
User response: Specify a smaller disk or add the disk to a new storage pool.
6027-595 [E] While creating quota files, file fileName, with no valid quota information was found in the root directory. Remove files with reserved quota file names (for example, user.quota) without valid quota information from the root directory by: - mounting the file system without quotas, - removing the files, and - remounting the file system with quotas to recreate new quota files. To use quota file names other than the reserved names, use the mmcheckquota command.
Explanation: While mounting a file system, the state of the file system descriptor indicates that quota files do not exist. However, files that do not contain quota information but have one of the reserved names: user.quota, group.quota, or fileset.quota exist in the root directory.
User response
To mount the file system so that new quota files will be created, perform these steps:
1. Mount the file system without quotas.
2. Verify that there are no files in the root directory with the reserved names: user.quota, group.quota, or fileset.quota.
3. Remount the file system with quotas. To mount the file system with other files used as quota files, issue the mmcheckquota command.
6027-596 [I] While creating quota files, file fileName containing quota information was found in the root directory. This file will be used as quotaType quota file.
Explanation: While mounting a file system, the state of the file system descriptor indicates that quota files do not exist. However, files that have one of the reserved names user.quota, group.quota, or fileset.quota and contain quota information, exist in the root directory. The file with the reserved name will be used as the quota file.
User response: None. Informational message.
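For the quota messages above (6027-593, 6027-595, and 6027-596), the consistency check itself is a single command. A sketch with gpfs1 as a placeholder device:
  # Verify and repair the quota information against the files actually on disk
  mmcheckquota gpfs1
  # Report the resulting user, group, and fileset quota usage
  mmrepquota gpfs1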


6027-597 [E] The quota command was requested to process quotas for a type (user, group, or fileset), which is not enabled.
Explanation: A quota command was requested to process quotas for a user, group, or fileset quota type, which is not enabled.
User response: Verify that the user, group, or fileset quota type is enabled and reissue the command.
6027-598 [E] The supplied file does not contain quota information.
Explanation: A file supplied as a quota file does not contain quota information.
User response
Change the file so it contains valid quota information and reissue the command.
To mount the file system so that new quota files are created:
1. Mount the file system without quotas.
2. Verify there are no files in the root directory with the reserved user.quota or group.quota name.
3. Remount the file system with quotas.
6027-599 [E] File supplied to the command does not exist in the root directory.
Explanation: The user-supplied name of a new quota file has not been found.
User response: Ensure that a file with the supplied name exists. Then reissue the command.
6027-600 On node nodeName an earlier error may have caused some file system data to be inaccessible at this time. Check error log for additional information. After correcting the problem, the file system can be mounted again to restore normal data access.
Explanation: An earlier error may have caused some file system data to be inaccessible at this time.
User response: Check the error log for additional information. After correcting the problem, the file system can be mounted again.
6027-601 Error changing pool size.
Explanation: The mmchconfig command failed to change the pool size to the requested value.
User response: Follow the suggested actions in the other messages that occur with this one.
6027-602 ERROR: file system not mounted. Mount file system fileSystem and retry command.
Explanation: A GPFS command that requires the file system be mounted was issued.
User response: Mount the file system and reissue the command.
6027-603 Current pool size: valueK = valueM, max block size: valueK = valueM.
Explanation: Displays the current pool size.
User response: None. Informational message only.
6027-604 [E] Parameter incompatibility. File system block size is larger than maxblocksize parameter.
Explanation: An attempt is being made to mount a file system whose block size is larger than the maxblocksize parameter as set by mmchconfig.
User response: Use the mmchconfig maxblocksize=xxx command to increase the maximum allowable block size.
6027-605 [N] File system has been renamed.
Explanation: Self-explanatory.
User response: None. Informational message only.
6027-606 [E] The node number nodeNumber is not defined in the node list
Explanation: A node matching nodeNumber was not found in the GPFS configuration file.
User response: Perform required configuration steps prior to starting GPFS on the node.
6027-607 mmcommon getEFOptions fileSystem failed. Return code value.
Explanation: The mmcommon getEFOptions command failed while looking up the names of the disks in a file system. This error usually occurs during mount processing.
User response: Check the preceding messages. A frequent cause for such errors is lack of space in /var.
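For message 6027-604, the limit is the cluster-wide maxblocksize attribute rather than anything in the file system itself. A sketch; the 16M value is only an example and must be at least as large as the block size of the file system you are trying to mount:
  # Show the current maximum block size accepted by this cluster
  mmlsconfig maxblocksize
  # Raise the limit so that the file system with the larger block size can be mounted
  mmchconfig maxblocksize=16M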


6027-608 [E] File system manager takeover failed.
Explanation: An attempt to takeover as file system manager failed. The file system is unmounted to allow another node to try.
User response: Check the return code. This is usually due to network or disk connectivity problems. Issue the mmlsdisk command to determine if the paths to the disk are unavailable, and issue the mmchdisk if necessary.
6027-609 File system fileSystem unmounted because it does not have a manager.
Explanation: The file system had to be unmounted because a file system manager could not be assigned. An accompanying message tells which node was the last manager.
User response: Examine error log on the last file system manager. Issue the mmlsdisk command to determine if a number of disks are down. Examine the other error logs for an indication of network, disk, or virtual shared disk problems. Repair the base problem and issue the mmchdisk command if required.
6027-610 Cannot mount file system fileSystem because it does not have a manager.
Explanation: The file system had to be unmounted because a file system manager could not be assigned. An accompanying message tells which node was the last manager.
User response: Examine error log on the last file system manager node. Issue the mmlsdisk command to determine if a number of disks are down. Examine the other error logs for an indication of disk or network shared disk problems. Repair the base problem and issue the mmchdisk command if required.
6027-611 [I] Recovery: fileSystem, delay number sec. for safe recovery.
Explanation: Informational. When disk leasing is in use, wait for the existing lease to expire before performing log and token manager recovery.
User response: None.
6027-612 Unable to run command while the file system is suspended.
Explanation: A command that can alter data in a file system was issued while the file system was suspended.
User response: Resume the file system and reissue the command.
6027-613 [N] Expel node request from node. Expelling: node
Explanation: One node is asking to have another node expelled from the cluster, usually because they have communications problems between them. The cluster manager node will decide which one will be expelled.
User response: Check that the communications paths are available between the two nodes.
6027-614 Value value for option name is out of range. Valid values are number through number.
Explanation: The value for an option in the command line arguments is out of range.
User response: Correct the command line and reissue the command.
6027-615 mmcommon getContactNodes clusterName failed. Return code value.
Explanation: mmcommon getContactNodes failed while looking up contact nodes for a remote cluster, usually while attempting to mount a file system from a remote cluster.
User response: Check the preceding messages, and consult the earlier chapters of this document. A frequent cause for such errors is lack of space in /var.
6027-616 [X] Duplicate address ipAddress in node list
Explanation: The IP address appears more than once in the node list file.
User response: Check the node list shown by the mmlscluster command.


6027-617 [I] Recovered number nodes for cluster clusterName.
Explanation: The asynchronous part (phase 2) of node failure recovery has completed.
User response: None. Informational message only.
6027-618 [X] Local host not found in node list (local ip interfaces: interfaceList)
Explanation: The local host specified in the node list file could not be found.
User response: Check the node list shown by the mmlscluster command.
6027-619 Negative grace times are not allowed.
Explanation: The mmedquota command received a negative value for the -t option.
User response: Reissue the mmedquota command with a nonnegative value for grace time.
6027-620 Hard quota limit must not be less than soft limit.
Explanation: The hard quota limit must be greater than or equal to the soft quota limit.
User response: Reissue the mmedquota command and enter valid values when editing the information.
6027-621 Negative quota limits are not allowed.
Explanation: The quota value must be positive.
User response: Reissue the mmedquota command and enter valid values when editing the information.
6027-622 [E] Failed to join remote cluster clusterName
Explanation: The node was not able to establish communication with another cluster, usually while attempting to mount a file system from a remote cluster.
User response: Check other console messages for additional information. Verify that contact nodes for the remote cluster are set correctly. Run mmremotefs show and mmremotecluster show to display information about the remote cluster.
6027-623 All disks up and ready
Explanation: Self-explanatory.
User response: None. Informational message only.
6027-624 No disks
Explanation: Self-explanatory.
User response: None. Informational message only.
6027-625 File system manager takeover already pending.
Explanation: A request to migrate the file system manager failed because a previous migrate request has not yet completed.
User response: None. Informational message only.
6027-626 Migrate to node nodeName already pending.
Explanation: A request to migrate the file system manager failed because a previous migrate request has not yet completed.
User response: None. Informational message only.
6027-627 Node nodeName is already manager for fileSystem.
Explanation: A request has been made to change the file system manager node to the node that is already the manager.
User response: None. Informational message only.
6027-628 Sending migrate request to current manager node nodeName.
Explanation: A request has been made to change the file system manager node.
User response: None. Informational message only.
6027-629 [N] Node nodeName resigned as manager for fileSystem.
Explanation: Progress report produced by the mmchmgr command.
User response: None. Informational message only.
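Messages 6027-619 through 6027-621 all come from values entered through mmedquota. A sketch of the two editing modes involved; the user name jdoe is a placeholder:
  # Edit the soft and hard limits for one user; the hard limit must not be below the soft limit
  mmedquota -u jdoe
  # Edit the grace periods that apply to user quotas; negative grace times are rejected
  mmedquota -t -u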


6027-630 [N] Node nodeName appointed as manager for fileSystem.
Explanation: The mmchmgr command successfully changed the node designated as the file system manager.
User response: None. Informational message only.
6027-631 Failed to appoint node nodeName as manager for fileSystem.
Explanation: A request to change the file system manager node has failed.
User response: Accompanying messages will describe the reason for the failure. Also, see the mmfs.log file on the target node.
6027-632 Failed to appoint new manager for fileSystem.
Explanation: An attempt to change the file system manager node has failed.
User response: Accompanying messages will describe the reason for the failure. Also, see the mmfs.log file on the target node.
6027-633 The best choice node nodeName is already the manager for fileSystem.
Explanation: Informational message about the progress and outcome of a migrate request.
User response: None. Informational message only.
6027-634 Node name or number node is not valid.
Explanation: A node number, IP address, or host name that is not valid has been entered in the configuration file or as input for a command.
User response: Validate your configuration information and the condition of your network. This message may result from an inability to translate a node name.
6027-635 [E] The current file system manager failed and no new manager will be appointed.
Explanation: The file system manager node could not be replaced. This is usually caused by other system errors, such as disk or communication errors.
User response: See accompanying messages for the base failure.
6027-636 [E] Disk marked as stopped or offline.
Explanation: A disk continues to be marked down due to a previous error and was not opened again.
User response: Check the disk status by issuing the mmlsdisk command, then issue the mmchdisk start command to restart the disk.
6027-637 [E] RVSD is not active.
Explanation: The RVSD subsystem needs to be activated.
User response: See the appropriate IBM Reliable Scalable Cluster Technology (RSCT) documentation and search on diagnosing IBM Virtual Shared Disk problems.
6027-638 [E] File system fileSystem unmounted by node nodeName
Explanation: Produced in the console log on a forced unmount of the file system caused by disk or communication failures.
User response: Check the error log on the indicated node. Correct the underlying problem and remount the file system.
6027-639 [E] File system cannot be mounted in restricted mode and ro or rw concurrently
Explanation: There has been an attempt to concurrently mount a file system on separate nodes in both a normal mode and in 'restricted' mode.
User response: Decide which mount mode you want to use, and use that mount mode on both nodes.
6027-640 [E] File system is mounted
Explanation: A command has been issued that requires that the file system be unmounted.
User response: Unmount the file system and reissue the command.
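The manager-migration messages above (6027-625 through 6027-635) are all reported while mmchmgr moves the file system manager role. A sketch with placeholder names (gpfs1 and node c12n03):
  # Show which node currently manages each file system
  mmlsmgr
  # Ask the cluster to move the manager role for gpfs1 to a specific node
  mmchmgr gpfs1 c12n03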


6027-641 [E] Unable to access vital system metadata. Too many disks are unavailable or the file system is corrupted.
Explanation
An attempt has been made to access a file system, but the metadata is unavailable. This can be caused by:
1. The disks on which the metadata resides are either stopped or there was an unsuccessful attempt to delete them.
2. The file system is corrupted.
User response
To access the file system:
1. If the disks are the problem, either start the stopped disks or try to delete them.
2. If the file system has been corrupted, you will have to recreate it from backup medium.
6027-642 [N] File system has been deleted.
Explanation: Self-explanatory.
User response: None. Informational message only.
6027-643 [I] Node nodeName completed take over for fileSystem.
Explanation: The mmchmgr command completed successfully.
User response: None. Informational message only.
6027-644 The previous error was detected on node nodeName.
Explanation: An unacceptable error was detected. This usually occurs when attempting to retrieve file system information from the operating system's file system database or the cached GPFS system control data. The message identifies the node where the error was encountered.
User response: See accompanying messages for the base failure. A common cause for such errors is lack of space in /var.
6027-645 Attention: mmcommon getEFOptions fileSystem failed. Checking fileName.
Explanation: The names of the disks in a file system were not found in the cached GPFS system data, therefore an attempt will be made to get the information from the operating system's file system database.
User response: If the command fails, see "File system fails to mount" on page 377. A common cause for such errors is lack of space in /var.
6027-646 [E] File system unmounted due to loss of cluster membership.
Explanation: Quorum was lost, causing file systems to be unmounted.
User response: Get enough nodes running the GPFS daemon to form a quorum.
6027-647 [E] File fileName could not be run with err errno.
Explanation: The specified shell script could not be run. This message is followed by the error string that is returned by the exec.
User response: Check file existence and access permissions.
6027-648 EDITOR environment variable must be full pathname.
Explanation: The value of the EDITOR environment variable is not an absolute path name.
User response: Change the value of the EDITOR environment variable to an absolute path name.
6027-649 Error reading the mmpmon command file.
Explanation: An error occurred when reading the mmpmon command file.
User response: Check file existence and access permissions.
6027-650 [X] The mmfs daemon is shutting down abnormally.
Explanation: The GPFS daemon is shutting down as a result of an irrecoverable condition, typically a resource shortage.
User response: Review error log entries, correct a resource shortage condition, and restart the GPFS daemon.
6027-660 Error displaying message from mmfsd.
Explanation: GPFS could not properly display an output string sent from the mmfsd daemon due to some error. A description of the error follows.
User response: Check that GPFS is properly installed.
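Message 6027-648 is typically seen before a command that opens an editor, such as mmedquota. Setting the variable to a full path clears it; /usr/bin/vi and the user name jdoe are placeholders:
  # EDITOR must hold an absolute path name
  export EDITOR=/usr/bin/vi
  mmedquota -u jdoe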


6027-661 mmfsd waiting for primary node nodeName.
Explanation: The mmfsd server has to wait during start up because mmfsd on the primary node is not yet ready.
User response: None. Informational message only.
6027-662 mmfsd timed out waiting for primary node nodeName.
Explanation: The mmfsd server is about to terminate.
User response: Ensure that the mmfs.cfg configuration file contains the correct host name or IP address of the primary node. Check mmfsd on the primary node.
6027-663 Lost connection to file system daemon.
Explanation: The connection between a GPFS command and the mmfsd daemon has broken. The daemon has probably crashed.
User response: Ensure that the mmfsd daemon is running. Check the error log.
6027-664 Unexpected message from file system daemon.
Explanation: The version of the mmfsd daemon does not match the version of the GPFS command.
User response: Ensure that all GPFS software components are at the same version.
6027-665 Failed to connect to file system daemon: errorString
Explanation: An error occurred while trying to create a session with mmfsd.
User response: Ensure that the mmfsd daemon is running. Also, only root can run most GPFS commands. The mode bits of the commands must be set-user-id to root.
6027-666 Failed to determine file system manager.
Explanation: While running a GPFS command in a multiple node configuration, the local file system daemon is unable to determine which node is managing the file system affected by the command.
User response: Check internode communication configuration and ensure that enough GPFS nodes are up to form a quorum.
6027-667 Could not set up socket
Explanation: One of the calls to create or bind the socket used for sending parameters and messages between the command and the daemon failed.
User response: Check additional error messages.
6027-668 Could not send message to file system daemon
Explanation: Attempt to send a message to the file system failed.
User response: Check if the file system daemon is up and running.
6027-669 Could not connect to file system daemon.
Explanation: The TCP connection between the command and the daemon could not be established.
User response: Check additional error messages.
6027-670 Value for 'option' is not valid. Valid values are list.
Explanation: The specified value for the given command option was not valid. The remainder of the line will list the valid keywords.
User response: Correct the command line.
6027-671 Keyword missing or incorrect.
Explanation: A missing or incorrect keyword was encountered while parsing command line arguments.
User response: Correct the command line.
6027-672 Too few arguments specified.
Explanation: Too few arguments were specified on the command line.
User response: Correct the command line.
6027-673 Too many arguments specified.
Explanation: Too many arguments were specified on the command line.
User response: Correct the command line.
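Several of the daemon-connection messages above (6027-663 through 6027-669) come down to whether mmfsd is actually running. A sketch; the log location shown is the usual one but may differ on your installation:
  # Show the GPFS daemon state on every node in the cluster
  mmgetstate -a
  # If the local daemon is down, review its log before restarting it
  tail -n 50 /var/adm/ras/mmfs.log.latest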


6027-674 Too many values specified for option name.
Explanation: Too many values were specified for the given option on the command line.
User response: Correct the command line.
6027-675 Required value for option is missing.
Explanation: A required value was not specified for the given option on the command line.
User response: Correct the command line.
6027-676 Option option specified more than once.
Explanation: The named option was specified more than once on the command line.
User response: Correct the command line.
6027-677 Option option is incorrect.
Explanation: An incorrect option was specified on the command line.
User response: Correct the command line.
6027-678 Misplaced or incorrect parameter name.
Explanation: A misplaced or incorrect parameter was specified on the command line.
User response: Correct the command line.
6027-679 Device name is not valid.
Explanation: An incorrect device name was specified on the command line.
User response: Correct the command line.
6027-680 [E] Disk failure. Volume name. rc = value. Physical volume name.
Explanation: An I/O request to a disk or a request to fence a disk has failed in such a manner that GPFS can no longer use the disk.
User response: Check the disk hardware and the software subsystems in the path to the disk.
6027-681 Required option name was not specified.
Explanation: A required option was not specified on the command line.
User response: Correct the command line.
6027-682 Device argument is missing.
Explanation: The device argument was not specified on the command line.
User response: Correct the command line.
6027-683 Disk name is invalid.
Explanation: An incorrect disk name was specified on the command line.
User response: Correct the command line.
6027-684 Value value for option is incorrect.
Explanation: An incorrect value was specified for the named option.
User response: Correct the command line.
6027-685 Value value for option option is out of range. Valid values are number through number.
Explanation: An out of range value was specified for the named option.
User response: Correct the command line.
6027-686 option (value) exceeds option (value).
Explanation: The value of the first option exceeds the value of the second option. This is not permitted.
User response: Correct the command line.
6027-687 Disk name is specified more than once.
Explanation: The named disk was specified more than once on the command line.


User response: Correct the command line.
6027-688 Failed to read file system descriptor.
Explanation: The disk block containing critical information about the file system could not be read from disk.
User response: This is usually an error in the path to the disks. If there are associated messages indicating an I/O error such as ENODEV or EIO, correct that error and retry the operation. If there are no associated I/O errors, then run the mmfsck command with the file system unmounted.
6027-689 Failed to update file system descriptor.
Explanation: The disk block containing critical information about the file system could not be written to disk.
User response: This is a serious error, which may leave the file system in an unusable state. Correct any I/O errors, then run the mmfsck command with the file system unmounted to make repairs.
6027-690 Failed to allocate I/O buffer.
Explanation: Could not obtain enough memory (RAM) to perform an operation.
User response: Either retry the operation when the mmfsd daemon is less heavily loaded, or increase the size of one or more of the memory pool parameters by issuing the mmchconfig command.
6027-691 Failed to send message to node nodeName.
Explanation: A message to another file system node could not be sent.
User response: Check additional error message and the internode communication configuration.
6027-692 Value for option is not valid. Valid values are yes, no.
Explanation: An option that is required to be yes or no is neither.
User response: Correct the command line.
6027-693 Cannot open disk name.
Explanation: Could not access the given disk.
User response: Check the disk hardware and the path to the disk.
6027-694 Disk not started; disk name has a bad volume label.
Explanation: The volume label on the disk does not match that expected by GPFS.
User response: Check the disk hardware. For hot-pluggable drives, ensure that the proper drive has been plugged in.
6027-695 [E] File system is read-only.
Explanation: An operation was attempted that would require modifying the contents of a file system, but the file system is read-only.
User response: Make the file system R/W before retrying the operation.
6027-696 [E] Too many disks are unavailable.
Explanation: A file system operation failed because all replicas of a data or metadata block are currently unavailable.
User response: Issue the mmlsdisk command to check the availability of the disks in the file system; correct disk hardware problems, and then issue the mmchdisk command with the start option to inform the file system that the disk or disks are available again.
6027-697 [E] No log available.
Explanation: A file system operation failed because no space for logging metadata changes could be found.
User response: Check additional error message. A likely reason for this error is that all disks with available log space are currently unavailable.
6027-698 [E] Not enough memory to allocate internal data structure.
Explanation: A file system operation failed because no memory is available for allocating internal data structures.
User response: Stop other processes that may have main memory pinned for their use.
6027-699 [E] Inconsistency in file system metadata.
Explanation: File system metadata on disk has been corrupted.
User response: This is an extremely serious error that may cause loss of data. Issue the mmfsck command with the file system unmounted to make repairs. There will be a POSSIBLE FILE CORRUPTION entry in the system error log that should be forwarded to the IBM Support Center.
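Messages 6027-696 and 6027-699 describe the two standard recovery steps: bring repaired disks back with mmchdisk, and run mmfsck offline if metadata is inconsistent. A sketch; gpfs1 and the disk names are placeholders:
  # Check which disks the file system considers unavailable
  mmlsdisk gpfs1
  # Tell GPFS that the repaired disks can be used again
  mmchdisk gpfs1 start -d "nsd1;nsd2"
  # For metadata inconsistencies, unmount everywhere and repair offline
  mmumount gpfs1 -a
  mmfsck gpfs1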


User response:
This is an extremely serious error that may cause loss of data. Issue the mmfsck command with the file system unmounted to make repairs. There will be a POSSIBLE FILE CORRUPTION entry in the system error log that should be forwarded to the IBM Support Center.
6027-700 [E] Log recovery failed.
Explanation:
An error was encountered while restoring file system metadata from the log.
User response:
Check additional error message. A likely reason for this error is that none of the replicas of the log could be accessed because too many disks are currently unavailable. If the problem persists, issue the mmfsck command with the file system unmounted.
6027-701 [X] Some file system data are inaccessible at this time. Check error log for additional information. After correcting the problem, the file system must be unmounted and then mounted to restore normal data access.
Explanation:
The file system has encountered an error that is serious enough to make some or all data inaccessible. This message indicates that an error occurred that left the file system in an unusable state.
User response:
Possible reasons include too many unavailable disks or insufficient memory for file system control structures. Check other error messages as well as the error log for additional information. Unmount the file system and correct any I/O errors. Then remount the file system and try the operation again. If the problem persists, issue the mmfsck command with the file system unmounted to make repairs.
6027-702 [X] Some file system data are inaccessible at this time.
Explanation:
The file system has encountered an error that is serious enough to make some or all data inaccessible. This message indicates that an error occurred that left the file system in an unusable state.
User response:
Possible reasons include too many unavailable disks or insufficient memory for file system control structures. Check other error messages as well as the error log for additional information. Unmount the file system and correct any I/O errors. Then remount the file system and try the operation again. If the problem persists, issue the mmfsck command with the file system unmounted to make repairs.
6027-703 [X] Some file system data are inaccessible at this time. Check error log for additional information.
Explanation:
The file system has encountered an error that is serious enough to make some or all data inaccessible. This message indicates that an error occurred that left the file system in an unusable state.
User response:
Possible reasons include too many unavailable disks or insufficient memory for file system control structures. Check other error messages as well as the error log for additional information. Unmount the file system and correct any I/O errors. Then remount the file system and try the operation again. If the problem persists, issue the mmfsck command with the file system unmounted to make repairs.
6027-704 Attention: Due to an earlier error normal access to this file system has been disabled. Check error log for additional information. After correcting the problem, the file system must be unmounted and then mounted again to restore normal data access.
Explanation:
The file system has encountered an error that is serious enough to make some or all data inaccessible. This message indicates that an error occurred that left the file system in an unusable state.
User response:
Possible reasons include too many unavailable disks or insufficient memory for file system control structures. Check other error messages as well as the error log for additional information. Unmount the file system and correct any I/O errors. Then remount the file system and try the operation again. If the problem persists, issue the mmfsck command with the file system unmounted to make repairs.
6027-705 Error code value.
Explanation:
Provides additional information about an error.
User response:
See accompanying error messages.


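Several of the messages above (6027-699 through 6027-704) direct you to run the mmfsck command with the file system unmounted. The following is a minimal sketch of that repair sequence, assuming a hypothetical file system device named gpfs1; substitute your own device name:

   mmumount gpfs1 -a     # unmount the file system on all nodes
   mmfsck gpfs1 -n       # check only; report inconsistencies without repairing anything
   mmfsck gpfs1 -y       # repair pass; answer yes to all repair prompts
   mmmount gpfs1 -a      # remount the file system on all nodes

Run the check-only pass first so that the reported damage can be reviewed, and forwarded to the IBM Support Center if requested, before any repairs are made.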
6027-706 The device name has no corresponding entry in fileName or has an incomplete entry.
Explanation:
The command requires a device that has a file system associated with it.
User response:
Check the operating system's file system database (the given file) for a valid device entry.
6027-707 Unable to open file fileName.
Explanation:
The named file cannot be opened.
User response:
Check that the file exists and has the correct permissions.
6027-708 Keyword name is incorrect. Valid values are list.
Explanation:
An incorrect keyword was encountered.
User response:
Correct the command line.
6027-709 Incorrect response. Valid responses are "yes", "no", or "noall"
Explanation:
A question was asked that requires a yes or no answer. The answer entered was neither yes, no, nor noall.
User response:
Enter a valid response.
6027-710 Attention:
Explanation:
Precedes an attention message.
User response:
None. Informational message only.
6027-711 [E] Specified entity, such as a disk or file system, does not exist.
Explanation:
A file system operation failed because the specified entity, such as a disk or file system, could not be found.
User response:
Specify an existing disk, file system, or other entity.
6027-712 [E] Error in communications between mmfsd daemon and client program.
Explanation:
A message sent between the mmfsd daemon and the client program had an incorrect format or content.
User response:
Verify that the mmfsd daemon is running.
6027-713 Unable to start because conflicting program name is running. Waiting until it completes.
Explanation:
A program detected that it cannot start because a conflicting program is running. The program will automatically start once the conflicting program has ended, as long as there are no other conflicting programs running at that time.
User response:
None. Informational message only.
6027-714 Terminating because conflicting program name is running.
Explanation:
A program detected that it must terminate because a conflicting program is running.
User response:
Reissue the command once the conflicting program has ended.
6027-715 command is finished waiting. Starting execution now.
Explanation:
A program detected that it can now begin running because a conflicting program has ended.
User response:
None. Informational message only.
6027-716 [E] Some file system data or metadata has been lost.
Explanation:
Unable to access some piece of file system data that has been lost due to the deletion of disks beyond the replication factor.
User response:
If the function did not complete, try to mount the file system in restricted mode.
6027-717 [E] Must execute mmfsck before mount.
Explanation:
An attempt has been made to mount a file system on which an incomplete mmfsck command was run.
User response:
Reissue the mmfsck command to repair the file system, then reissue the mount command.
6027-718 The mmfsd daemon is not ready to handle commands yet.
Explanation:
The mmfsd daemon is not accepting messages because it is restarting or stopping.
User response:
None. Informational message only.



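Messages 6027-712 and 6027-718 ask you to verify the state of the mmfsd daemon. A quick way to do that, sketched here with no cluster-specific assumptions:

   mmgetstate -a      # show the GPFS daemon state (active, down, arbitrating) on every node
   mmstartup -a       # start the daemon on all nodes if it is down

A node must report active before it can process file system commands.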
6027-719 [E] Device type not supported.
Explanation:
A disk being added to a file system with the mmadddisk or mmcrfs command is not a character mode special file, or has characteristics not recognized by GPFS.
User response:
Check the characteristics of the disk being added to the file system.
6027-720 [E] Actual sector size does not match given sector size.
Explanation:
A disk being added to a file system with the mmadddisk or mmcrfs command has a physical sector size that differs from that given in the disk description list.
User response:
Check the physical sector size of the disk being added to the file system.
6027-721 [E] Host 'name' in fileName is not valid.
Explanation:
A host name or IP address that is not valid was found in a configuration file.
User response:
Check the configuration file specified in the error message.
6027-722 Attention: Due to an earlier error normal access to this file system has been disabled. Check error log for additional information. The file system must be mounted again to restore normal data access.
Explanation:
The file system has encountered an error that is serious enough to make some or all data inaccessible. This message indicates that an error occurred that left the file system in an unusable state. Possible reasons include too many unavailable disks or insufficient memory for file system control structures.
User response:
Check other error messages as well as the error log for additional information. Correct any I/O errors. Then, remount the file system and try the operation again. If the problem persists, issue the mmfsck command with the file system unmounted to make repairs.
6027-723 Attention: Due to an earlier error normal access to this file system has been disabled. Check error log for additional information. After correcting the problem, the file system must be mounted again to restore normal data access.
Explanation:
The file system has encountered an error that is serious enough to make some or all data inaccessible. This message indicates that an error occurred that left the file system in an unusable state. Possible reasons include too many unavailable disks or insufficient memory for file system control structures.
User response:
Check other error messages as well as the error log for additional information. Correct any I/O errors. Then, remount the file system and try the operation again. If the problem persists, issue the mmfsck command with the file system unmounted to make repairs.
6027-724 [E] Incompatible file system format.
Explanation:
An attempt was made to access a file system that was formatted with an older version of the product that is no longer compatible with the version currently running.
User response:
To change the file system format version to the current version, issue the -V option on the mmchfs command.
6027-725 The mmfsd daemon is not ready to handle commands yet. Waiting for quorum.
Explanation:
The GPFS mmfsd daemon is not accepting messages because it is waiting for quorum.
User response:
Determine why insufficient nodes have joined the group to achieve quorum and rectify the problem.
6027-726 [E] Quota initialization/start-up failed.
Explanation:
Quota manager initialization was unsuccessful. The file system manager finished without quotas. Subsequent client mount requests will fail.
User response:
Check the error log and correct I/O errors. It may be necessary to issue the mmcheckquota command with the file system unmounted.
6027-727 Specified driver type type does not match disk name driver type type.
Explanation:
The driver type specified on the mmchdisk command does not match the current driver type of the disk.
User response:
Verify the driver type and reissue the command.



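For message 6027-724, the file system format can be brought up to the current version with the -V option of the mmchfs command, as the user response describes. A hedged example, assuming a file system named gpfs1:

   mmlsfs gpfs1 -V         # display the current file system format version
   mmchfs gpfs1 -V compat  # enable only changes that are compatible with earlier releases
   mmchfs gpfs1 -V full    # enable all new format features; nodes on older releases can no longer mount the file system

Choose compat if nodes that run an earlier release still need to mount the file system.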
6027-728 Specified sector size value does Reissue the mmedquota command. Change only the
not match disk name sector size values for the limits and follow the instructions given.
value. 6027-734 [W] Quota check for 'fileSystem' ended
Explanation: prematurely.
The sector size specified on the mmchdisk command Explanation:
does not match the current sector size of the disk.
The user interrupted and terminated the command.
User response: User response:
Verify the sector size and reissue the command. If ending the command was not intended, reissue the
6027-729 Attention: No changes for disk mmcheckquota command.
name were specified. 6027-735 Error editing string from mmfsd.
Explanation: Explanation:
The disk descriptor in the mmchdisk command does An internal error occurred in the mmfsd when editing a
not specify that any changes are to be made to the string.
disk.
User response:
User response: None. Informational message only.
Check the disk descriptor to determine if changes are
needed. 6027-736 Attention: Due to an earlier error
normal access to this file system
6027-730 command on fileSystem. has been disabled. Check error
Explanation: log for additional information. The
Quota was activated or deactivated as stated file system must be unmounted
as a result of the mmquotaon, mmquotaoff, and then mounted again to restore
mmdefquotaon, or mmdefquotaoff commands. normal data access.
User response: Explanation:
None, informational only. This message is enabled The file system has encountered an error that is
with the -v option on the mmquotaon, mmquotaoff, serious enough to make some or all data inaccessible.
mmdefquotaon, or mmdefquotaoff commands. This message indicates that an error occurred that left
the file system in an unusable state. Possible reasons
6027-731 Error number while performing include too many unavailable disks or insufficient
command for name quota on memory for file system control structures.
fileSystem
User response:
Explanation: Check other error messages as well as the error log
An error occurred when switching quotas of a certain for additional information. Unmount the file system
type on or off. If errors were returned for multiple file and correct any I/O errors. Then, remount the file
systems, only the error code is shown. system and try the operation again. If the problem
User response: persists, issue the mmfsck command with the file
Check the error code shown by the message to system unmounted to make repairs.
determine the reason. 6027-737 Attention: No metadata disks
6027-732 Error while performing command remain.
on fileSystem. Explanation:
Explanation: The mmchdisk command has been issued, but no
An error occurred while performing the stated metadata disks remain.
command when listing or reporting quotas. User response:
User response: None. Informational message only.
None. Informational message only. 6027-738 Attention: No data disks remain.
6027-733 Edit quota: Incorrect format! Explanation:
Explanation: The mmchdisk command has been issued, but no data
The format of one or more edited quota limit entries disks remain.
was not correct. User response:
User response: None. Informational message only.



6027-739 Attention: Due to an earlier Issue an mmchdisk start command when more
configuration change the file disks are available.
system is no longer properly 6027-744 Unable to run command while
balanced. the file system is mounted in
Explanation: restricted mode.
The mmlsdisk command found that the file system is
Explanation:
not properly balanced. A command that can alter the data in a file system was
User response: issued while the file system was mounted in restricted
Issue the mmrestripefs -b command at your mode.
convenience. User response:
6027-740 Attention: Due to an earlier Mount the file system in read-only or read-write mode
configuration change the file or unmount the file system and then reissue the
system is no longer properly command.
replicated. 6027-745 fileSystem: no quotaType quota
Explanation: management enabled.
The mmlsdisk command found that the file system is Explanation:
not properly replicated. A quota command of the cited type was issued for
User response: the cited file system when no quota management was
Issue the mmrestripefs -r command at your enabled.
convenience User response:
6027-741 Attention: Due to an earlier Enable quota management and reissue the command.
configuration change the file 6027-746 Editing quota limits for this user or
system may contain data that is at group not permitted.
risk of being lost.
Explanation:
Explanation: The root user or system group was specified for
The mmlsdisk command found that critical data quota limit editing in the mmedquota command.
resides on disks that are suspended or being deleted.
User response:
User response: Specify a valid user or group in the mmedquota
Issue the mmrestripefs -m command as soon as command. Editing quota limits for the root user or
possible. system group is prohibited.
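Messages 6027-739 through 6027-743 point to the mmrestripefs and mmchdisk commands. A minimal sketch, assuming a hypothetical file system named gpfs1 and a disk named nsd1:

   mmchdisk gpfs1 start -d nsd1   # bring a stopped or unavailable disk back into service
   mmrestripefs gpfs1 -b          # rebalance data across all disks
   mmrestripefs gpfs1 -r          # restore proper replication
   mmrestripefs gpfs1 -m          # migrate critical data off suspended or deleted disks

The -b and -r passes can be run at your convenience; run -m promptly because the data it protects is at risk of being lost.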
6027-742 Error occurred while executing a 6027-747 [E] Too many nodes in cluster (max
command for fileSystem. number) or file system (max
Explanation: number).
A quota command encountered a problem on a Explanation:
file system. Processing continues with the next file The operation cannot succeed because too many
system. nodes are involved.
User response: User response:
None. Informational message only. Reduce the number of nodes to the applicable stated
6027-743 Initial disk state was updated limit.
successfully, but another error 6027-748 fileSystem: no quota management
may have changed the state again. enabled
Explanation: Explanation:
The mmchdisk command encountered an error after A quota command was issued for the cited file system
the disk status or availability change was already when no quota management was enabled.
recorded in the file system configuration. The most
likely reason for this problem is that too many disks User response:
have become unavailable or are still unavailable after Enable quota management and reissue the command.
the disk state change. 6027-749 Pool size changed to number K =
User response: number M.
Explanation:



Pool size successfully changed. User response:
None. Informational message only.
User response:
None. Informational message only. 6027-756 [E] Configuration invalid or
inconsistent between different
6027-750 [E] The node address ipAddress is not
nodes.
defined in the node list
Explanation:
Explanation:
Self-explanatory.
An address does not exist in the GPFS configuration
file. User response:
Check cluster and file system configuration.
User response:
Perform required configuration steps prior to starting 6027-757 name is not an excluded disk.
GPFS on the node.
Explanation:
6027-751 [E] Error code value Some of the disks passed to the mmfsctl include
command are not marked as excluded in the
Explanation:
mmsdrfs file.
Provides additional information about an error.
User response:
User response:
Verify the list of disks supplied to this command.
See accompanying error messages.
6027-758 Disk(s) not started; disk name has
6027-752 [E] Lost membership in cluster
a bad volume label.
clusterName. Unmounting file
systems. Explanation:
The volume label on the disk does not match that
Explanation:
expected by GPFS.
This node has lost membership in the cluster. Either
GPFS is no longer available on enough nodes to User response:
maintain quorum, or this node could not communicate Check the disk hardware. For hot-pluggable drives,
with other members of the quorum. This could be make sure the proper drive has been plugged in.
caused by a communications failure between nodes,
6027-759 fileSystem is still in use.
or multiple GPFS failures.
Explanation:
User response:
The mmfsctl include command found that the
See associated error logs on the failed nodes for
named file system is still mounted, or another GPFS
additional problem determination information.
command is running against the file system.
6027-753 [E] Could not run command command
User response:
Explanation: Unmount the file system if it is mounted, or wait
The GPFS daemon failed to run the specified for GPFS commands in progress to terminate before
command. retrying the command.
User response: 6027-760 [E] Unable to perform i/o to the disk.
Verify correct installation. This node is either fenced from
accessing the disk or this node's
6027-754 Error reading string for mmfsd.
disk lease has expired.
Explanation:
Explanation:
GPFS could not properly read an input string.
A read or write to the disk failed due to either being
User response: fenced from the disk or no longer having a disk lease.
Check that GPFS is properly installed.
User response:
6027-755 [I] Waiting for challenge Verify disk hardware fencing setup is correct if being
challengeValue (node nodeNumber, used. Ensure network connectivity between this node
sequence sequenceNumber) to be and other nodes is operational.
responded during disk election
6027-761 [W] Attention: excessive timer drift
Explanation: between node and node (number
The node has challenged another node, which won the over number sec).
previous election and is waiting for the challenger to
Explanation:
respond.



GPFS has detected an unusually large difference in This node sent an expel request to the cluster
the rate of clock ticks (as returned by the times() manager node to expel another node.
system call) between two nodes. Another node's TOD
User response:
clock and tick rate changed dramatically relative to
Check network connection between this node and the
this node's TOD clock and tick rate.
node specified above.
User response:
6027-768 Wrong number of operands for
Check error log for hardware or device driver problems
mmpmon command 'command'.
that might cause timer interrupts to be lost or a recent
large adjustment made to the TOD clock. Explanation:
The command read from the input file has the wrong
6027-762 No quota enabled file system
number of operands.
found.
User response:
Explanation:
Correct the command invocation and reissue the
There is no quota-enabled file system in this cluster.
command.
User response:
6027-769 Malformed mmpmon command
None. Informational message only.
'command'.
6027-763 uidInvalidate: Incorrect option
Explanation:
option.
The command read from the input file is malformed,
Explanation: perhaps with an unknown keyword.
An incorrect option passed to the uidinvalidate
User response:
command.
Correct the command invocation and reissue the
User response: command.
Correct the command invocation.
6027-770 Error writing user.quota file.
6027-764 Error invalidating UID remapping
Explanation:
cache for domain.
An error occurred while writing the cited quota file.
Explanation:
User response:
An incorrect domain name passed to the
Check the status and availability of the disks and
uidinvalidate command.
reissue the command.
User response:
6027-771 Error writing group.quota file.
Correct the command invocation.
Explanation:
6027-765 [W] Tick value hasn't changed for
An error occurred while writing the cited quota file.
nearly number seconds
User response:
Explanation:
Check the status and availability of the disks and
Clock ticks incremented by AIX have not been
reissue the command.
incremented.
6027-772 Error writing fileset.quota file.
User response:
Check the error log for hardware or device driver Explanation:
problems that might cause timer interrupts to be lost. An error occurred while writing the cited quota file.
6027-766 [N] This node will be expelled from User response:
cluster cluster due to expel msg Check the status and availability of the disks and
from node reissue the command.
Explanation: 6027-773 fileSystem: quota check may be
This node is being expelled from the cluster. incomplete because of SANergy
activity on number files.
User response:
Check the network connection between this node and Explanation:
the node specified above. The online quota check may be incomplete due to
active SANergy activities on the file system.
6027-767 [N] Request sent to node to expel node
from cluster cluster User response:
Reissue the quota check when there is no SANergy
Explanation:
activity.



6027-774 fileSystem: quota management is Explanation:
not enabled, or one or more quota The fileset name provided on the command line is
clients are not available. incorrect.
Explanation: User response:
An attempt was made to perform quotas commands Correct the fileset name and reissue the command.
without quota management enabled, or one or more
6027-780 Incorrect path to fileset junction
quota clients failed during quota check. junctionName.
User response: Explanation:
Correct the cause of the problem, and then reissue the The path to the fileset junction is incorrect.
quota command.
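Messages 6027-745 through 6027-748, 6027-774, and 6027-775 assume that quota management is enabled and that quota data is consistent. A hedged example of the related commands, using a hypothetical file system gpfs1 and user jsmith:

   mmchfs gpfs1 -Q yes     # enable quota management for the file system
   mmcheckquota gpfs1      # recount quota usage, for example after a failed or interrupted check
   mmedquota -u jsmith     # edit quota limits for a user (not permitted for the root user)
   mmrepquota gpfs1        # report current usage and limits for the file system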
User response:
6027-775 During mmcheckquota processing, Correct the junction path and reissue the command.
number node(s) failed.
It is recommended that 6027-781 Storage pools have not been
mmcheckquota be repeated. enabled for file system fileSystem.
Explanation: Explanation:
Nodes failed while an online quota check was running. The user invoked a command with a storage pool
option (-p or -P) before storage pools were enabled.
User response:
Reissue the quota check command. User response:
Enable storage pools with the mmchfs -V command,
6027-776 fileSystem: There was not enough or correct the command invocation and reissue the
space for the report. Please repeat command.
quota check!
6027-784 [E] Device not ready.
Explanation:
The vflag is set in the tscheckquota command, Explanation:
but either no space or not enough space could be A device is not ready for operation.
allocated for the differences to be printed. User response:
User response: Check previous messages for further information.
Correct the space problem and reissue the quota 6027-785 [E] Cannot establish connection.
check.
Explanation:
6027-777 [I] Recovering nodes: nodeList This node cannot establish a connection to another
Explanation: node.
Recovery for one or more nodes has begun. User response:
User response: Check previous messages for further information.
No response is needed if this message is followed by 6027-786 [E] Message failed because the
'recovered nodes' entries specifying the nodes. If this
destination node refused the
message is not followed by such a message, determine
connection.
why recovery did not complete.
Explanation:
6027-778 [I] Recovering nodes in cluster This node sent a message to a node that refuses to
cluster: nodeList establish a connection.
Explanation: User response:
Recovery for one or more nodes in the cited cluster Check previous messages for further information.
has begun.
6027-787 [E] Security configuration data is
User response:
inconsistent or unavailable.
No response is needed if this message is followed
by 'recovered nodes' entries on the cited cluster Explanation:
specifying the nodes. If this message is not followed There was an error configuring security on this node.
by such a message, determine why recovery did not User response:
complete. Check previous messages for further information.
6027-779 Incorrect fileset name 6027-788 [E] Failed to load or initialize security
filesetName. library.



Explanation: An incorrect file name was specified to tschpolicy.
There was an error loading or initializing the security
User response:
library on this node.
Correct the command invocation and reissue the
User response: command.
Check previous messages for further information.
6027-796 Failed to read fileName: errorCode.
6027-789 Unable to read offsets offset to
Explanation:
offset for inode inode snap snap,
An incorrect file name was specified to tschpolicy.
from disk diskName, sector sector.
User response:
Explanation:
Correct the command invocation and reissue the
The mmdeldisk -c command found that the cited
command.
addresses on the cited disk represent data that is no
longer readable. 6027-797 Failed to stat fileName: errorCode.
User response: Explanation:
Save this output for later use in cleaning up failing An incorrect file name was specified to tschpolicy.
disks.
User response:
6027-790 Specified storage pool poolName Correct the command invocation and reissue the
does not match disk diskName command.
storage pool poolName. Use
6027-798 Policy files are limited to number
mmdeldisk and mmadddisk to
bytes.
change a disk's storage pool.
Explanation:
Explanation:
A user-specified policy file exceeded the maximum-
An attempt was made to change a disk's storage pool
allowed length.
assignment using the mmchdisk command. This can
only be done by deleting the disk from its current User response:
storage pool and then adding it to the new pool. Install a smaller policy file.
User response: 6027-799 Policy `policyName' installed and
Delete the disk from its current storage pool and then broadcast to all nodes.
add it to the new pool.
Explanation:
6027-792 Policies have not been enabled for Self-explanatory.
file system fileSystem.
User response:
Explanation: None. Informational message only.
The cited file system must be upgraded to use policies.
6027-850 Unable to issue this command
User response: from a non-root user.
Upgrade the file system via the mmchfs -V command.
Explanation:
6027-793 No policy file was installed for file tsiostat requires root privileges to run.
system fileSystem.
User response:
Explanation: Get the system administrator to change the executable
No policy file was installed for this file system. to set the UID to 0.
User response: 6027-851 Unable to process interrupt
Install a policy file. received.
6027-794 Failed to read policy file for file Explanation:
system fileSystem. An interrupt occurred that tsiostat cannot process.
Explanation: User response:
Failed to read the policy file for the requested file Contact the IBM Support Center.
system.
6027-852 interval and count must be
User response: positive integers.
Reinstall the policy file.
Explanation:
6027-795 Failed to open fileName: errorCode. Incorrect values were supplied for tsiostat
parameters.
Explanation:



User response: Check for additional error messages. Resolve the
Correct the command invocation and reissue the problems before reattempting the failing operation.
command.
6027-860 -d is not appropriate for an NFSv4
6027-853 interval must be less than 1024. ACL
Explanation: Explanation:
An incorrect value was supplied for the interval Produced by the mmgetacl or mmputacl commands
parameter. when the -d option was specified, but the object has
an NFS Version 4 ACL (does not have a default).
User response:
Correct the command invocation and reissue the User response:
command. None. Informational message only.
6027-854 count must be less than 1024. 6027-861 Set afm ctl failed
Explanation: Explanation:
An incorrect value was supplied for the count The tsfattr call failed.
parameter.
User response:
User response: Check for additional error messages. Resolve the
Correct the command invocation and reissue the problems before reattempting the failing operation.
command.
6027-862 Incorrect storage pool name
6027-855 Unable to connect to server, poolName.
mmfsd is not started.
Explanation:
Explanation: An incorrect storage pool name was provided.
The tsiostat command was issued but the file
User response:
system is not started.
Determine the correct storage pool name and reissue
User response: the command.
Contact your system administrator.
6027-863 File cannot be assigned to storage
6027-856 No information to report. pool 'poolName'.
Explanation: Explanation:
The tsiostat command was issued but no file The file cannot be assigned to the specified pool.
systems are mounted.
User response:
User response: Determine the correct storage pool name and reissue
Contact your system administrator. the command.
6027-857 Error retrieving values. 6027-864 Set storage pool failed.
Explanation: Explanation:
The tsiostat command was issued and an internal An incorrect storage pool name was provided.
error occurred.
User response:
User response: Determine the correct storage pool name and reissue
Contact the IBM Support Center. the command.
6027-858 File system not mounted. 6027-865 Restripe file data failed.
Explanation: Explanation:
The requested file system is not mounted. An error occurred while restriping the file data.
User response: User response:
Mount the file system and reattempt the failing Check the error code and reissue the command.
operation.
6027-866 [E] Storage pools have not been
6027-859 Set DIRECTIO failed enabled for this file system.
Explanation: Explanation:
The tsfattr call failed. The user invoked a command with a storage pool
option (-p or -P) before storage pools were enabled.
User response:
User response:



Enable storage pools via mmchfs -V, or correct the Explanation:
command invocation and reissue the command. An error occurred while attempting to access the
named GPFS file system or path.
6027-867 Change storage pool is not
permitted. User response:
Verify the invocation parameters and make sure the
Explanation:
command is running under a user ID with sufficient
The user tried to change a file's assigned storage pool
authority (root or administrator privileges). Mount the
but was not root or superuser.
GPFS file system. Correct the command invocation and
User response: reissue the command.
Reissue the command as root or superuser.
6027-873 [W] Error on
6027-868 mmchattr failed. gpfs_stat_inode([pathName/
fileName],inodeNumber.genNumbe
Explanation:
r): errorString
An error occurred while changing a file's attributes.
Explanation:
User response:
An error occurred during a gpfs_stat_inode
Check the error code and reissue the command.
operation.
6027-869 File replication exceeds number
User response:
of failure groups in destination
Reissue the command. If the problem persists, contact
storage pool.
the IBM Support Center.
Explanation:
6027-874 [E] Error: incorrect Date@Time (YYYY-
The tschattr command received incorrect command
MM-DD@HH:MM:SS) specification:
line arguments.
specification
User response:
Explanation:
Correct the command invocation and reissue the
The Date@Time command invocation argument could
command.
not be parsed.
6027-870 [E] Error on getcwd(): errorString. Try
User response:
an absolute path instead of just
Correct the command invocation and try
pathName
again. The syntax should look similar to:
Explanation: 2005-12-25@07:30:00.
The getcwd system call failed.
6027-875 [E] Error on gpfs_stat(pathName):
User response: errorString
Specify an absolute path starting with '/' on the
Explanation:
command invocation, so that the command will not
An error occurred while attempting to stat() the
need to invoke getcwd.
cited path name.
6027-871 [E] Error on
User response:
gpfs_get_pathname_from_
Determine whether the cited path name exists and
fssnaphandle(pathName):
is accessible. Correct the command arguments as
errorString.
necessary and reissue the command.
Explanation:
6027-876 [E] Error starting directory
An error occurred during
scan(pathName): errorString
a gpfs_get_pathname_from_fssnaphandle
operation. Explanation:
The specified path name is not a directory.
User response:
Verify the invocation parameters and make sure the User response:
command is running under a user ID with sufficient Determine whether the specified path name exists
authority (root or administrator privileges). Specify and is an accessible directory. Correct the command
a GPFS file system device name or a GPFS directory arguments as necessary and reissue the command.
path name as the first argument. Correct the command
6027-877 [E] Error opening pathName:
invocation and reissue the command.
errorString
6027-872 [E] pathName is not within a mounted
Explanation:
GPFS file system.



An error occurred while attempting to open the Reissue the command. If the problem persists, contact
named file. Its pool and replication attributes remain the IBM Support Center.
unchanged.
6027-883 Error on
User response: gpfs_next_inode(maxInodeNumbe
Investigate the file and possibly reissue the command. r): errorString
The file may have been removed or locked by another
Explanation:
application.
An error occurred during a gpfs_next_inode
6027-878 [E] Error on gpfs_fcntl(pathName): operation.
errorString (offset=offset)
User response:
Explanation: Reissue the command. If the problem persists, contact
An error occurred while attempting fcntl on the the IBM Support Center.
named file. Its pool or replication attributes may not
6027-884 Error during directory scan
have been adjusted.
[E:nnn]
User response:
Explanation:
Investigate the file and possibly reissue the command.
A terminal error occurred during the directory scan
Use the mmlsattr and mmchattr commands to
phase of the command.
examine and change the pool and replication
attributes of the named file. User response:
Verify the command arguments. Reissue the
6027-879 [E] Error deleting pathName:
command. If the problem persists, contact the IBM
errorString
Support Center.
Explanation:
6027-885 Error during inode scan:
An error occurred while attempting to delete the
[E:nnn] errorString
named file.
Explanation:
User response:
A terminal error occurred during the inode scan phase
Investigate the file and possibly reissue the command.
of the command.
The file may have been removed or locked by another
application. User response:
Verify the command arguments. Reissue the
6027-880 Error on
command. If the problem persists, contact the IBM
gpfs_seek_inode(inodeNumber):
Support Center.
errorString
6027-886 Error during policy decisions scan
Explanation:
[E:nnn]
An error occurred during a gpfs_seek_inode
operation. Explanation:
A terminal error occurred during the policy decisions
User response:
phase of the command.
Reissue the command. If the problem persists, contact
the contact the IBM Support Center User response:
Verify the command arguments. Reissue the
6027-881 [E] Error on gpfs_iopen([rootPath/
command. If the problem persists, contact the IBM
pathName],inodeNumber):
Support Center.
errorString
6027-887 [W] Error on
Explanation:
gpfs_igetstoragepool(dataPoolId):
An error occurred during a gpfs_iopen operation.
errorString
User response:
Explanation:
Reissue the command. If the problem persists, contact
An error occurred during a gpfs_igetstoragepool
the IBM Support Center.
operation. Possible inode corruption.
6027-882 [E] Error on gpfs_ireaddir(rootPath/
User response:
pathName): errorString
Use mmfsck command. If the problem persists,
Explanation: contact the IBM Support Center.
An error occurred during a gpfs_ireaddir()
6027-888 [W] Error on
operation.
gpfs_igetfilesetname(filesetId):
User response: errorString



Explanation: 6027-894 [X] Error on pthread_mutex_lock:
An error occurred during a gpfs_igetfilesetname errorString
operation. Possible inode corruption.
Explanation:
User response: An error occurred during a pthread_mutex_lock
Use mmfsck command. If the problem persists, operation.
contact the IBM Support Center.
User response:
6027-889 [E] Error on Contact the IBM Support Center.
gpfs_get_fssnaphandle(rootPath):
6027-895 [X] Error on pthread_mutex_unlock:
errorString.
errorString
Explanation:
Explanation:
An error occurred during a gpfs_get_fssnaphandle
An error occurred during a pthread_mutex_unlock
operation.
operation.
User response:
User response:
Reissue the command. If the problem persists, contact
Contact the IBM Support Center.
the IBM Support Center.
6027-896 [X] Error on pthread_cond_init:
6027-890 [E] Error on
errorString
gpfs_open_inodescan(rootPath):
errorString Explanation:
Explanation: An error occurred during a pthread_cond_init
operation.
An error occurred during a gpfs_open_inodescan()
operation. User response:
Contact the IBM Support Center.
User response:
Reissue the command. If the problem persists, contact 6027-897 [X] Error on pthread_cond_signal:
the IBM Support Center. errorString
6027-891 [X] WEIGHT(thresholdValue) Explanation:
UNKNOWN pathName An error occurred during a pthread_cond_signal
operation.
Explanation:
The named file was assigned the indicated weight, but User response:
the rule type is UNKNOWN. Contact the IBM Support Center.
User response: 6027-898 [X] Error on pthread_cond_broadcast:
Contact the IBM Support Center. errorString
6027-892 [E] Error on pthread_create: where Explanation:
#threadNumber_or_portNumber_or An error occurred during a
_ socketNumber: errorString pthread_cond_broadcast operation.
Explanation: User response:
An error occurred while creating the thread during a Contact the IBM Support Center.
pthread_create operation.
6027-899 [X] Error on pthread_cond_wait:
User response: errorString
Consider some of the command parameters that might
affect memory usage. For further assistance, contact Explanation:
An error occurred during a pthread_cond_wait
the IBM Support Center.
operation.
6027-893 [X] Error on pthread_mutex_init:
User response:
errorString
Contact the IBM Support Center.
Explanation:
6027-900 [E] Error opening work file fileName:
An error occurred during a pthread_mutex_init
operation. errorString
Explanation:
User response:
An error occurred while attempting to open the named
Contact the IBM Support Center.
work file.



User response: Consider some of the command parameters that might
Investigate the file and possibly reissue the command. affect memory usage. For further assistance, contact
Check that the path name is defined and accessible. the IBM Support Center.
6027-901 [E] Error writing to work file fileName: 6027-906 Error on system(command)
errorString [E:nnn]
Explanation: Explanation:
An error occurred while attempting to write to the An error occurred during the system call with the
named work file. specified argument string.
User response: User response:
Investigate the file and possibly reissue the command. Read and investigate related error messages.
Check that there is sufficient free space in the file
6027-907 Error from sort_file(inodeListname,
system.
[E:nnn] sortCommand,sortInodeOptions,te
6027-902 [E] Error parsing work file fileName. mpDir)
Service index: number
Explanation:
Explanation: An error occurred while sorting the named work
An error occurred while attempting to read the file using the named sort command with the given
specified work file. options and working directory.
User response:
Investigate the file and possibly reissue the command. User response
Make sure that there is enough free space in the file Check these:
system. If the error persists, contact the IBM Support
• The sort command is installed on your system.
Center.
• The sort command supports the given options.
6027-903 Error while loading policy rules.
[E:nnn] • The working directory is accessible.
• The file system has sufficient free space.
Explanation:
An error occurred while attempting to read or parse 6027-908 [W] Attention: In RULE 'ruleName'
the policy file, which may contain syntax errors. (ruleNumber), the pool named
Subsequent messages include more information about by "poolName 'poolType'" is not
the error. defined in the file system.
User response: Explanation:
Read all of the related error messages and try to The cited pool is not defined in the file system.
correct the problem.
6027-904 [E] Error returnCode from PD User response
writer for inode=inodeNumber Correct the rule and reissue the command.
pathname=pathName
This is not an irrecoverable error; the command will
Explanation: continue to run. Of course it will not find any files in an
An error occurred while writing the policy decision for incorrect FROM POOL and it will not be able to migrate
the candidate file with the indicated inode number any files to an incorrect TO POOL.
and path name to a work file. There probably will be
6027-909 [E] Error on pthread_join: where
related error messages.
#threadNumber: errorString
User response:
Explanation:
Read all the related error messages. Attempt to
An error occurred while reaping the thread during a
correct the problems.
pthread_join operation.
6027-905 [E] Error: Out of memory. Service
User response:
index: number
Contact the IBM Support Center.
Explanation:
6027-910 Error during policy execution
The command has exhausted virtual memory.
[E:nnn]
User response:
Explanation:



A terminating error occurred during the policy 6027-916 Too many disks unavailable to
execution phase of the command. properly balance file.
User response: Explanation:
Verify the command arguments and reissue the While restriping a file, the tschattr or
command. If the problem persists, contact the IBM tsrestripefile command found that there were
Support Center. too many disks unavailable to properly balance the file.
6027-911 [E] Error on changeSpecification User response:
change for pathName. errorString Reissue the command after adding or restarting file
system disks.
Explanation:
This message provides more details about a 6027-917 All replicas of a data block were
gpfs_fcntl() error. previously deleted.
User response: Explanation:
Use the mmlsattr and mmchattr commands to While restriping a file, the tschattr or
examine the file, and then reissue the change tsrestripefile command found that all replicas of
command. a data block were previously deleted.
6027-912 [E] Error on restriping of pathName. User response:
errorString Reissue the command after adding or restarting file
system disks.
Explanation:
This provides more details on a gpfs_fcntl() error. 6027-918 Cannot make this change to a
User response: nonzero length file.
Use the mmlsattr and mmchattr commands to Explanation:
examine the file and then reissue the restriping GPFS does not support the requested change to the
command. replication attributes.
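Messages 6027-911 through 6027-919 refer to the mmlsattr and mmchattr commands for examining and changing a file's storage pool and replication attributes. A small sketch with hypothetical path and pool names:

   mmlsattr -L /gpfs/gpfs1/bigfile                  # show storage pool, replication factors, and other attributes
   mmchattr -P datapool -r 2 /gpfs/gpfs1/bigfile    # assign the file to pool 'datapool' and request two data replicas

Make sure the target pool contains disks in enough failure groups to support the requested replication.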
6027-913 Desired replication exceeds User response:
number of failure groups. You may want to create a new file with the desired
attributes and then copy your data to that file and
Explanation:
rename it appropriately. Be sure that there are
While restriping a file, the tschattr or
sufficient disks assigned to the pool with different
tsrestripefile command found that the desired
failure groups to support the desired replication
replication exceeded the number of failure groups.
attributes.
User response:
6027-919 Replication parameter range error
Reissue the command after adding or restarting file
system disks. (value, value).
Explanation:
6027-914 Insufficient space in one of the
Similar to message 6027-918. The (a,b) numbers are
replica failure groups.
the allowable range of the replication attributes.
Explanation:
User response:
While restriping a file, the tschattr or
You may want to create a new file with the desired
tsrestripefile command found there was
attributes and then copy your data to that file and
insufficient space in one of the replica failure groups.
rename it appropriately. Be sure that there are
User response: sufficient disks assigned to the pool with different
Reissue the command after adding or restarting file failure groups to support the desired replication
system disks. attributes.
6027-915 Insufficient space to properly 6027-920 [E] Error on pthread_detach(self):
balance file. where: errorString
Explanation: Explanation:
While restriping a file, the tschattr or An error occurred during a pthread_detach
tsrestripefile command found that there was operation.
insufficient space to properly balance the file.
User response:
User response: Contact the IBM Support Center.
Reissue the command after adding or restarting file
system disks.



6027-921 [E] Error on socket The file cannot be assigned to the specified pool.
socketName(hostName): User response:
errorString Determine the correct storage pool name and reissue
Explanation: the command.
An error occurred during a socket operation. 6027-927 System file cannot be assigned to
User response: storage pool 'poolName'.
Verify any command arguments related to Explanation:
interprocessor communication and then reissue the The file cannot be assigned to the specified pool.
command. If the problem persists, contact the IBM
Support Center. User response:
Determine the correct storage pool name and reissue
6027-922 [X] Error in Mtconx - p_accepts should the command.
not be empty
6027-928 [E] Error: File system or device
Explanation: fileSystem has no global snapshot
The program discovered an inconsistency or logic error with name snapshotName.
within itself.
Explanation:
User response: The specified file system does not have a global
Contact the IBM Support Center. snapshot with the specified snapshot name.
6027-923 [W] Error - command client is an User response:
incompatible version: hostName Use the mmlssnapshot command to list the snapshot
protocolVersion names for the file system. Alternatively, specify the full
Explanation: pathname of the desired snapshot directory instead of
While operating in master/client mode, the command using the -S option.
discovered that the client is running an incompatible 6027-929 [W] Attention: In RULE 'ruleName'
version.
(ruleNumber), both pools
User response: 'poolName' and 'poolName' are
Ensure the same version of the command software is EXTERNAL. This is not a supported
installed on all nodes in the clusters and then reissue migration.
the command. Explanation:
6027-924 [X] Error - unrecognized client The command does not support migration between
response from hostName: two EXTERNAL pools.
clientResponse
Explanation: User response
Similar to message 6027-923, except this may be an Correct the rule and reissue the command.
internal logic error. Note: This is not an unrecoverable error. The
User response: command will continue to run.
Ensure the latest, same version software is installed
6027-930 [W] Attention: In RULE 'ruleName' LIST
on all nodes in the clusters and then reissue the
name 'listName' appears, but there
command. If the problem persists, contact the IBM
is no corresponding EXTERNAL
Support Center.
LIST 'listName' EXEC ... OPTS ...
6027-925 Directory cannot be assigned to rule to specify a program to
storage pool 'poolName'. process the matching files.
Explanation: Explanation:
The file cannot be assigned to the specified pool. There should be an EXTERNAL LIST rule for every list
named by your LIST rules.
User response:
Determine the correct storage pool name and reissue
the command. User response
Add an "EXTERNAL LIST listName EXEC scriptName
6027-926 Symbolic link cannot be assigned OPTS opts" rule.
to storage pool 'poolName'.
Note: This is not an unrecoverable error. For execution
Explanation: with -I defer, file lists are generated and saved, so



EXTERNAL LIST rules are not strictly necessary for specify a path name within that snapshot (for
correct execution. example, /gpfs/FileSystemName/.snapshots/
SnapShotName/Directory).
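For messages 6027-928 and 6027-934, the snapshot name and the scanned path must agree. A hedged example with a hypothetical file system gpfs1 and snapshot snap1:

   mmlssnapshot gpfs1                # list the global snapshots defined for the file system
   ls /gpfs/gpfs1/.snapshots/snap1   # a directory inside this snapshot can be given instead of using -S

Either pass the snapshot name with the -S option together with a path inside that snapshot, or give only the .snapshots path.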
6027-931 [E] Error - The policy evaluation phase
did not complete. 6027-935 [W] Attention: In RULE 'ruleName'
(ruleNumber) LIMIT or REPLICATE
Explanation: clauses are ignored; not supported
One or more errors prevented the policy evaluation for migration to EXTERNAL pool
phase from examining all of the files. 'storagePoolName'.
User response: Explanation:
Consider other messages emitted by the command. GPFS does not support the LIMIT or REPLICATE
Take appropriate action and then reissue the clauses during migration to external pools.
command.
User response:
6027-932 [E] Error - The policy execution phase Correct the policy rule to avoid this warning message.
did not complete.
6027-936 [W] Error - command master is an
Explanation: incompatible version.
One or more errors prevented the policy execution
phase from operating on each chosen file. Explanation:
While operating in master/client mode, the command
User response: discovered that the master is running an incompatible
Consider other messages emitted by the command. version.
Take appropriate action and then reissue the
command. User response:
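Message 6027-930 states that every LIST rule needs a matching EXTERNAL LIST rule that names the program to process the file list. A minimal policy fragment built from the syntax quoted in that message (the script path, list name, and threshold are hypothetical):

   RULE EXTERNAL LIST 'biglist' EXEC '/usr/local/bin/process_list.sh' OPTS '-v'
   RULE 'findbig' LIST 'biglist' WHERE FILE_SIZE > 1073741824

As the note above explains, with -I defer the EXTERNAL LIST rule may be omitted because the file lists are only generated and saved.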
Upgrade the command software on all nodes and
6027-933 [W] EXEC 'wouldbeScriptPathname' reissue the command.
of EXTERNAL POOL or LIST
'PoolOrListName' fails TEST with 6027-937 [E] Error creating shared temporary
code scriptReturnCode on this sub-directory subDirName:
node. subDirPath
Explanation: Explanation:
Each EXEC defined in an EXTERNAL POOL or LIST The mkdir command failed on the named subdirectory
rule is run in TEST mode on each node. Each path.
invocation that fails with a nonzero return code is User response:
reported. Command execution is terminated on any Specify an existing writable shared directory as
node that fails any of these tests. the shared temporary directory argument to the
User response: policy command. The policy command will create a
Correct the EXTERNAL POOL or LIST rule, the EXEC subdirectory within that.
script, or do nothing because this is not necessarily 6027-938 [E] Error closing work file fileName:
an error. The administrator may suppress execution
errorString
of the mmapplypolicy command on some nodes by
deliberately having one or more EXECs return nonzero Explanation:
codes. An error occurred while attempting to close the named
work file or socket.
6027-934 [W] Attention: Specified snapshot:
'SnapshotName' will be ignored User response:
because the path specified: Record the above information. Contact the IBM
'PathName' is not within that Support Center.
snapshot. 6027-939 [E] Error on
Explanation: gpfs_quotactl(pathName,command
The command line specified both a path name to be Code, resourceId): errorString
scanned and a snapshot name, but the snapshot name Explanation:
was not consistent with the path name. An error occurred while attempting
User response: gpfs_quotactl().
If you wanted the entire snapshot, just specify User response:
the GPFS file system name or device name.
If you wanted a directory within a snapshot,



Correct the policy rules and/or enable GPFS quota Explanation:
tracking. If problem persists contact the IBM Support The tsfattr call failed.
Center.
User response:
6027-940 Open failed. Check for additional error messages. Resolve the
problems before reattempting the failing operation.
Explanation:
The open() system call was not successful. 6027-949 [E] fileName: invalid clone attributes.
User response: Explanation:
Check additional error messages. Self explanatory.
6027-941 Set replication failed. User response:
Check for additional error messages. Resolve the
Explanation:
problems before reattempting the failing operation.
The open() system call was not successful.
6027-950 File cloning requires the 'fastea'
User response:
[E:nnn] feature to be enabled.
Check additional error messages.
Explanation:
6027-943 -M and -R are only valid for zero
The file system fastea feature is not enabled.
length files.
User response:
Explanation:
Enable the fastea feature by issuing the mmchfs -V
The mmchattr command received command line
and mmmigratefs --fastea commands.
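Message 6027-950 requires the fastea feature before file cloning can be used. A hedged example for a hypothetical file system named gpfs1, following the two commands named in the user response:

   mmchfs gpfs1 -V full          # raise the file system format version
   mmmigratefs gpfs1 --fastea    # convert existing extended attributes to the fast format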
arguments that were not valid.
6027-951 [E] Error on operationName to work
User response:
file fileName: errorString
Correct command line and reissue the command.
Explanation:
6027-944 -m value exceeds number of
An error occurred while attempting to do a (write-like)
failure groups for metadata.
operation on the named work file.
Explanation:
User response:
The mmchattr command received command line
Investigate the file and possibly reissue the command.
arguments that were not valid.
Check that there is sufficient free space in the file
User response: system.
Correct command line and reissue the command.
6027-953 Failed to get a handle for
6027-945 -r value exceeds number of failure fileset filesetName, snapshot
groups for data. snapshotName in file system
fileSystem. errorMessage.
Explanation:
The mmchattr command received command line Explanation:
arguments that were not valid. Failed to get a handle for a specific fileset snapshot in
the file system.
User response:
Correct command line and reissue the command. User response:
Correct the command line and reissue the command.
6027-946 Not a regular file or directory.
If the problem persists, contact the IBM Support
Explanation: Center.
An mmlsattr or mmchattr command error occurred.
6027-954 Failed to get the maximum inode
User response: number in the active file system.
Correct the problem and reissue the command. errorMessage.
6027-947 Stat failed: A file or directory in Explanation:
the path name does not exist. Failed to get the maximum inode number in the
current active file system.
Explanation:
A file or directory in the path name does not exist. User response:
Correct the command line and reissue the command.
User response:
If the problem persists, contact the IBM Support
Correct the problem and reissue the command.
Center.
6027-948 fileName: get clone attributes
[E:nnn] failed: errorString



6027-955 Failed to set the maximum User response:
allowed memory for the specified None.
fileSystem command. 6027-963 EDITOR environment variable not
Explanation: set
Failed to set the maximum allowed memory for the Explanation:
specified command.
Self-explanatory.
User response: User response:
Correct the command line and reissue the command. Set the EDITOR environment variable and reissue the
If the problem persists, contact the IBM Support command.
Center.
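Messages 6027-963 and 6027-964 (which follows) require the EDITOR environment variable to be set to an absolute path before mmeditacl can run. For example, assuming vi is installed at /usr/bin/vi and a hypothetical file path:

   export EDITOR=/usr/bin/vi
   mmeditacl /gpfs/gpfs1/somefile    # opens the file's ACL in the chosen editor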
6027-964 EDITOR environment variable
6027-956 Cannot allocate enough buffer to must be an absolute path name
record different items.
Explanation:
Explanation: Self-explanatory.
Cannot allocate enough buffer to record different
items which are used in the next phase. User response:
Set the EDITOR environment variable correctly and
User response: reissue the command.
Correct the command line and reissue the command.
If the problem persists, contact the system 6027-965 Cannot create temporary file
administrator. Explanation:
6027-957 Failed to get the root directory Self-explanatory.
inode of fileset filesetName User response:
Explanation: Contact your system administrator.
Failed to get the root directory inode of a fileset. 6027-966 Cannot access fileName
User response: Explanation:
Correct the command line and reissue the command. Self-explanatory.
If the problem persists, contact the IBM Support
Center. User response:
Verify file permissions.
6027-959 'fileName' is not a regular file.
6027-967 Should the modified ACL be
Explanation:
applied? (yes) or (no)
Only regular files are allowed to be clone parents.
Explanation:
User response: Self-explanatory.
This file is not a valid target for mmclone operations.
User response:
6027-960 cannot access 'fileName': Respond yes if you want to commit the changes, no
errorString. otherwise.
Explanation: 6027-971 Cannot find fileName
This message provides more details about a stat()
error. Explanation:
Self-explanatory.
User response:
Correct the problem and reissue the command. User response:
Verify the file name and permissions.
6027-961 Cannot execute command.
6027-972 name is not a directory (-d not
Explanation:
valid).
The mmeditacl command cannot invoke the
mmgetacl or mmputacl command. Explanation:
Self-explanatory.
User response:
Contact your system administrator. User response:
None, only directories are allowed to have default
6027-962 Failed to list fileset filesetName. ACLs.
Explanation: 6027-973 Cannot allocate number byte
Failed to list specific fileset.
buffer for ACL.



6027-974  Failure reading ACL (rc=number).
Explanation: An unexpected error was encountered by mmgetacl or mmeditacl.
User response: Examine the return code, contact the IBM Support Center if necessary.

6027-976  Failure writing ACL (rc=number).
Explanation: An unexpected error encountered by mmputacl or mmeditacl.
User response: Examine the return code, contact the IBM Support Center if necessary.

6027-977  Authorization failure
Explanation: An attempt was made to create or modify the ACL for a file that you do not own.
User response: Only the owner of a file or the root user can create or change the access control list for a file.

6027-978  Incorrect, duplicate, or missing access control entry detected.
Explanation: An access control entry in the ACL that was created had incorrect syntax, one of the required access control entries is missing, or the ACL contains duplicate access control entries.
User response: Correct the problem and reissue the command.

6027-979  Incorrect ACL entry: entry.
Explanation: Self-explanatory.
User response: Correct the problem and reissue the command.

6027-980  name is not a valid user name.
Explanation: Self-explanatory.
User response: Specify a valid user name and reissue the command.

6027-981  name is not a valid group name.
Explanation: Self-explanatory.
User response: Specify a valid group name and reissue the command.

6027-982  name is not a valid ACL entry type.
Explanation: Specify a valid ACL entry type and reissue the command.
User response: Correct the problem and reissue the command.

6027-983  name is not a valid permission set.
Explanation: Specify a valid permission set and reissue the command.
User response: Correct the problem and reissue the command.

6027-984 [E]  This command will run on a remote node
Explanation: mmputacl was invoked for a file that resides on a file system in a remote cluster, and UID remapping is enabled. To parse the user and group names from the ACL file correctly, the command will be run transparently on a node in the remote cluster.
User response: None. Informational message only.

6027-985  An error was encountered while deleting the ACL (rc=value).
Explanation: An unexpected error was encountered by tsdelacl.
User response: Examine the return code and contact the IBM Support Center, if necessary.

6027-986  Cannot open fileName.
Explanation: Self-explanatory.
User response: Verify the file name and permissions.

6027-987  name is not a valid special name.
Explanation: Produced by the mmputacl command when the NFS V4 'special' identifier is followed by an unknown special id string. name is one of the following: 'owner@', 'group@', 'everyone@'.
User response: Specify a valid NFS V4 special name and reissue the command.

6027-988  type is not a valid NFS V4 type.
Explanation: Produced by the mmputacl command when the type field in an ACL entry is not one of the supported NFS Version 4 type values. type is one of the following: 'allow' or 'deny'.
User response: Specify a valid NFS V4 type and reissue the command.
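Most of the failures above are hit while reading or writing an ACL file. A minimal round-trip sketch, assuming NFS V4 ACLs and an illustrative file path:

   # Write the current ACL to a working file (-k nfs4 selects the NFS V4 format).
   mmgetacl -k nfs4 -o /tmp/file1.acl /gpfs/fs0/data/file1
   # Edit /tmp/file1.acl, then apply it back.
   # Only the file owner or the root user may change the ACL (see 6027-977).
   mmputacl -i /tmp/file1.acl /gpfs/fs0/data/file1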


6027-989  name is not a valid NFS V4 flag.
Explanation: A flag specified in an ACL entry is not one of the supported values, or is not valid for the type of object (inherit flags are valid for directories only). Valid values are FileInherit, DirInherit, and InheritOnly.
User response: Specify a valid NFS V4 option and reissue the command.

6027-990  Missing permissions (value found, value are required).
Explanation: The permissions listed are less than the number required.
User response: Add the missing permissions and reissue the command.

6027-991  Combining FileInherit and DirInherit makes the mask ambiguous.
Explanation: Produced by the mmputacl command when WRITE/CREATE is specified without MKDIR (or the other way around), and both the FILE_INHERIT and DIR_INHERIT flags are specified.
User response: Make separate FileInherit and DirInherit entries and reissue the command.

6027-992  Subdirectory name already exists. Unable to create snapshot.
Explanation: tsbackup was unable to create a snapshot because the snapshot subdirectory already exists. This condition sometimes is caused by issuing an IBM Storage Protect restore operation without specifying a different subdirectory as the target of the restore.
User response: Remove or rename the existing subdirectory and then retry the command.

6027-993  Keyword aclType is incorrect. Valid values are: 'posix', 'nfs4', 'native'.
Explanation: One of the mm*acl commands specified an incorrect value with the -k option.
User response: Correct the aclType value and reissue the command.

6027-994  ACL permissions cannot be denied to the file owner.
Explanation: The mmputacl command found that the READ_ACL, WRITE_ACL, READ_ATTR, or WRITE_ATTR permissions are explicitly being denied to the file owner. This is not permitted, in order to prevent the file being left with an ACL that cannot be modified.
User response: Do not select the READ_ACL, WRITE_ACL, READ_ATTR, or WRITE_ATTR permissions on deny ACL entries for the OWNER.

6027-995  This command will run on a remote node, nodeName.
Explanation: The mmputacl command was invoked for a file that resides on a file system in a remote cluster, and UID remapping is enabled. To parse the user and group names from the ACL file correctly, the command will be run transparently on a node in the remote cluster.
User response: None. Informational message only.

6027-996 [E:nnn]  Error reading policy text from: fileName
Explanation: An error occurred while attempting to open or read the specified policy file. The policy file may be missing or inaccessible.
User response: Read all of the related error messages and try to correct the problem.

6027-997 [W]  Attention: RULE 'ruleName' attempts to redefine EXTERNAL POOLorLIST literal 'poolName', ignored.
Explanation: Execution continues as if the specified rule was not present.
User response: Correct or remove the policy rule.

6027-998 [E]  Error in FLR/PDR serving for client clientHostNameAndPortNumber: FLRs=numOfFileListRecords PDRs=numOfPolicyDecisionResponses pdrs=numOfPolicyDecisionResponseRecords
Explanation: A protocol error has been detected among cooperating mmapplypolicy processes.
User response: Reissue the command. If the problem persists, contact the IBM Support Center.
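For the policy-related messages (6027-996 through 6027-998), it can be useful to validate a rules file before applying or installing it. A hedged sketch, assuming a rules file /tmp/policy.rules and a file system named fs0:

   # Parse and evaluate the rules without migrating or deleting anything.
   mmapplypolicy fs0 -P /tmp/policy.rules -I test
   # Check placement rules for syntax errors without installing them.
   mmchpolicy fs0 /tmp/policy.rules -I test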


6027-999 [E]  Authentication failed: myNumericNetworkAddress with partnersNumericNetworkAddress (code=codeIndicatingProtocolStepSequence rc=errnoStyleErrorCode)
Explanation: Two processes at the specified network addresses failed to authenticate. The cooperating processes should be on the same network; they should not be separated by a firewall.
User response: Correct the configuration and try the operation again. If the problem persists, contact the IBM Support Center.

6027-1004  Incorrect [nodelist] format in file: nodeListLine
Explanation: A [nodelist] line in the input stream is not a comma-separated list of nodes.
User response: Fix the format of the [nodelist] line in the mmfs.cfg input file. This is usually the NodeFile specified on the mmchconfig command. If no user-specified [nodelist] lines are in error, contact the IBM Support Center. If user-specified [nodelist] lines are in error, correct these lines.

6027-1005  Common is not sole item on [] line number.
Explanation: A [nodelist] line in the input stream contains common plus any other names.
User response: Fix the format of the [nodelist] line in the mmfs.cfg input file. This is usually the NodeFile specified on the mmchconfig command. If no user-specified [nodelist] lines are in error, contact the IBM Support Center. If user-specified [nodelist] lines are in error, correct these lines.

6027-1006  Incorrect custom [ ] line number.
Explanation: A [nodelist] line in the input stream is not of the format: [nodelist]. This covers syntax errors not covered by messages 6027-1004 and 6027-1005.
User response: Fix the format of the list of nodes in the mmfs.cfg input file. This is usually the NodeFile specified on the mmchconfig command. If no user-specified lines are in error, contact the IBM Support Center. If user-specified lines are in error, correct these lines.

6027-1007  attribute found in common multiple times: attribute.
Explanation: The attribute specified on the command line is in the main input stream multiple times. This is occasionally legal, such as with the trace attribute. These attributes, however, are not meant to be repaired by mmfixcfg.
User response: Fix the configuration file (mmfs.cfg or mmfscfg1 in the SDR). All attributes modified by GPFS configuration commands may appear only once in common sections of the configuration file.

6027-1008  Attribute found in custom multiple times: attribute.
Explanation: The attribute specified on the command line is in a custom section multiple times. This is occasionally legal. These attributes are not meant to be repaired by mmfixcfg.
User response: Fix the configuration file (mmfs.cfg or mmfscfg1 in the SDR). All attributes modified by GPFS configuration commands may appear only once in custom sections of the configuration file.

6027-1022  Missing mandatory arguments on command line.
Explanation: Some, but not enough, arguments were specified to the mmcrfsc command.
User response: Specify all arguments as per the usage statement that follows.

6027-1023  invalid maxBlockSize parameter: value
Explanation: The first argument to the mmcrfsc command is maximum block size and should be greater than 0.
User response: The maximum block size should be greater than 0. The mmcrfs command should never call the mmcrfsc command without a valid maximum block size argument. Contact the IBM Support Center.
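Messages 6027-1004 through 6027-1008 refer to the mmfs.cfg-style configuration input. In practice, attributes are changed with mmchconfig rather than by editing that file, which keeps each attribute in a single common or node-override section. A short sketch; the attribute and values are chosen only for illustration:

   # Show the current configuration, including any node-override sections.
   mmlsconfig
   # Change an attribute cluster-wide ...
   mmchconfig pagepool=4G
   # ... or only on selected nodes, which creates a node-override entry.
   mmchconfig pagepool=8G -N nsdserver1,nsdserver2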


6027-1028  Incorrect value for -name flag.
Explanation: An incorrect argument was specified with an option that requires one of a limited number of allowable options (for example, -s or any of the yes | no options).
User response: Use one of the valid values for the specified option.

6027-1029  Incorrect characters in integer field for -name option.
Explanation: An incorrect character was specified with the indicated option.
User response: Use a valid integer for the indicated option.

6027-1030  Value below minimum for -optionLetter option. Valid range is from value to value
Explanation: The value specified with an option was below the minimum.
User response: Use an integer in the valid range for the indicated option.

6027-1031  Value above maximum for option -optionLetter. Valid range is from value to value.
Explanation: The value specified with an option was above the maximum.
User response: Use an integer in the valid range for the indicated option.

6027-1032  Incorrect option optionName.
Explanation: An unknown option was specified.
User response: Use only the options shown in the syntax.

6027-1033  Option optionName specified twice.
Explanation: An option was specified more than once on the command line.
User response: Use options only once.

6027-1034  Missing argument after optionName option.
Explanation: An option was not followed by an argument.
User response: All options need an argument. Specify one.

6027-1035  Option -optionName is mandatory.
Explanation: A mandatory input option was not specified.
User response: Specify all mandatory options.

6027-1036  Option expected at string.
Explanation: Something other than an expected option was encountered on the latter portion of the command line.
User response: Follow the syntax shown. Options may not have multiple values. Extra arguments are not allowed.

6027-1038  IndirectSize must be <= BlockSize and must be a multiple of LogicalSectorSize (512).
Explanation: The IndirectSize specified was not a multiple of 512 or the IndirectSize specified was larger than BlockSize.
User response: Use valid values for IndirectSize and BlockSize.

6027-1039  InodeSize must be a multiple of LocalSectorSize (512).
Explanation: The specified InodeSize was not a multiple of 512.
User response: Use a valid value for InodeSize.

6027-1040  InodeSize must be less than or equal to Blocksize.
Explanation: The specified InodeSize was not less than or equal to Blocksize.
User response: Use a valid value for InodeSize.

6027-1042  DefaultMetadataReplicas must be less than or equal to MaxMetadataReplicas.
Explanation: The specified DefaultMetadataReplicas was greater than MaxMetadataReplicas.
User response: Specify a valid value for DefaultMetadataReplicas.
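The size and replication constraints described in 6027-1038 through 6027-1042 map onto the corresponding mmcrfs options. A hedged sketch with illustrative values and stanza file name:

   # Inode size must divide evenly into 512 and not exceed the block size;
   # default replication (-m, -r) must not exceed the maximums (-M, -R).
   mmcrfs fs1 -F /tmp/nsd.stanza -B 4M -i 4096 -m 2 -M 2 -r 2 -R 2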


6027-1043  DefaultDataReplicas must be less than or equal MaxDataReplicas.
Explanation: The specified DefaultDataReplicas was greater than MaxDataReplicas.
User response: Specify a valid value for DefaultDataReplicas.

6027-1055  LogicalSectorSize must be a multiple of 512
Explanation: The specified LogicalSectorSize was not a multiple of 512.
User response: Specify a valid LogicalSectorSize.

6027-1056  Blocksize must be a multiple of LogicalSectorSize × 32
Explanation: The specified Blocksize was not a multiple of LogicalSectorSize × 32.
User response: Specify a valid value for Blocksize.

6027-1057  InodeSize must be less than or equal to Blocksize.
Explanation: The specified InodeSize was not less than or equal to Blocksize.
User response: Specify a valid value for InodeSize.

6027-1059  Mode must be M or S: mode
Explanation: The first argument provided in the mmcrfsc command was not M or S.
User response: The mmcrfsc command should not be called by a user. If any other command produces this error, contact the IBM Support Center.

6027-1084  The specified block size (valueK) exceeds the maximum allowed block size currently in effect (valueK). Either specify a smaller value for the -B parameter, or increase the maximum block size by issuing: mmchconfig maxblocksize=valueK and restart the GPFS daemon.
Explanation: The specified value for block size was greater than the value of the maxblocksize configuration parameter.
User response: Specify a valid value or increase the value of the allowed block size by specifying a larger value on the maxblocksize parameter of the mmchconfig command.

6027-1113  Incorrect option: option.
Explanation: The specified command option is not valid.
User response: Specify a valid option and reissue the command.

6027-1119  Obsolete option: option.
Explanation: A command received an option that is not valid any more.
User response: Correct the command line and reissue the command.

6027-1120  Interrupt received: No changes made.
Explanation: A GPFS administration command (mm…) received an interrupt before committing any changes.
User response: None. Informational message only.

6027-1123  Disk name must be specified in disk descriptor.
Explanation: The disk name positional parameter (the first field) in a disk descriptor was empty. The bad disk descriptor is displayed following this message.
User response: Correct the input and rerun the command.

6027-1124  Disk usage must be dataOnly, metadataOnly, descOnly, or dataAndMetadata.
Explanation: The disk usage parameter has a value that is not valid.
User response: Correct the input and reissue the command.

6027-1132  Interrupt received: changes not propagated.
Explanation: An interrupt was received after changes were committed but before the changes could be propagated to all the nodes.
User response: All changes will eventually propagate as nodes recycle or other GPFS administration commands are issued. Changes can be activated now by manually restarting the GPFS daemons.
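Message 6027-1084 already names the corrective action. The sequence looks roughly like this; the 8M value is only an example, and, as the message notes, the GPFS daemon may need to be restarted before the new limit takes effect:

   # Check the current limit, raise it, then retry the file system creation.
   mmlsconfig maxblocksize
   mmchconfig maxblocksize=8M
   mmcrfs fs1 -F /tmp/nsd.stanza -B 8M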


6027-1133  Interrupt received. Only a subset of the parameters were changed.
Explanation: An interrupt was received in mmchfs before all of the requested changes could be completed.
User response: Use mmlsfs to see what the currently active settings are. Reissue the command if you want to change additional parameters.

6027-1135  Restriping may not have finished.
Explanation: An interrupt occurred during restriping.
User response: Restart the restripe. Verify that the file system was not damaged by running the mmfsck command.

6027-1136  option option specified twice.
Explanation: An option was specified multiple times on a command line.
User response: Correct the error on the command line and reissue the command.

6027-1137  option value must be yes or no.
Explanation: A yes or no option was used with something other than yes or no.
User response: Correct the error on the command line and reissue the command.

6027-1138  Incorrect extra argument: argument
Explanation: Non-option arguments followed the mandatory arguments.
User response: Unlike most POSIX commands, the main arguments come first, followed by the optional arguments. Correct the error and reissue the command.

6027-1140  Incorrect integer for option: number.
Explanation: An option requiring an integer argument was followed by something that cannot be parsed as an integer.
User response: Specify an integer with the indicated option.

6027-1141  No disk descriptor file specified.
Explanation: An -F flag was not followed by the path name of a disk descriptor file.
User response: Specify a valid disk descriptor file.

6027-1142  File fileName already exists.
Explanation: The specified file already exists.
User response: Rename the file or specify a different file name and reissue the command.

6027-1143  Cannot open fileName.
Explanation: A file could not be opened.
User response: Verify that the specified file exists and that you have the proper authorizations.

6027-1144  Incompatible cluster types. You cannot move file systems that were created by GPFS cluster type sourceCluster into GPFS cluster type targetCluster.
Explanation: The source and target cluster types are incompatible.
User response: Contact the IBM Support Center for assistance.

6027-1145  parameter must be greater than 0: value
Explanation: A negative value had been specified for the named parameter, which requires a positive value.
User response: Correct the input and reissue the command.

6027-1147  Error converting diskName into an NSD.
Explanation: Error encountered while converting a disk into an NSD.
User response: Check the preceding messages for more information.

6027-1148  File system fileSystem already exists in the cluster. Use mmchfs -W to assign a new device name for the existing file system.
Explanation: You are trying to import a file system into the cluster but there is already a file system with the same name in the cluster.
User response:


Remove or rename the file system with the conflicting Explanation:
name. The user specified a node that is not valid.
6027-1149 fileSystem is defined to have User response:
mount point mountpoint. There is Specify a valid node.
already such a mount point in the
6027-1155 The NSD servers for the following
cluster. Use mmchfs -T to assign
disks from file system fileSystem
a new mount point to the existing
were reset or not defined: diskList
file system.
Explanation:
Explanation:
Either the mmimportfs command encountered disks
The cluster into which the file system is being
with no NSD servers, or was forced to reset the NSD
imported already contains a file system with the same
server information for one or more disks.
mount point as the mount point of the file system
being imported. User response:
After the mmimportfs command finishes, use the
User response:
mmchnsd command to assign NSD server nodes to the
Use the -T option of the mmchfs command to change
disks as needed.
the mount point of the file system that is already in the
cluster and then rerun the mmimportfs command. 6027-1156 The NSD servers for the following
free disks were reset or not
6027-1150 Error encountered while importing
defined: diskList
disk diskName.
Explanation:
Explanation:
Either the mmimportfs command encountered disks
The mmimportfs command encountered problems
with no NSD servers, or was forced to reset the NSD
while processing the disk.
server information for one or more disks.
User response:
User response:
Check the preceding messages for more information.
After the mmimportfs command finishes, use the
6027-1151 Disk diskName already exists in mmchnsd command to assign NSD server nodes to the
the cluster. disks as needed.
Explanation: 6027-1157 Use the mmchnsd command to
You are trying to import a file system that has a disk assign NSD servers as needed.
with the same name as some disk from a file system
Explanation:
that is already in the cluster.
Either the mmimportfs command encountered disks
User response: with no NSD servers, or was forced to reset the NSD
Remove or replace the disk with the conflicting name. server information for one or more disks. Check the
preceding messages for detailed information.
6027-1152 Block size must be 64K, 128K,
256K, 512K, 1M, 2M, 4M, 8M or User response:
16M. After the mmimportfs command finishes, use the
mmchnsd command to assign NSD server nodes to the
Explanation:
disks as needed.
The specified block size value is not valid.
6027-1159 The following file systems were
User response:
not imported: fileSystemList
Specify a valid block size value.
Explanation:
6027-1153 At least one node in the cluster
The mmimportfs command was not able to import
must be defined as a quorum
the specified file systems. Check the preceding
node.
messages for error information.
Explanation:
User response:
All nodes were explicitly designated or allowed to
Correct the problems and reissue the mmimportfs
default to be nonquorum.
command.
User response:
6027-1160 The drive letters for the following
Specify which of the nodes should be considered
file systems have been reset:
quorum nodes and reissue the command.
fileSystemList.
6027-1154 Incorrect node node specified for
Explanation:
command.

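Several of the mmimportfs messages above and below point at follow-up commands to resolve naming conflicts and reset NSD server lists. A hedged sketch of the typical cleanup, with illustrative names:

   # Give the pre-existing file system a new device name and mount point
   # so that the imported file system no longer conflicts (6027-1148, 6027-1149).
   mmchfs fs1 -W fs1old
   mmchfs fs1old -T /gpfs/fs1old
   # Re-assign NSD server nodes that mmimportfs reset (6027-1155 to 6027-1157).
   mmchnsd "nsd01:nsdserver1,nsdserver2"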


The drive letters associated with the specified file After the mmimportfs command finishes, use the
systems are already in use by existing file systems and mmchconfig command to enable Persistent Reserve
have been reset. in the cluster as needed.
User response: 6027-1166 The PR attributes for the following
After the mmimportfs command finishes, use the -t free disks were reset or not yet
option of the mmchfs command to assign new drive established: diskList
letters as needed.
Explanation:
6027-1161 Use the dash character (-) The mmimportfs command disabled the Persistent
to separate multiple node Reserve attribute for one or more disks.
designations.
User response:
Explanation: After the mmimportfs command finishes, use the
A command detected an incorrect character used as a mmchconfig command to enable Persistent Reserve
separator in a list of node designations. in the cluster as needed.
User response: 6027-1167 Use mmchconfig to enable
Correct the command line and reissue the command. Persistent Reserve in the cluster
as needed.
6027-1162 Use the semicolon character (;) to
separate the disk names. Explanation:
The mmimportfs command disabled the Persistent
Explanation:
Reserve attribute for one or more disks.
A command detected an incorrect character used as a
separator in a list of disk names. User response:
After the mmimportfs command finishes, use the
User response:
mmchconfig command to enable Persistent Reserve
Correct the command line and reissue the command.
in the cluster as needed.
6027-1163 GPFS is still active on nodeName.
6027-1168 Inode size must be 512, 1K or 4K.
Explanation:
Explanation:
The GPFS daemon was discovered to be active on the
The specified inode size is not valid.
specified node during an operation that requires the
daemon to be stopped. User response:
Specify a valid inode size.
User response:
Stop the daemon on the specified node and rerun the 6027-1169 attribute must be value.
command.
Explanation:
6027-1164 Use mmchfs -t to assign drive The specified value of the given attribute is not valid.
letters as needed.
User response:
Explanation: Specify a valid value.
The mmimportfs command was forced to reset
6027-1178 parameter must be from value to
the drive letters associated with one or more file
value: valueSpecified
systems. Check the preceding messages for detailed
information. Explanation:
A parameter value specified was out of range.
User response:
After the mmimportfs command finishes, use the -t User response:
option of the mmchfs command to assign new drive Keep the specified value within the range shown.
letters as needed.
6027-1188 Duplicate disk specified: disk
6027-1165 The PR attributes for the following
Explanation:
disks from file system fileSystem
A disk was specified more than once on the command
were reset or not yet established:
line.
diskList
User response:
Explanation:
Specify each disk only once.
The mmimportfs command disabled the Persistent
Reserve attribute for one or more disks. 6027-1189 You cannot delete all the disks.
User response: Explanation:

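Messages 6027-1165 through 6027-1167 refer to re-enabling Persistent Reserve after an import. A hedged sketch; in practice this is done with GPFS stopped on the affected NSD servers, and only on storage that supports SCSI-3 Persistent Reserve:

   # Re-enable Persistent Reserve cluster-wide, then verify the NSD and PR state.
   mmchconfig usePersistentReserve=yes
   mmlsnsd -X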


The number of disks to delete is greater than or equal If any disks are in a non-ready state, steps should be
to the number of disks in the file system. taken to bring these disks into the ready state, or to
remove them from the file system. This can be done
User response:
by mounting the file system, or by using the mmchdisk
Delete only some of the disks. If you want to delete
command for a mounted or unmounted file system.
them all, use the mmdelfs command.
When maintenance is complete or the failure has been
6027-1197 parameter must be greater than repaired, use the mmchdisk command with the start
value: value. option. If the failure cannot be repaired without loss of
data, you can use the mmdeldisk command to delete
Explanation:
the disks.
An incorrect value was specified for the named
parameter. 6027-1204 command failed.
User response: Explanation:
Correct the input and reissue the command. An internal command failed. This is usually a call to the
GPFS daemon.
6027-1200 tscrfs failed. Cannot create device
User response:
Explanation:
Check the error message from the command that
The internal tscrfs command failed.
failed.
User response:
6027-1205 Failed to connect to remote cluster
Check the error message from the command that
clusterName.
failed.
Explanation:
6027-1201 Disk diskName does not belong to
Attempt to establish a connection to the specified
file system fileSystem.
cluster was not successful. This can be caused by a
Explanation: number of reasons: GPFS is down on all of the contact
The specified disk was not found to be part of the cited nodes, the contact node list is obsolete, the owner of
file system. the remote cluster revoked authorization, and so forth.
User response: User response:
If the disk and file system were specified as part of a If the error persists, contact the administrator of
GPFS command, reissue the command with a disk that the remote cluster and verify that the contact node
belongs to the specified file system. information is current and that the authorization key
files are current as well.
6027-1202 Active disks are missing from the
GPFS configuration data. 6027-1206 File system fileSystem belongs
Explanation: to cluster clusterName. Command
A GPFS disk command found that one or more active is not allowed for remote file
disks known to the GPFS daemon are not recorded in systems.
the GPFS configuration data. A list of the missing disks Explanation:
follows. The specified file system is not local to the cluster, but
belongs to the cited remote cluster.
User response:
Contact the IBM Support Center. User response:
Choose a local file system, or issue the command on a
6027-1203 Attention: File system fileSystem
node in the remote cluster.
may have some disks that are
in a non-ready state. Issue the 6027-1207 There is already an existing file
command: mmcommon recoverfs system using value.
fileSystem
Explanation:
Explanation: The mount point or device name specified matches
The specified file system may have some disks that are that of an existing file system. The device name and
in a non-ready state. mount point must be unique within a GPFS cluster.
User response:
User response Choose an unused name or path.
Run mmcommon recoverfs fileSystem to ensure that
the GPFS configuration data for the file system is 6027-1208 File system fileSystem not found in
current, and then display the states of the disks in the cluster clusterName.
file system using the mmlsdisk command.

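For 6027-1202 and 6027-1203, the referenced recovery sequence looks roughly like this, assuming a file system named fs1:

   # Bring the configuration data back in line with the on-disk state,
   # list disks that are not in the ready state, and start them.
   mmcommon recoverfs fs1
   mmlsdisk fs1 -e
   mmchdisk fs1 start -a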


Explanation: Examine the error code and other messages to
The specified file system does not belong to the cited determine the reason for the failure. Correct the
remote cluster. The local information about the file problem and reissue the command.
system is not current. The file system may have been
6027-1214 Unable to enable Persistent
deleted, renamed, or moved to a different cluster.
Reserve on the following disks:
User response: diskList
Contact the administrator of the remote cluster that
Explanation:
owns the file system and verify the accuracy of
The command was unable to set up all of the disks to
the local information. Use the mmremotefs show
use Persistent Reserve.
command to display the local information about the
file system. Use the mmremotefs update command User response:
to make the necessary changes. Examine the disks and the additional error information
to determine if the disks should have supported
6027-1209 GPFS is down on this node.
Persistent Reserve. Correct the problem and reissue
Explanation: the command.
GPFS is not running on this node.
6027-1215 Unable to reset the Persistent
User response: Reserve attributes on one or more
Ensure that GPFS is running and reissue the command. disks on the following nodes:
nodeList
6027-1210 GPFS is not ready to handle
commands yet. Explanation:
The command could not reset Persistent Reserve on at
Explanation:
least one disk on the specified nodes.
GPFS is in the process of initializing or waiting for
quorum to be reached. User response:
Examine the additional error information to determine
User response:
whether nodes were down or if there was a disk error.
Reissue the command.
Correct the problems and reissue the command.
6027-1211 fileSystem refers to file system
6027-1216 File fileName contains additional
fileSystem in cluster clusterName.
error information.
Explanation:
Explanation:
Informational message.
The command generated a file containing additional
User response: error information.
None.
User response:
6027-1212 File system fileSystem does not Examine the additional error information.
belong to cluster clusterName.
6027-1217 A disk descriptor contains an
Explanation: incorrect separator character.
The specified file system refers to a file system that is
Explanation:
remote to the cited cluster. Indirect remote file system
A command detected an incorrect character used as a
access is not allowed.
separator in a disk descriptor.
User response:
User response:
Contact the administrator of the remote cluster that
Correct the disk descriptor and reissue the command.
owns the file system and verify the accuracy of
the local information. Use the mmremotefs show 6027-1218 Node nodeName does not have a
command to display the local information about the GPFS server license designation.
file system. Use the mmremotefs update command
Explanation:
to make the necessary changes.
The function that you are assigning to the node
6027-1213 command failed. Error code requires the node to have a GPFS server license.
errorCode.
User response:
Explanation: Use the mmchlicense command to assign a valid
An internal command failed. This is usually a call to the GPFS license to the node or specify a different node.
GPFS daemon.
6027-1219 NSD discovery on node nodeName
User response: failed with return code value.



Explanation: User response:
The NSD discovery process on the specified node Reissue the command with a valid file system.
failed with the specified return code.
6027-1225 Explicit drive letters are supported
User response: only in a Windows environment.
Determine why the node cannot access the specified Specify a mount point or allow the
NSDs. Correct the problem and reissue the command. default settings to take effect.
6027-1220 Node nodeName cannot be used Explanation:
as an NSD server for Persistent An explicit drive letter was specified on the mmmount
Reserve disk diskName because it command but the target node does not run the
is not an AIX node. Windows operating system.
Explanation: User response:
The node shown was specified as an NSD server for Specify a mount point or allow the default settings for
diskName, but the node does not support Persistent the file system to take effect.
Reserve.
6027-1226 Explicit mount points are
User response: not supported in a Windows
Specify a node that supports Persistent Reserve as an environment. Specify a drive letter
NSD server. or allow the default settings to
take effect.
6027-1221 The number of NSD servers
exceeds the maximum (value) Explanation:
allowed. An explicit mount point was specified on the
mmmount command but the target node runs the
Explanation:
Windows operating system.
The number of NSD servers in the disk descriptor
exceeds the maximum allowed. User response:
Specify a drive letter or allow the default settings for
User response:
the file system to take effect.
Change the disk descriptor to specify no more NSD
servers than the maximum allowed. 6027-1227 The main GPFS cluster
configuration file is locked.
6027-1222 Cannot assign a minor number
Retrying …
for file system fileSystem (major
number deviceMajorNumber). Explanation:
Another GPFS administration command has locked the
Explanation:
cluster configuration file. The current process will try
The command was not able to allocate a minor
to obtain the lock a few times before giving up.
number for the new file system.
User response:
User response:
None. Informational message only.
Delete unneeded /dev entries for the specified major
number and reissue the command. 6027-1228 Lock creation successful.
6027-1223 ipAddress cannot be used for NFS Explanation:
serving; it is used by the GPFS The holder of the lock has released it and the current
daemon. process was able to obtain it.
Explanation: User response:
The IP address shown has been specified for use by None. Informational message only. The command will
the GPFS daemon. The same IP address cannot be now continue.
used for NFS serving because it cannot be failed over.
6027-1229 Timed out waiting for lock. Try
User response: again later.
Specify a different IP address for NFS use and reissue
Explanation:
the command.
Another GPFS administration command kept the main
6027-1224 There is no file system with drive GPFS cluster configuration file locked for over a
letter driveLetter. minute.
Explanation: User response:
No file system in the GPFS cluster has the specified
drive letter.

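When NSD discovery fails as in 6027-1219, a common next step after fixing device access is to rerun discovery and confirm which servers can see each disk. A minimal sketch:

   # Rerun NSD discovery on all nodes, then show the device path found on each server.
   mmnsddiscover -a -N all
   mmlsnsd -m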


Try again later. If no other GPFS administration Explanation:
command is presently running, see “GPFS cluster The cited kernel extension does not exist.
configuration data file issues” on page 354.
User response:
6027-1230 diskName is a tiebreaker disk and Create the needed kernel extension by compiling
cannot be deleted. a custom mmfslinux module for your kernel (see
steps in /usr/lpp/mmfs/src/README), or copy
Explanation:
the binaries from another node with the identical
A request was made to GPFS to delete a node quorum
environment.
tiebreaker disk.
6027-1236 Unable to verify kernel/module
User response:
configuration.
Specify a different disk for deletion.
Explanation:
6027-1231 GPFS detected more than eight
The mmfslinux kernel extension does not exist.
quorum nodes while node quorum
with tiebreaker disks is in use. User response:
Create the needed kernel extension by compiling
Explanation:
a custom mmfslinux module for your kernel (see
A GPFS command detected more than eight quorum
steps in /usr/lpp/mmfs/src/README), or copy
nodes, but this is not allowed while node quorum with
the binaries from another node with the identical
tiebreaker disks is in use.
environment.
User response:
6027-1237 The GPFS daemon is still running;
Reduce the number of quorum nodes to a maximum of
use the mmshutdown command.
eight, or use the normal node quorum algorithm.
Explanation:
6027-1232 GPFS failed to initialize the
An attempt was made to unload the GPFS kernel
tiebreaker disks.
extensions while the GPFS daemon was still running.
Explanation:
User response:
A GPFS command unsuccessfully attempted to
Use the mmshutdown command to shut down the
initialize the node quorum tiebreaker disks.
daemon.
User response:
6027-1238 Module fileName is still in use.
Examine prior messages to determine why GPFS was
Unmount all GPFS file systems
unable to initialize the tiebreaker disks and correct the
and issue the command: mmfsadm
problem. After that, reissue the command.
cleanup
6027-1233 Incorrect keyword: value.
Explanation:
Explanation: An attempt was made to unload the cited module
A command received a keyword that is not valid. while it was still in use.
User response: User response:
Correct the command line and reissue the command. Unmount all GPFS file systems and issue the
command mmfsadm cleanup. If this does not solve
6027-1234 Adding node node to the cluster
the problem, reboot the machine.
will exceed the quorum node limit.
6027-1239 Error unloading module
Explanation:
moduleName.
An attempt to add the cited node to the cluster
resulted in the quorum node limit being exceeded. Explanation:
GPFS was unable to unload the cited module.
User response:
Change the command invocation to not exceed the User response:
node quorum limit, and reissue the command. Unmount all GPFS file systems and issue the
command mmfsadm cleanup. If this does not solve
6027-1235 The fileName kernel extension
the problem, reboot the machine.
does not exist. Use the
mmbuildgpl command to create 6027-1240 Module fileName is already loaded.
the needed kernel extension for
Explanation:
your kernel or copy the binaries
An attempt was made to load the cited module, but it
from another node with the
was already loaded.
identical environment.
User response:

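Messages 6027-1235 through 6027-1241 typically appear after a kernel update, when the portability layer no longer matches the running kernel. A hedged sketch of the rebuild cycle on the affected node:

   # Stop GPFS, rebuild the portability layer for the running kernel, restart.
   mmshutdown
   /usr/lpp/mmfs/bin/mmbuildgpl
   mmstartup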


None. Informational message only. After the cluster is created, use the specified
command to establish the desired configuration
6027-1241 diskName was not found in /proc/
parameter.
partitions.
6027-1246 configParameter is an obsolete
Explanation:
parameter. Line in error:
The cited disk was not found in /proc/partitions.
configLine. The line is ignored;
User response: processing continues.
Take steps to cause the disk to appear in /proc/
Explanation:
partitions, and then reissue the command.
The specified parameter is not used by GPFS anymore.
6027-1242 GPFS is waiting for
User response:
requiredCondition
None. Informational message only.
Explanation:
6027-1247 configParameter cannot appear
GPFS is unable to come up immediately due to the
in a node-override section. Line
stated required condition not being satisfied yet.
in error: configLine. The line is
ignored; processing continues.
User response
Explanation:
This is an informational message. If the required
The specified parameter must have the same value
condition is not met, this message appears
across all nodes in the cluster.
periodically. Check whether the specified command
is still running. If the command is not running User response:
on any node in the cluster, manually clear the None. Informational message only.
mmRunningCommand lock by issuing the following
6027-1248 Mount point cannot be a relative
command:
path name: path
mmcommon freelocks mmRunningCommand
Explanation:
The mount point does not begin with /.
6027-1243 command: Processing user
configuration file fileName User response:
Specify the absolute path name for the mount point.
Explanation:
Progress information for the mmcrcluster command. 6027-1249 operand cannot be a relative path
name: path.
User response:
None. Informational message only. Explanation:
The specified path name does not begin with '/'.
6027-1244 configParameter is set by the
mmcrcluster processing. Line in User response:
error: configLine. The line will be Specify the absolute path name.
ignored; processing continues.
6027-1250 Key file is not valid.
Explanation:
Explanation:
The specified parameter is set by the mmcrcluster
While attempting to establish a connection to another
command and cannot be overridden by the user.
node, GPFS detected that the format of the public key
User response: file is not valid.
None. Informational message only.
User response:
6027-1245 configParameter must be set with Use the mmremotecluster command to specify the
the command command. Line correct public key.
in error: configLine. The line is
6027-1251 Key file mismatch.
ignored; processing continues.
Explanation:
Explanation:
While attempting to establish a connection to another
The specified parameter has additional dependencies
node, GPFS detected that the public key file does not
and cannot be specified prior to the completion of the
match the public key file of the cluster to which the file
mmcrcluster command.
system belongs.
User response:
User response:
Use the mmremotecluster command to specify the
correct public key.



6027-1252 Node nodeName already belongs Run the specified command to commit the current
to the GPFS cluster. public/private key pair.
Explanation: 6027-1258 You must establish a cipher list
A GPFS command found that a node to be added to a first. Run: command.
GPFS cluster already belongs to the cluster. Explanation:
User response: You are attempting to commit an SSL private key but a
Specify a node that does not already belong to the cipher list has not been established yet.
GPFS cluster. User response:
6027-1253 Incorrect value for option option. Run the specified command to specify a cipher list.
Explanation: 6027-1259 command not found. Ensure
The provided value for the specified option is not valid. the OpenSSL code is properly
installed.
User response:
Correct the error and reissue the command. Explanation:
The specified command was not found.
6027-1254 Warning: Not all nodes have
proper GPFS license designations. User response:
Use the mmchlicense command to Ensure the OpenSSL code is properly installed and
designate licenses as needed. reissue the command.
Explanation: 6027-1260 File fileName does not contain any
Not all nodes in the cluster have valid license typeOfStanza stanzas.
designations.
Explanation:
User response: The input file should contain at least one specified
Use mmlslicense to see the current license stanza.
designations. Use mmchlicense to assign valid GPFS
User response:
licenses to all nodes as needed.
Correct the input file and reissue the command.
6027-1255 There is nothing to commit. You 6027-1261 descriptorField must be specified
must first run: command. in descriptorType descriptor.
Explanation: Explanation:
You are attempting to commit an SSL private key but A required field of the descriptor was empty.
such a key has not been generated yet. The incorrect descriptor is displayed following this
User response: message.
Run the specified command to generate the public/ User response:
private key pair. Correct the input and reissue the command.
6027-1256 The current authentication files 6027-1262 Unable to obtain the GPFS
are already committed. configuration file lock. Retrying ...
Explanation: Explanation:
You are attempting to commit public/private key files A command requires the lock for the GPFS system
that were previously generated with the mmauth data but was not able to obtain it.
command. The files have already been committed.
User response:
User response: None. Informational message only.
None. Informational message.
6027-1263 Unable to obtain the GPFS
6027-1257 There are uncommitted configuration file lock.
authentication files. You must first
run: command. Explanation:
A command requires the lock for the GPFS system
Explanation: data but was not able to obtain it.
You are attempting to generate new public/private
key files but previously generated files have not been User response:
committed yet. Check the preceding messages, if any. Follow the
procedure in “GPFS cluster configuration data file
User response: issues” on page 354, and then reissue the command.

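The authentication-file messages above (6027-1255 through 6027-1259) follow the usual mmauth sequence: generate a key pair, establish a cipher list, then commit. A hedged sketch run on the cluster that owns the file systems:

   # Generate a new key pair, require authenticated connections, then commit.
   mmauth genkey new
   mmauth update . -l AUTHONLY
   mmauth genkey commit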


6027-1264 GPFS is unresponsive on this Perform problem determination. See “GPFS
node. commands are unsuccessful” on page 362.
Explanation: 6027-1272 Unknown user name userName.
GPFS is up but not responding to the GPFS commands. Explanation:
User response: The specified value cannot be resolved to a valid user
Wait for GPFS to be active again. If the ID (UID).
problem persists, perform the problem determination User response:
procedures and contact the IBM Support Center. Reissue the command with a valid user name.
6027-1265 [I] The fileName kernel extension 6027-1273 Unknown group name groupName.
does not exist. Invoking
mmbuildgpl command to build Explanation:
the GPFS portability layer. The specified value cannot be resolved to a valid group
ID (GID).
Explanation:
The cited kernel extension does not exist. User response:
Reissue the command with a valid group name.
User response:
Create the needed kernel extension by compiling a 6027-1274 Unexpected error obtaining the
custom mmfslinux module for your kernel. Refer to lockName lock.
the steps that are available in /usr/lpp/mmfs/src/ Explanation:
README or copy the binaries from another node with GPFS cannot obtain the specified lock.
the identical environment.
User response:
6027-1268 Missing arguments. Examine any previous error messages. Correct any
Explanation: problems and reissue the command. If the problem
A GPFS administration command received an persists, perform problem determination and contact
insufficient number of arguments. the IBM Support Center.
User response: 6027-1275 Daemon node adapter Node was
Correct the command line and reissue the command. not found on admin node Node.
6027-1269 The device name device starts with Explanation:
a slash, but not /dev/. An input node descriptor was found to be incorrect.
The node adapter specified for GPFS daemon
Explanation: communications was not found to exist on the cited
The device name does not start with /dev/. GPFS administrative node.
User response: User response:
Correct the device name. Correct the input node descriptor and reissue the
6027-1270 The device name device contains a command.
slash, but not as its first character. 6027-1276 Command failed for disks: diskList.
Explanation: Explanation:
The specified device name contains a slash, but the A GPFS command was unable to complete
first character is not a slash. successfully on the listed disks.
User response: User response:
The device name must be an unqualified device name Correct the problems and reissue the command.
or an absolute device path name, for example: fs0
or /dev/fs0. 6027-1277 No contact nodes were provided
for cluster clusterName.
6027-1271 Unexpected error from command.
Return code: value Explanation:
A GPFS command found that no contact nodes have
Explanation: been specified for the cited cluster.
A GPFS administration command (mm…) received an
unexpected error code from an internally called User response:
command. Use the mmremotecluster command to specify
some contact nodes for the cited cluster.
User response:



6027-1278 None of the contact nodes 6027-1290 GPFS configuration data for file
in cluster clusterName can be system fileSystem may not be in
reached. agreement with the on-disk data
for the file system. Issue the
Explanation:
command: mmcommon recoverfs
A GPFS command was unable to reach any of the
fileSystem
contact nodes for the cited cluster.
Explanation:
User response:
GPFS detected that the GPFS configuration database
Determine why the contact nodes for the cited cluster
data for the specified file system may not be in
cannot be reached and correct the problem, or use
agreement with the on-disk data for the file system.
the mmremotecluster command to specify some
This may be caused by a GPFS disk command that did
additional contact nodes that can be reached.
not complete normally.
6027-1285 [E] The directory /var/mmfs of
User response:
quorum node nodeName is on
Issue the specified command to bring the GPFS
an unsupported file system:
configuration database into agreement with the on-
fileSystemType
disk data.
Explanation:
6027-1291 Options name and name cannot be
Directory /var/mmfs of quorum nodes in CCR cluster
specified at the same time.
cannot be on RAM disks.
Explanation:
User response:
Incompatible options were specified on the command
Move /var/mmfs out of unsupported file system then
line.
retry the command.
User response:
6027-1287 Node nodeName returned ENODEV
Select one of the options and reissue the command.
for disk diskName.
6027-1292 The -N option cannot be used with
Explanation:
attribute name.
The specified node returned ENODEV for the specified
disk. Explanation:
The specified configuration attribute cannot be
User response:
changed on only a subset of nodes. This attribute must
Determine the cause of the ENODEV error for the
be the same on all nodes in the cluster.
specified disk and rectify it. The ENODEV may be due to
disk fencing or the removal of a device that previously User response:
was present. Certain attributes, such as autoload, may not be
customized from node to node. Change the attribute
6027-1288 Remote cluster clusterName was
for the entire cluster.
not found.
6027-1293 There are no remote file systems.
Explanation:
A GPFS command found that the cited cluster has not Explanation:
yet been identified to GPFS as a remote cluster. A value of all was specified for the remote file system
operand of a GPFS command, but no remote file
User response:
systems are defined.
Specify a remote cluster known to GPFS, or use
the mmremotecluster command to make the cited User response:
cluster known to GPFS. None. There are no remote file systems on which to
operate.
6027-1289 Name name is not allowed. It
contains the following invalid 6027-1294 Remote file system fileSystem is
special character: char not defined.
Explanation: Explanation:
The cited name is not allowed because it contains the The specified file system was used for the remote
cited invalid special character. file system operand of a GPFS command, but the file
system is not known to GPFS.
User response:
Specify a name that does not contain an invalid special User response:
character, and reissue the command. Specify a remote file system known to GPFS.

Chapter 42. References 793


6027-1295 The GPFS configuration 6027-1301 The NSD servers specified in the
information is incorrect or not disk descriptor do not match the
available. NSD servers currently in effect.
Explanation: Explanation:
A problem has been encountered while verifying The set of NSD servers specified in the disk descriptor
the configuration information and the execution does not match the set that is currently in effect.
environment.
User response:
User response: Specify the same set of NSD servers in the disk
Check the preceding messages for more information. descriptor as is currently in effect or omit it from the
Correct the problem and restart GPFS. disk descriptor and then reissue the command. Use
the mmchnsd command to change the NSD servers as
6027-1296 Device name cannot be 'all'.
needed.
Explanation:
6027-1302 clusterName is the name of the
A device name of all was specified on a GPFS
local cluster.
command.
Explanation:
User response:
The cited cluster name was specified as the name of
Reissue the command with a valid device name.
a remote cluster, but it is already being used as the
6027-1297 Each device specifies name of the local cluster.
metadataOnly for disk usage. This
User response:
file system could not store data.
Use the mmchcluster command to change the name
Explanation: of the local cluster, and then reissue the command
All disk descriptors specify metadataOnly for disk that failed.
usage.
6027-1303 Only alphanumeric and
User response: punctuation characters are
Change at least one disk descriptor in the file allowed.
system to indicate the usage of dataOnly or
Explanation:
dataAndMetadata.
Not all characters from the required input are
6027-1298 Each device specifies dataOnly for alphanumeric and punctuations.
disk usage. This file system could
User response:
not store metadata.
Reissue the command with the required input that only
Explanation: contains alphanumeric and punctuation characters.
All disk descriptors specify dataOnly for disk usage. The following characters are allowed:
0 to 9
User response:
A to Z
Change at least one disk descriptor in the file
a to z
system to indicate a usage of metadataOnly or
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
dataAndMetadata.
6027-1304 Missing argument after option
6027-1299 Incorrect value value specified for
option.
failure group.
Explanation:
Explanation:
The specified command option requires a value.
The specified failure group is not valid.
User response:
User response:
Specify a value and reissue the command.
Correct the problem and reissue the command.
6027-1305 Prerequisite libraries not found
6027-1300 No file systems were found.
or correct version not installed.
Explanation: Ensure productName is properly
A GPFS command searched for file systems, but none installed.
were found.
Explanation:
User response: The specified software product is missing or is not
Create a GPFS file system before reissuing the properly installed.
command.
User response:

794 IBM Storage Scale 5.1.9: Problem Determination Guide


Verify that the product is installed properly. remote cluster has cluster release
level (minReleaseLevel) less
6027-1306 Command command failed with
than 5.1.3.0, change the value of
return code value.
tscCmdAllowRemoteConnection
Explanation: s in the local cluster to "yes".
A command was not successfully processed.
Explanation:
User response: Self-explanatory.
Correct the failure specified by the command and
User response:
reissue the command.
If the remote cluster's minReleaseLevel
6027-1307 Disk disk on node nodeName is below 5.1.3.0, issue mmchconfig
already has a volume group tscCmdAllowRemoteConnections=yes -i.
vgName that does not appear to
6027-1324 [I] The cluster was created with the
have been created by this program
tscCmdAllowRemoteConnection
in a prior invocation. Correct
s configuration parameter set to
the descriptor file or remove the
"no". If a remote cluster is
volume group and retry.
established with another cluster
Explanation: whose release level
The specified disk already belongs to a volume group. (minReleaseLevel) is less than
5.1.3.0, change the value of
User response:
tscCmdAllowRemoteConnections in
Either remove the volume group or remove the disk
this cluster to "yes".
descriptor and retry.
Explanation:
6027-1308 Node nodeName is being used as a
Self-explanatory.
performance monitoring collector
node. User response:
If the remote cluster's minReleaseLevel
Explanation:
is below 5.1.3.0, issue mmchconfig
The specified node is defined as a performance
tscCmdAllowRemoteConnections=yes -i.
monitoring collector node.
6027-1325 [I] Setting the
User response:
tscCmdAllowRemoteConnection
If you are trying to delete the node from the GPFS
s configuration parameter to "no"
cluster, you must remove the specified node as a
requires all remote clusters to be
performance monitoring collector node.
running with release level
6027-1309 Storage pools are not available in (minReleaseLevel) set to at least
the GPFS Express Edition. 5.1.3.0. Otherwise, set
tscCmdAllowRemoteConnection
Explanation:
s to "yes".
Support for multiple storage pools is not part of the
GPFS Express Edition. Explanation:
Self-explanatory.
User response:
Install the GPFS Standard Edition on all nodes in the User response:
cluster, and then reissue the command. If the remote cluster's minReleaseLevel
is below 5.1.3.0, issue mmchconfig
6027-1312 [E] Remote fileset access control
tscCmdAllowRemoteConnections=yes -i.
operation failed: command
6027-1332 Cannot find disk with command.
Explanation:
An internal remote fileset access command failed for Explanation:
specified operation. The specified disk cannot be found.
User response: User response:
Check for additional error messages. Correct the Specify a correct disk name.
problem and reissue the command.
6027-1333 The following nodes could not
6027-1323 [I] The be restored: nodeList. Correct
tscCmdAllowRemoteConnection the problems and use the
s configuration parameter on the mmsdrrestore command to
local cluster has value "no". If the recover these nodes.

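For 6027-1333, the configuration data of the listed nodes is recovered with mmsdrrestore once the underlying problem is corrected. A hedged sketch, assuming node1 holds a good copy of the configuration and node7 is the node being repaired:

   # Restore the GPFS configuration files on node7 from a node with a valid copy.
   mmsdrrestore -p node1 -N node7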


Explanation:
The mmsdrrestore command was unable to restore the configuration information for the listed nodes.
User response:
Correct the problems and reissue the mmsdrrestore command for these nodes.
6027-1334 Incorrect value for option option. Valid values are: validValues.
Explanation:
An incorrect argument was specified with an option requiring one of a limited number of legal options.
User response:
Use one of the legal values for the indicated option.
6027-1335 Command completed: Not all required changes were made.
Explanation:
Some, but not all, of the required changes were made.
User response:
Examine the preceding messages, correct the problems, and reissue the command.
6027-1338 Command is not allowed for remote file systems.
Explanation:
A command for which a remote file system is not allowed was issued against a remote file system.
User response:
Choose a local file system, or issue the command on a node in the cluster that owns the file system.
6027-1339 Disk usage value is incompatible with storage pool name.
Explanation:
A disk descriptor specified a disk usage involving metadata and a storage pool other than system.
User response:
Change the descriptor's disk usage field to dataOnly, or do not specify a storage pool name.
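For illustration only, assuming a hypothetical NSD named nsd1 and a hypothetical user storage pool named datapool: a disk that is assigned to a pool other than system must carry a data-only usage in its stanza, for example:
%nsd:
  nsd=nsd1
  usage=dataOnly
  pool=datapool
Disks that are to hold metadata (usage dataAndMetadata or metadataOnly) belong in the system pool.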
6027-1340 File fileName not found. Recover the file or run mmauth genkey.
Explanation:
The cited file was not found.
User response:
Recover the file or run the mmauth genkey command to recreate it.
6027-1341 Starting force unmount of GPFS file systems
Explanation:
Progress information for the mmshutdown command.
User response:
None. Informational message only.
6027-1342 Unmount not finished after value seconds. Waiting value more seconds.
Explanation:
Progress information for the mmshutdown command.
User response:
None. Informational message only.
6027-1343 Unmount not finished after value seconds.
Explanation:
Progress information for the mmshutdown command.
User response:
None. Informational message only.
6027-1344 Shutting down GPFS daemons
Explanation:
Progress information for the mmshutdown command.
User response:
None. Informational message only.
6027-1345 Finished
Explanation:
Progress information for the mmshutdown command.
User response:
None. Informational message only.
6027-1347 Disk with NSD volume id NSD volume id no longer exists in the GPFS cluster configuration data but the NSD volume id was not erased from the disk. To remove the NSD volume id, issue: mmdelnsd -p NSD volume id
Explanation:
A GPFS administration command (mm...) successfully removed the disk with the specified NSD volume id from the GPFS cluster configuration data but was unable to erase the NSD volume id from the disk.
User response:
Issue the specified command to remove the NSD volume id from the disk.
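A sketch of the cleanup command from the message, using a made-up NSD volume id purely as a placeholder:
mmdelnsd -p 0A0B0C0D61234567
If the disk is reachable only from certain nodes, the -N nodeNameList form shown in the next message (6027-1348) directs the command to those nodes.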
6027-1348 Disk with NSD volume id NSD volume id no longer exists in the GPFS cluster configuration data but the NSD volume id was not erased from the disk. To remove the NSD volume id, issue: mmdelnsd -p NSD volume id -N nodeNameList
Explanation:
A GPFS administration command (mm...) successfully removed the disk with the specified NSD volume id from the GPFS cluster configuration data but was unable to erase the NSD volume id from the disk.
User response:
Issue the specified command to remove the NSD volume id from the disk.
6027-1352 fileSystem is not a remote file system known to GPFS.
Explanation:
The cited file system is not the name of a remote file system known to GPFS.
User response:
Use the mmremotefs command to identify the cited file system to GPFS as a remote file system, and then reissue the command that failed.
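A minimal sketch of registering a remote file system, assuming a hypothetical owning cluster remote.example.com, a remote device rfs1, and a local mount point /remote/rfs1 (the owning cluster must already have been defined with mmremotecluster add):
mmremotefs add rfs1 -f rfs1 -C remote.example.com -T /remote/rfs1
mmremotefs show all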
6027-1357 An internode connection between GPFS nodes was disrupted.
Explanation:
An internode connection between GPFS nodes was disrupted, preventing its successful completion.
User response:
Reissue the command. If the problem recurs, determine and resolve the cause of the disruption. If the problem persists, contact the IBM Support Center.
6027-1358 No clusters are authorized to access this cluster.
Explanation:
Self-explanatory.
User response:
This is an informational message.
6027-1359 Cluster clusterName is not authorized to access this cluster.
Explanation:
Self-explanatory.
User response:
This is an informational message.
6027-1361 Attention: There are no available valid VFS type values for mmfs in /etc/vfs.
Explanation:
An out of range number was used as the vfs number for GPFS.
User response:
The valid range is 8 through 32. Check /etc/vfs and remove unneeded entries.
6027-1362 There are no remote cluster definitions.
Explanation:
A value of all was specified for the remote cluster operand of a GPFS command, but no remote clusters are defined.
User response:
None. There are no remote clusters on which to operate.
6027-1363 Remote cluster clusterName is not defined.
Explanation:
The specified cluster was specified for the remote cluster operand of a GPFS command, but the cluster is not known to GPFS.
User response:
Specify a remote cluster known to GPFS.
6027-1364 No disks specified
Explanation:
There were no disks in the descriptor list or file.
User response:
Specify at least one disk.
6027-1365 Disk diskName already belongs to file system fileSystem.
Explanation:
The specified disk name is already assigned to a GPFS file system. This may be because the disk was specified more than once as input to the command, or because the disk was assigned to a GPFS file system in the past.
User response:
Specify the disk only once as input to the command, or specify a disk that does not belong to a file system.
6027-1366 File system fileSystem has some disks that are in a non-ready state.
Explanation:
The specified file system has some disks that are in a non-ready state.
User response:
Run mmcommon recoverfs fileSystem to ensure that the GPFS configuration data for the file system is current. If some disks are still in a non-ready state, display the states of the disks in the file system using the mmlsdisk command. Any disks in an undesired non-ready state should be brought into the ready state by using the mmchdisk command or by mounting the file system. If these steps do not bring the disks into the ready state, use the mmdeldisk command to delete the disks from the file system.
6027-1362 There are no remote cluster
definitions. Explanation:
Explanation:
The process of marking the disks as available could not be completed.
User response:
Before adding these disks to a GPFS file system, you should either reformat them, or use the -v no option on the mmcrfs or mmadddisk command.
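For example, with a hypothetical file system fs1 and stanza file newdisks.stanza, the verification can be bypassed when you are certain the disks may be overwritten:
mmadddisk fs1 -F newdisks.stanza -v no
The -v no option disables the check that the disks do not already belong to an existing file system.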
6027-1368 This GPFS cluster contains declarations for remote file systems and clusters. You cannot delete the last node. First use the delete option of the mmremotecluster and mmremotefs commands.
Explanation:
An attempt has been made to delete a GPFS cluster that still has declarations for remote file systems and clusters.
User response:
Before deleting the last node of a GPFS cluster, delete all remote cluster and file system information. Use the delete option of the mmremotecluster and mmremotefs commands.
6027-1370 The following nodes could not be reached:
Explanation:
A GPFS command was unable to communicate with one or more nodes in the cluster. A list of the nodes that could not be reached follows.
User response:
Determine why the reported nodes could not be reached and resolve the problem.
6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
Explanation:
A process is initiated to distribute the cluster configuration data to other nodes in the cluster.
User response:
This is an informational message. The command does not wait for the distribution to finish.
6027-1373 There is no file system information in input file fileName.
Explanation:
The cited input file passed to the mmimportfs command contains no file system information. No file system can be imported.
User response:
Reissue the mmimportfs command while specifying a valid input file.
6027-1374 File system fileSystem was not found in input file fileName.
Explanation:
The specified file system was not found in the input file passed to the mmimportfs command. The file system cannot be imported.
User response:
Reissue the mmimportfs command while specifying a file system that exists in the input file.
6027-1375 The following file systems were not imported: fileSystem.
Explanation:
The mmimportfs command was unable to import one or more of the file systems in the input file. A list of the file systems that could not be imported follows.
User response:
Examine the preceding messages, rectify the problems that prevented the importation of the file systems, and reissue the mmimportfs command.
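A brief sketch, assuming a hypothetical file system fs2 and an input file /tmp/fsexport that was produced earlier by mmexportfs:
mmimportfs fs2 -i /tmp/fsexport
Specifying all instead of a device name imports every file system contained in the input file.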
6027-1377 Attention: Unknown attribute specified: name. Press the ENTER key to continue.
Explanation:
The mmchconfig command received an unknown attribute.
User response:
Unless directed otherwise by the IBM Support Center, press any key to bypass this attribute.
6027-1378 Incorrect record found in the mmsdrfs file (code value):
Explanation:
A line that is not valid was detected in the main GPFS cluster configuration file /var/mmfs/gen/mmsdrfs.
User response:
The data in the cluster configuration file is incorrect. If no user modifications have been made to this file, contact the IBM Support Center. If user modifications have been made, correct these modifications.
6027-1379 There is no file system with mount point mountpoint.
Explanation:
No file system in the GPFS cluster has the specified mount point.
User response:
Reissue the command with a valid file system.
6027-1380 File system fileSystem is already mounted at mountpoint.
Explanation:
The specified file system is mounted at a mount point different than the one requested on the mmmount command.
User response:
Unmount the file system and reissue the command.
6027-1381 Mount point cannot be specified when mounting all file systems.
Explanation:
A device name of all and a mount point were specified on the mmmount command.
User response:
Reissue the command with a device name for a single file system or do not specify a mount point.
6027-1382 This node does not belong to a GPFS cluster.
Explanation:
The specified node does not appear to belong to a GPFS cluster, or the GPFS configuration information on the node has been lost.
User response:
Informational message. If you suspect that there is corruption of the GPFS configuration information, recover the data following the procedures outlined in “Recovery from loss of GPFS cluster configuration data file” on page 355.
6027-1383 There is no record for this node in file fileName. Either the node is not part of the cluster, the file is for a different cluster, or not all of the node's adapter interfaces have been activated yet.
Explanation:
The mmsdrrestore command cannot find a record for this node in the specified cluster configuration file. The search of the file is based on the currently active IP addresses of the node as reported by the ifconfig command.
User response:
Ensure that all adapter interfaces are properly functioning. Ensure that the correct GPFS configuration file is specified on the command line. If the node indeed is not a member of the cluster, use the mmaddnode command instead.
6027-1386 Unexpected value for Gpfs object: value.
Explanation:
A function received a value that is not allowed for the Gpfs object.
User response:
Perform problem determination.
6027-1388 File system fileSystem is not known to the GPFS cluster.
Explanation:
The file system was not found in the GPFS cluster.
User response:
If the file system was specified as part of a GPFS command, reissue the command with a valid file system.
6027-1390 Node node does not belong to the GPFS cluster, or was specified as input multiple times.
Explanation:
Nodes that are not valid were specified.
User response:
Verify the list of nodes. All specified nodes must belong to the GPFS cluster, and each node can be specified only once.
6027-1393 Incorrect node designation specified: type.
Explanation:
A node designation that is not valid was specified. Valid values are client or manager.
User response:
Correct the command line and reissue the command.
6027-1394 Operation not allowed for the local cluster.
Explanation:
The requested operation cannot be performed for the local cluster.
User response:
Specify the name of a remote cluster.
6027-1397 [E] Operation not allowed for a file system that is in maintenance mode: name1
Explanation:
The requested operation cannot be performed for a file system that is in maintenance mode.
User response:
Issue mmchfs --maintenance-mode no to turn off maintenance mode and then reissue the command.
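As given in the user response, for a hypothetical file system fs1:
mmchfs fs1 --maintenance-mode no
Reissue the failed command after the file system has left maintenance mode.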
6027-1450 Could not allocate storage.
Explanation:
Sufficient memory cannot be allocated to run the mmsanrepairfs command.
User response:
Increase the amount of memory available.
6027-1500 [E] Open devicetype device failed with error:
Explanation: Explanation:
The "open" of a device failed. Operation of the file Could not obtain a minor number for the specified
system may continue unless this device is needed for block or character device.
operation. If this is a replicated disk device, it will
User response:
often not be needed. If this is a block or character
Problem diagnosis will depend on the subsystem that
device for another subsystem (such as /dev/VSD0)
the device belongs to. For example, device /dev/
then GPFS will discontinue operation.
VSD0 belongs to the IBM Virtual Shared Disk
User response: subsystem and problem determination should follow
Problem diagnosis will depend on the subsystem that guidelines in that subsystem's documentation.
the device belongs to. For instance device "/dev/VSD0"
6027-1507 READ_KEYS ioctl failed
belongs to the IBM Virtual Shared Disk subsystem and
with errno=returnCode,
problem determination should follow guidelines in that
tried timesTried times.
subsystem's documentation. If this is a normal disk
Related values are
device then take needed repair action on the specified
scsi_status=scsiStatusValue,
disk.
sense_key=senseKeyValue,
6027-1501 [X] Volume label of disk name is scsi_asc=scsiAscValue,
name, should be uid. scsi_ascq=scsiAscqValue.
Explanation: Explanation:
The UID in the disk descriptor does not match A READ_KEYS ioctl call failed with the errno= and
the expected value from the file system descriptor. related values shown.
This could occur if a disk was overwritten by
User response:
another application or if the IBM Virtual Shared Disk
Check the reported errno= value and try to correct
subsystem incorrectly identified the disk.
the problem. If the problem persists, contact the IBM
User response: Support Center.
Check the disk configuration.
6027-1508 Registration failed
6027-1502 [X] Volume label of disk diskName is with errno=returnCode,
corrupt. tried timesTried times.
Related values are
Explanation:
scsi_status=scsiStatusValue,
The disk descriptor has a bad magic number,
sense_key=senseKeyValue,
version, or checksum. This could occur if a disk was
scsi_asc=scsiAscValue,
overwritten by another application or if the IBM Virtual
scsi_ascq=scsiAscqValue.
Shared Disk subsystem incorrectly identified the disk.
Explanation:
User response:
A REGISTER ioctl call failed with the errno= and
Check the disk configuration.
related values shown.
6027-1503 Completed adding disks to file
User response:
system fileSystem.
Check the reported errno= value and try to correct
Explanation: the problem. If the problem persists, contact the IBM
The mmadddisk command successfully completed. Support Center.
User response: 6027-1509 READRES ioctl failed
None. Informational message only. with errno=returnCode,
tried timesTried times.
6027-1504 File name could not be run with err
Related values are
error.
scsi_status=scsiStatusValue,
Explanation: sense_key=senseKeyValue,
A failure occurred while trying to run an external scsi_asc=scsiAscValue,
program. scsi_ascq=scsiAscqValue.
User response: Explanation:
Make sure the file exists. If it does, check its access A READRES ioctl call failed with the errno= and
permissions. related values shown.
6027-1505 Could not get minor number for User response:
name.
Check the reported errno= value and try to correct host_status=hostStatusValue,
the problem. If the problem persists, contact the IBM driver_status=driverStatsValue.
Support Center.
Explanation:
6027-1510 [E] Error mounting file system An ioctl call failed with stated return code, errno
stripeGroup on mountPoint; value, and related values.
errorQualifier (gpfsErrno)
User response:
Explanation: Check the reported errno and correct the problem if
An error occurred while attempting to mount a GPFS possible. Otherwise, contact the IBM Support Center.
file system on Windows.
6027-1516 REGISTER ioctl failed with
User response: rc=returnCode. Related values
Examine the error details, previous errors, and the are SCSI status=scsiStatusValue,
GPFS message log to identify the cause. host_status=hostStatusValue,
driver_status=driverStatsValue.
6027-1511 [E] Error unmounting file system
stripeGroup; errorQualifier Explanation:
(gpfsErrno) An ioctl call failed with stated return code, errno
value, and related values.
Explanation:
An error occurred while attempting to unmount a GPFS User response:
file system on Windows. Check the reported errno and correct the problem if
possible. Otherwise, contact the IBM Support Center.
User response:
Examine the error details, previous errors, and the 6027-1517 READ RESERVE ioctl failed with
GPFS message log to identify the cause. rc=returnCode. Related values
are SCSI status=scsiStatusValue,
6027-1512 [E] WMI query for queryType failed;
host_status=hostStatusValue,
errorQualifier (gpfsErrno)
driver_status=driverStatsValue.
Explanation:
Explanation:
An error occurred while running a WMI query on
An ioctl call failed with stated return code, errno
Windows.
value, and related values.
User response:
User response:
Examine the error details, previous errors, and the
Check the reported errno and correct the problem if
GPFS message log to identify the cause.
possible. Otherwise, contact the IBM Support Center.
6027-1513 DiskName is not an sg device, or sg
6027-1518 RESERVE ioctl failed with
driver is older than sg3
rc=returnCode. Related values
Explanation: are SCSI status=scsiStatusValue,
The disk is not a SCSI disk, or supports SCSI standard host_status=hostStatusValue,
older than SCSI 3. driver_status=driverStatsValue.
User response: Explanation:
Correct the command invocation and try again. An ioctl call failed with stated return code, errno
value, and related values.
6027-1514 ioctl failed with rc=returnCode.
Related values are User response:
SCSI status=scsiStatusValue, Check the reported errno and correct the problem if
host_status=hostStatusValue, possible. Otherwise, contact the IBM Support Center.
driver_status=driverStatsValue.
6027-1519 INQUIRY ioctl failed with
Explanation: rc=returnCode. Related values
An ioctl call failed with stated return code, errno are SCSI status=scsiStatusValue,
value, and related values. host_status=hostStatusValue,
driver_status=driverStatsValue.
User response:
Check the reported errno and correct the problem if Explanation:
possible. Otherwise, contact the IBM Support Center. An ioctl call failed with stated return code, errno
value, and related values.
6027-1515 READ KEY ioctl failed with
rc=returnCode. Related values are User response:
SCSI status=scsiStatusValue,
Check the reported errno and correct the problem if Explanation:
possible. Otherwise, contact the IBM Support Center. The second argument to the mmcrfsc command is
minReleaseLevel and should be greater than 0.
6027-1520 PREEMPT ABORT ioctl failed with
rc=returnCode. Related values User response:
are SCSI status=scsiStatusValue, minReleaseLevel should be greater than 0. The
host_status=hostStatusValue, mmcrfs command should never call the mmcrfsc
driver_status=driverStatsValue. command without a valid minReleaseLevel
argument. Contact the IBM Support Center.
Explanation:
An ioctl call failed with stated return code, errno 6027-1530 Attention: parameter is set to
value, and related values. value.
User response: Explanation:
Check the reported errno and correct the problem if A configuration parameter is temporarily assigned a
possible. Otherwise, contact the IBM Support Center. new value.
6027-1521 Cannot find register key User response:
registerKeyValue at device Check the mmfs.cfg file. Use the mmchconfig
diskName. command to set a valid value for the parameter.
Explanation: 6027-1531 parameter value
Unable to find given register key at the disk.
Explanation:
User response: The configuration parameter was changed from its
Correct the problem and reissue the command. default value.
6027-1522 CLEAR ioctl failed with User response:
rc=returnCode. Related values Check the mmfs.cfg file.
are SCSI status=scsiStatusValue,
6027-1532 Attention: parameter (value) is
host_status=hostStatusValue,
not valid in conjunction with
driver_status=driverStatsValue.
parameter (value).
Explanation:
Explanation:
An ioctl call failed with stated return code, errno
A configuration parameter has a value that is not valid
value, and related values.
in relation to some other parameter. This can also
User response: happen when the default value for some parameter
Check the reported errno and correct the problem if is not sufficiently large for the new, user set value of a
possible. Otherwise, contact the IBM Support Center. related parameter.
6027-1523 Disk name longer than value is not User response:
allowed. Check the mmfs.cfg file.
Explanation: 6027-1533 parameter cannot be set
The specified disk name is too long. dynamically.
User response: Explanation:
Reissue the command with a valid disk name. The mmchconfig command encountered a
configuration parameter that cannot be set
6027-1524 The READ_KEYS ioctl data does
dynamically.
not contain the key that was
passed as input. User response:
Check the mmchconfig command arguments. If the
Explanation:
parameter must be changed, use the mmshutdown,
A REGISTER ioctl call apparently succeeded, but
mmchconfig, and mmstartup sequence of
when the device was queried for the key, the key was
commands.
not found.
6027-1534 parameter must have a value.
User response:
Check the device subsystem and try to correct the Explanation:
problem. If the problem persists, contact the IBM The tsctl command encountered a configuration
Support Center. parameter that did not have a specified value.
6027-1525 Invalid minReleaseLevel User response:
parameter: value Check the mmchconfig command arguments.
6027-1535 Unknown config name: parameter Explanation:
A new GPFS daemon started and found existing shared
Explanation: segments. The contents were not recognizable, so the
The tsctl command encountered an unknown GPFS daemon could not clean them up.
configuration parameter.
User response: User response
Check the mmchconfig command arguments.
1. Stop the GPFS daemon from trying to start by
6027-1536 parameter must be set using the issuing the mmshutdown command for the nodes
tschpool command. having the problem.
Explanation: 2. Find the owner of the shared segments with keys
The tsctl command encountered a configuration from 0x9283a0ca through 0x9283a0d1. If a non-
parameter that must be set using the tschpool GPFS program owns these segments, GPFS cannot
command. run on this node.
User response: 3. If these segments are left over from a previous
Check the mmchconfig command arguments. GPFS daemon:
6027-1537 [E] Connect failed to ipAddress: a. Remove them by issuing:
reason
ipcrm -m shared_memory_id
Explanation:
An attempt to connect sockets between nodes failed. b. Restart GPFS by issuing the mmstartup
command on the affected nodes.
User response:
Check the reason listed and the connection to the 6027-1543 error propagating parameter.
indicated IP address.
Explanation:
6027-1538 [I] Connect in progress to ipAddress mmfsd could not propagate a configuration parameter
value to one or more nodes in the cluster.
Explanation:
Connecting sockets between nodes. User response:
Contact the IBM Support Center.
User response:
None. Information message only. 6027-1544 [W] Sum of prefetchthreads(value),
worker1threads(value) and
6027-1539 [E] Connect progress select failed to
nsdMaxWorkerThreads (value)
ipAddress: reason
exceeds value. Reducing them to
Explanation: value, value and value.
An attempt to connect sockets between nodes failed.
Explanation:
User response: The sum of prefetchthreads, worker1threads,
Check the reason listed and the connection to the and nsdMaxWorkerThreads exceeds the permitted
indicated IP address. value.
6027-1540 [A] Try and buy license has expired! User response:
Accept the calculated values or reduce
Explanation:
the individual settings using mmchconfig
Self explanatory.
prefetchthreads=newvalue or mmchconfig
User response: worker1threads=newvalue. or mmchconfig
Purchase a GPFS license to continue using GPFS. nsdMaxWorkerThreads=newvalue. After using
mmchconfig, the new settings will not take affect
6027-1541 [N] Try and buy license expires in
until the GPFS daemon is restarted.
number days.
6027-1545 [A] The GPFS product that you are
Explanation:
attempting to run is not a fully
Self-explanatory.
functioning version. This probably
User response: means that this is an update
When the Try and Buy license expires, you will need to version and not the full product
purchase a GPFS license to continue using GPFS. version. Install the GPFS full
product version first, then apply
6027-1542 [A] Old shared memory exists but it is
not valid nor cleanable.
any applicable update version Explanation:
before attempting to start GPFS. GPFS tried to establish an LDAP session with an Active
Directory server (normally the domain controller host),
Explanation:
and has been unable to do so.
GPFS requires a fully licensed GPFS installation.
User response:
User response:
Ensure the domain controller is available.
Verify installation of licensed GPFS, or purchase and
install a licensed version of GPFS. 6027-1555 Mount point and device name
cannot be equal: name
6027-1546 [W] Attention: parameter size of value
is too small. New value is value. Explanation:
The specified mount point is the same as the absolute
Explanation:
device name.
A configuration parameter is temporarily assigned a
new value. User response:
Enter a new device name or absolute mount point path
User response:
name.
Check the mmfs.cfg file. Use the mmchconfig
command to set a valid value for the parameter. 6027-1556 Interrupt received.
6027-1547 [A] Error initializing daemon: Explanation:
performing shutdown A GPFS administration command received an
interrupt.
Explanation:
GPFS kernel extensions are not loaded, and the User response:
daemon cannot initialize. GPFS may have been started None. Informational message only.
incorrectly.
6027-1557 You must first generate an
User response: authentication key file. Run:
Check GPFS log for errors resulting from kernel mmauth genkey new.
extension loading. Ensure that GPFS is started with the
Explanation:
mmstartup command.
Before setting a cipher list, you must generate an
6027-1548 [A] Error: daemon and kernel authentication key file.
extension do not match.
User response:
Explanation: Run the specified command to establish an
The GPFS kernel extension loaded in memory and the authentication key for the nodes in the cluster.
daemon currently starting do not appear to have come
6027-1559 The -i option failed. Changes will
from the same build.
take effect after GPFS is restarted.
User response:
Explanation:
Ensure that the kernel extension was reloaded after
The -i option on the mmchconfig command failed.
upgrading GPFS. See “GPFS modules cannot be
The changes were processed successfully, but will
loaded on Linux” on page 357 for details.
take effect only after the GPFS daemons are restarted.
6027-1549 [A] Attention: custom-built kernel
User response:
extension; the daemon and kernel
Check for additional error messages. Correct the
extension do not match.
problem and reissue the command.
Explanation:
6027-1560 This GPFS cluster contains file
The GPFS kernel extension loaded in memory does not
systems. You cannot delete the
come from the same build as the starting daemon. The
last node.
kernel extension appears to have been built from the
kernel open source package. Explanation:
An attempt has been made to delete a GPFS cluster
User response:
that still has one or more file systems associated with
None.
it.
6027-1550 [W] Error: Unable to establish a
User response:
session with an Active Directory
Before deleting the last node of a GPFS cluster, delete
server. ID remapping via Microsoft
all file systems that are associated with it. This applies
Identity Management for Unix will
to both local and remote file systems.
be unavailable.
6027-1561 Attention: Failed to remove node- 6027-1566 Remote cluster clusterName is
specific changes. already defined.
Explanation: Explanation:
The internal mmfixcfg routine failed to remove node- A request was made to add the cited cluster, but the
specific configuration settings, if any, for one or more cluster is already known to GPFS.
of the nodes being deleted. This is of consequence
User response:
only if the mmchconfig command was indeed used to
None. The cluster is already known to GPFS.
establish node specific settings and these nodes are
later added back into the cluster. 6027-1567 fileSystem from cluster
clusterName is already defined.
User response:
If you add the nodes back later, ensure that the Explanation:
configuration parameters for the nodes are set as A request was made to add the cited file system from
desired. the cited cluster, but the file system is already known
to GPFS.
6027-1562 command command cannot be
executed. Either none of the nodes User response:
in the cluster are reachable, or None. The file system is already known to GPFS.
GPFS is down on all of the nodes.
6027-1568 command command failed. Only
Explanation: parameterList changed.
The command that was issued needed to perform an
Explanation:
operation on a remote node, but none of the nodes in
The mmchfs command failed while making the
the cluster were reachable, or GPFS was not accepting
requested changes. Any changes to the attributes
commands on any of the nodes.
in the indicated parameter list were successfully
User response: completed. No other file system attributes were
Ensure that the affected nodes are available and changed.
all authorization requirements are met. Correct any
User response:
problems and reissue the command.
Reissue the command if you want to change additional
6027-1563 Attention: The file system may no attributes of the file system. Changes can be undone
longer be properly balanced. by issuing the mmchfs command with the original
value for the affected attribute.
Explanation:
The restripe phase of the mmadddisk or mmdeldisk 6027-1570 virtual shared disk support is not
command failed. installed.
User response: Explanation:
Determine the cause of the failure and run the The command detected that IBM Virtual Shared Disk
mmrestripefs -b command. support is not installed on the node on which it is
running.
6027-1564 To change the authentication key
for the local cluster, run: mmauth User response:
genkey. Install IBM Virtual Shared Disk support.
Explanation: 6027-1571 commandName does not exist or
The authentication keys for the local cluster must be failed; automount mounting may
created only with the specified command. not work.
User response: Explanation:
Run the specified command to establish a new One or more of the GPFS file systems were defined
authentication key for the nodes in the cluster. with the automount attribute but the requisite
automount command is missing or failed.
6027-1565 disk not found in file system
fileSystem. User response:
Correct the problem and restart GPFS. Or use the
Explanation:
mount command to explicitly mount the file system.
A disk specified for deletion or replacement does not
exist. 6027-1572 The command must run on a node
that is part of the cluster.
User response:
Specify existing disks for the indicated file system. Explanation:
The node running the mmcrcluster command (this A GPFS administration command (prefixed by mm)
node) must be a member of the GPFS cluster. was asked to operate on an unknown GPFS cluster
type. The only supported GPFS cluster type is lc. This
User response:
message may also be generated if there is corruption
Issue the command from a node that will belong to the
in the GPFS system files.
cluster.
User response:
6027-1573 Command completed: No changes
Verify that the correct level of GPFS is installed on
made.
the node. If this is a cluster environment, make sure
Explanation: the node has been defined as a member of the
Informational message. GPFS cluster with the help of the mmcrcluster or
the mmaddnode command. If the problem persists,
User response:
contact the IBM Support Center.
Check the preceding messages, correct any problems,
and reissue the command. 6027-1590 nodeName cannot be reached.
6027-1574 Permission failure. The command Explanation:
requires root authority to execute. A command needs to issue a remote function on a
particular node but the node is not reachable.
Explanation:
The command, or the specified command option, User response:
requires root authority. Determine why the node is unreachable, correct the
problem, and reissue the command.
User response:
Log on as root and reissue the command. 6027-1591 Attention: Unable to retrieve GPFS
cluster files from node nodeName
6027-1578 File fileName does not contain
node names. Explanation:
A command could not retrieve the GPFS cluster files
Explanation:
from a particular node. An attempt will be made to
The specified file does not contain valid node names.
retrieve the GPFS cluster files from a backup node.
User response:
User response:
Node names must be specified one per line. The name
None. Informational message only.
localhost and lines that start with '#' character are
ignored. 6027-1592 Unable to retrieve GPFS cluster
files from node nodeName
6027-1579 File fileName does not contain
data. Explanation:
A command could not retrieve the GPFS cluster files
Explanation:
from a particular node.
The specified file does not contain data.
User response:
User response:
Correct the problem and reissue the command.
Verify that you are specifying the correct file name and
reissue the command. 6027-1594 Run the command command until
successful.
6027-1587 Unable to determine the local
device name for disk nsdName on Explanation:
node nodeName. The command could not complete normally. The GPFS
cluster data may be left in a state that precludes
Explanation:
normal operation until the problem is corrected.
GPFS was unable to determine the local device name
for the specified GPFS disk. User response:
Check the preceding messages, correct the problems,
User response:
and issue the specified command until it completes
Determine why the specified disk on the specified
successfully.
node could not be accessed and correct the problem.
Possible reasons include: connectivity problems, 6027-1595 No nodes were found that
authorization problems, fenced disk, and so forth. matched the input specification.
6027-1588 Unknown GPFS execution Explanation:
environment: value No nodes were found in the GPFS cluster that matched
those specified as input to a GPFS command.
Explanation:
User response:
Determine why the specified nodes were not valid, This message will normally be followed by a message
correct the problem, and reissue the GPFS command. telling you which command to issue as soon as the
problem is corrected and the specified nodes become
6027-1596 The same node was specified
available.
for both the primary and the
secondary server. 6027-1602 nodeName is not a member of this
cluster.
Explanation:
A command would have caused the primary and Explanation:
secondary GPFS cluster configuration server nodes to A command found that the specified node is not a
be the same. member of the GPFS cluster.
User response: User response:
Specify a different primary or secondary node. Correct the input or add the node to the GPFS cluster
and reissue the command.
6027-1597 Node node is specified more than
once. 6027-1603 The following nodes could not
be added to the GPFS cluster:
Explanation:
nodeList. Correct the problems and
The same node appears more than once on the
use the mmaddnode command to
command line or in the input file for the command.
add these nodes to the cluster.
User response:
Explanation:
All specified nodes must be unique. Note that even
The mmcrcluster or the mmaddnode command was
though two node identifiers may appear different on
unable to add the listed nodes to a GPFS cluster.
the command line or in the input file, they may still
refer to the same node. User response:
Correct the problems and add the nodes to the cluster
6027-1598 Node nodeName was not added to
using the mmaddnode command.
the cluster. The node appears to
already belong to a GPFS cluster. 6027-1604 Information cannot be displayed.
Either none of the nodes in the
Explanation:
cluster are reachable, or GPFS is
A GPFS cluster command found that a node to be
down on all of the nodes.
added to a cluster already has GPFS cluster files on
it. Explanation:
The command needed to perform an operation on a
User response:
remote node, but none of the nodes in the cluster were
Use the mmlscluster command to verify that the
reachable, or GPFS was not accepting commands on
node is in the correct cluster. If it is not, follow the
any of the nodes.
procedure in “Node cannot be added to the GPFS
cluster” on page 351. User response:
Ensure that the affected nodes are available and
6027-1599 The level of GPFS on node
all authorization requirements are met. Correct any
nodeName does not support the
problems and reissue the command.
requested action.
6027-1610 Disk diskName is the only disk in
Explanation:
file system fileSystem. You cannot
A GPFS command found that the level of the GPFS
replace a disk when it is the only
code on the specified node is not sufficient for the
remaining disk in the file system.
requested action.
Explanation:
User response:
The mmrpldisk command was issued, but there is
Install the correct level of GPFS.
only one disk in the file system.
6027-1600 Make sure that the following nodes
User response:
are available: nodeList
Add a second disk and reissue the command.
Explanation:
6027-1613 WCOLL (working collective)
A GPFS command was unable to complete because
environment variable not set.
nodes critical for the success of the operation were not
reachable or the command was interrupted. Explanation:
The mmdsh command was invoked without explicitly
User response:
specifying the nodes on which the command is to run
by means of the -F or -L options, and the WCOLL 6027-1619 Unable to redirect outputStream.
environment variable has not been set. Error string was: string.
User response: Explanation:
Change the invocation of the mmdsh command to use The mmdsh command attempted to redirect an output
the -F or -L options, or set the WCOLL environment stream using open, but the open command failed.
variable before invoking the mmdsh command.
User response:
6027-1614 Cannot open file fileName. Error Determine why the call to open failed and correct the
string was: errorString. problem.
Explanation: 6027-1623 command: Mounting file
The mmdsh command was unable to successfully open systems ...
a file.
Explanation:
User response: This message contains progress information about the
Determine why the file could not be opened and mmmount command.
correct the problem.
User response:
6027-1615 nodeName remote shell process None. Informational message only.
had return code value.
6027-1625 option cannot be used with
Explanation: attribute name.
A child remote shell process completed with a nonzero
return code. Explanation:
An attempt was made to change a configuration
User response: attribute and requested the change to take effect
Determine why the child remote shell process failed immediately (-i or -I option). However, the specified
and correct the problem. attribute does not allow the operation.
6027-1616 Caught SIG signal - terminating User response:
the child processes. If the change must be made now, leave off the -i or
-I option. Then recycle the nodes to pick up the new
Explanation:
value.
The mmdsh command has received a signal causing it
to terminate. 6027-1626 Command is not supported in the
User response: type environment.
Determine what caused the signal and correct the Explanation:
problem. A GPFS administration command (mm...) is not
supported in the specified environment.
6027-1617 There are no available nodes on
which to run the command. User response:
Verify if the task is needed in this environment, and if it
Explanation:
is, use a different command.
The mmdsh command found that there are no available
nodes on which to run the specified command. 6027-1627 The following nodes are not aware
Although nodes were specified, none of the nodes of the configuration server change:
were reachable. nodeList. Do not start GPFS on the
User response: above nodes until the problem is
Determine why the specified nodes were not available resolved.
and correct the problem. Explanation:
6027-1618 The mmchcluster command could not propagate
Unable to pipe. Error string was:
the new cluster configuration servers to the specified
errorString.
nodes.
Explanation:
User response:
The mmdsh command attempted to open a pipe, but
Correct the problems and run the mmchcluster
the pipe command failed.
-p LATEST command before starting GPFS on the
User response: specified nodes.
Determine why the call to pipe failed and correct the
6027-1628 Cannot determine basic
problem.
environment information. Not
enough nodes are available.
Explanation: environment or if the mmchcluster command did not
The mmchcluster command was unable to retrieve complete successfully.
the GPFS cluster data files. Usually, this is due to too
User response:
few nodes being available.
Correct any problems and issue the mmrefresh
User response: -f -a command. If the problem persists, perform
Correct any problems and ensure that as many of problem determination and contact the IBM Support
the nodes in the cluster are available as possible. Center.
Reissue the command. If the problem persists, record
6027-1633 Failed to create a backup copy
the above information and contact the IBM Support
of the GPFS cluster data on
Center.
nodeName.
6027-1629 Error found while checking node
Explanation:
descriptor descriptor
Commit could not create a correct copy of the GPFS
Explanation: cluster configuration data.
A node descriptor was found to be unsatisfactory in
User response:
some way.
Check the preceding messages, correct any problems,
User response: and reissue the command. If the problem persists,
Check the preceding messages, if any, and correct perform problem determination and contact the IBM
the condition that caused the disk descriptor to be Support Center.
rejected.
6027-1634 The GPFS cluster configuration
6027-1630 The GPFS cluster data on server node nodeName cannot be
nodeName is back level. removed.
Explanation: Explanation:
A GPFS command attempted to commit changes to the An attempt was made to delete a GPFS cluster
GPFS cluster configuration data, but the data on the configuration server node.
server is already at a higher level. This can happen
User response:
if the GPFS cluster configuration files were altered
You cannot remove a cluster configuration server node
outside the GPFS environment, or if the mmchcluster
unless all nodes in the GPFS cluster are being deleted.
command did not complete successfully.
Before deleting a cluster configuration server node,
User response: you must use the mmchcluster command to transfer
Correct any problems and reissue the command. If its function to another node in the GPFS cluster.
the problem persists, issue the mmrefresh -f -a
6027-1636 Error found while checking disk
command.
descriptor descriptor
6027-1631 The commit process failed.
Explanation:
Explanation: A disk descriptor was found to be unsatisfactory in
A GPFS administration command (mm...) cannot some way.
commit its changes to the GPFS cluster configuration
User response:
data.
Check the preceding messages, if any, and correct
User response: the condition that caused the disk descriptor to be
Examine the preceding messages, correct the rejected.
problem, and reissue the command. If the problem
6027-1637 command quitting. None of the
persists, perform problem determination and contact
specified nodes are valid.
the IBM Support Center.
Explanation:
6027-1632 The GPFS cluster configuration
A GPFS command found that none of the specified
data on nodeName is different
nodes passed the required tests.
than the data on nodeName.
User response:
Explanation:
Determine why the nodes were not accepted, fix the
The GPFS cluster configuration data on the primary
problems, and reissue the command.
cluster configuration server node is different than
the data on the secondary cluster configuration 6027-1638 Command: There are no
server node. This can happen if the GPFS cluster unassigned nodes in the cluster.
configuration files were altered outside the GPFS
Explanation:
A GPFS command in a cluster environment needs Check whether there is a valid reason why the node
unassigned nodes, but found there are none. should be fenced out from the disk. If there is no such
reason, unfence the disk and reissue the command.
User response:
Verify whether there are any unassigned nodes in the 6027-1647 Unable to find disk with NSD
cluster. If there are none, either add more nodes to volume id NSD volume id.
the cluster using the mmaddnode command, or delete
Explanation:
some nodes from the cluster using the mmdelnode
A disk with the specified NSD volume id cannot be
command, and then reissue the command.
found.
6027-1639 Command failed. Examine
User response:
previous error messages to
Specify a correct disk NSD volume id.
determine cause.
6027-1648 GPFS was unable to obtain a lock
Explanation:
from node nodeName.
A GPFS command failed due to previously-reported
errors. Explanation:
GPFS failed in its attempt to get a lock from another
User response:
node in the cluster.
Check the previous error messages, fix the problems,
and then reissue the command. If no other User response:
messages are shown, examine the GPFS log files in Verify that the reported node is reachable. Examine
the /var/adm/ras directory on each node. previous error messages, if any. Fix the problems and
then reissue the command.
6027-1642 command: Starting GPFS ...
6027-1661 Failed while processing disk
Explanation:
descriptor descriptor on node
Progress information for the mmstartup command.
nodeName.
User response:
Explanation:
None. Informational message only.
A disk descriptor was found to be unsatisfactory in
6027-1643 The number of quorum nodes some way.
exceeds the maximum (number)
User response:
allowed.
Check the preceding messages, if any, and correct
Explanation: the condition that caused the disk descriptor to be
An attempt was made to add more quorum nodes to a rejected.
cluster than the maximum number allowed.
6027-1662 Disk device deviceName refers to
User response: an existing NSD name
Reduce the number of quorum nodes, and reissue the
Explanation:
command.
The specified disk device refers to an existing NSD.
6027-1644 Attention: The number of quorum
User response:
nodes exceeds the suggested
Specify another disk that is not an existing NSD.
maximum (number).
6027-1663 Disk descriptor descriptor should
Explanation:
refer to an existing NSD. Use
The number of quorum nodes in the cluster exceeds
mmcrnsd to create the NSD.
the maximum suggested number of quorum nodes.
Explanation:
User response:
An NSD disk given as input is not known to GPFS.
Informational message. Consider reducing the number
of quorum nodes to the maximum suggested number User response:
of quorum nodes for improved performance. Create the NSD. Then rerun the command.
6027-1645 Node nodeName is fenced out from 6027-1664 command: Processing node
disk diskName. nodeName
Explanation: Explanation:
A GPFS command attempted to access the specified Progress information.
disk, but found that the node attempting the operation
User response:
was fenced out from the disk.
None. Informational message only.
User response:
6027-1665 Issue the command from a node Specify a disk whose type is recognized by GPFS.
that remains in the cluster. 6027-1680 Disk name diskName is already
Explanation: registered for use by GPFS.
The nature of the requested change requires the Explanation:
command be issued from a node that will remain in The cited disk name was specified for use by GPFS,
the cluster.
but there is already a disk by that name registered for
User response: use by GPFS.
Run the command from a node that will remain in the User response:
cluster. Specify a different disk name for use by GPFS and
6027-1666 [I] No disks were found. reissue the command.
Explanation: 6027-1681 Node nodeName is being used as
A command searched for disks but found none. an NSD server.
User response: Explanation:
If disks are desired, create some using the mmcrnsd The specified node is defined as a server node for
command. some disk.
6027-1670 Incorrect or missing remote shell User response:
command: name If you are trying to delete the node from the GPFS
cluster, you must either delete the disk or define
Explanation: another node as its server.
The specified remote command does not exist or is not
executable. 6027-1685 Processing continues without lock
protection.
User response:
Specify a valid command. Explanation:
The command will continue processing although it was
6027-1671 Incorrect or missing remote file not able to obtain the lock that prevents other GPFS
copy command: name commands from running simultaneously.
Explanation: User response:
The specified remote command does not exist or is not Ensure that no other GPFS command is running. See
executable. the command documentation for additional details.
User response: 6027-1688 Command was unable to obtain
Specify a valid command.
the lock for the GPFS system
6027-1672 option value parameter must be an data. Unable to reach the holder
absolute path name. of the lock nodeName. Check
the preceding messages, if any.
Explanation:
Follow the procedure outlined in
The mount point does not begin with '/'.
the GPFS: Problem Determination
User response: Guide.
Specify the full path for the mount point. Explanation:
6027-1674 command: Unmounting file A command requires the lock for the GPFS system
systems ... data but was not able to obtain it.
Explanation: User response:
This message contains progress information about the Check the preceding messages, if any. Follow
mmumount command. the procedure in the IBM Storage Scale: Problem
Determination Guide for what to do when the GPFS
User response: system data is locked. Then reissue the command.
None. Informational message only.
6027-1689 vpath disk diskName is not
6027-1677 Disk diskName is of an unknown recognized as an IBM SDD device.
type.
Explanation:
Explanation: The mmvsdhelper command found that the specified
The specified disk is of an unknown type. disk is a vpath disk, but it is not recognized as an IBM
User response: SDD device.
User response: The mmspsecserver process has created all the
Ensure the disk is configured as an IBM SDD device. service threads necessary for mmfsd.
Then reissue the command.
User response:
6027-1699 Remount failed for file system None. Informational message only.
fileSystem. Error code errorCode.
6027-1705 command: incorrect number of
Explanation: connections (number), exiting...
The specified file system was internally unmounted.
Explanation:
An attempt to remount the file system failed with the
The mmspsecserver process was called with an
specified error code.
incorrect number of connections. This will happen
User response: only when the mmspsecserver process is run as an
Check the daemon log for additional error messages. independent program.
Ensure that all file system disks are available and
User response:
reissue the mount command.
Retry with a valid number of connections.
6027-1700 Failed to load LAPI library.
6027-1706 mmspsecserver: parent program is
functionName not found. Changing
not "mmfsd", exiting...
communication protocol to TCP.
Explanation:
Explanation:
The mmspsecserver process was invoked from a
The GPFS daemon failed to load liblapi_r.a
program other than mmfsd.
dynamically.
User response:
User response:
None. Informational message only.
Verify installation of liblapi_r.a.
6027-1707 mmfsd connected to
6027-1701 mmfsd waiting to connect to
mmspsecserver
mmspsecserver. Setting up to
retry every number seconds for Explanation:
number minutes. The mmfsd daemon has successfully connected to the
mmspsecserver process through the communication
Explanation:
socket.
The GPFS daemon failed to establish a connection
with the mmspsecserver process. User response:
None. Informational message only.
User response:
None. Informational message only. 6027-1708 The mmfsd daemon failed to fork
mmspsecserver. Failure reason
6027-1702 Process pid failed at functionName
explanation
call, socket socketName, errno
value Explanation:
The mmfsd daemon failed to fork a child process.
Explanation:
Either the mmfsd daemon or the mmspsecserver User response:
process failed to create or set up the communication Check the GPFS installation.
socket between them.
6027-1709 [I] Accepted and connected to
User response: ipAddress
Determine the reason for the error.
Explanation:
6027-1703 The processName process The local mmfsd daemon has successfully accepted
encountered error: errorString. and connected to a remote daemon.
Explanation: User response:
Either the mmfsd daemon or the mmspsecserver None. Informational message only.
process called the error log routine to log an incident.
6027-1710 [N] Connecting to ipAddress
User response:
Explanation:
None. Informational message only.
The local mmfsd daemon has started a connection
6027-1704 mmspsecserver (pid number) request to a remote daemon.
ready for service.
User response:
Explanation: None. Informational message only.
6027-1711 [I] Connected to ipAddress User response:
Contact the administrator to obtain the new key and
Explanation: register it using mmremotecluster update.
The local mmfsd daemon has successfully connected
to a remote daemon. 6027-1725 [E] The key used by the contact
node named contactNodeName
User response: has changed. Contact the
None. Informational message only. administrator to obtain the new
6027-1712 Unexpected zero bytes received key and register it using mmauth
from name. Continuing. update.
Explanation: Explanation:
This is an informational message. A socket read The administrator of the cluster has changed the key
resulted in zero bytes being read. used for authentication.
User response: User response:
If this happens frequently, check IP connections. Contact the administrator to obtain the new key and
register it using mmauth update.
6027-1715 EINVAL trap from connect call to
ipAddress (socket name) 6027-1726 [E] The administrator of the cluster
named clusterName requires
Explanation: authentication. Contact the
The connect call back to the requesting node failed. administrator to obtain the
User response: clusters key and register the key
This is caused by a bug in AIX socket support. Upgrade using "mmremotecluster update".
AIX kernel and TCP client support.
Explanation:
6027-1716 [N] Close connection to ipAddress The administrator of the cluster requires
authentication.
Explanation:
Connection socket closed. User response:
Contact the administrator to obtain the cluster's key
User response: and register it using: mmremotecluster update.
None. Informational message only.
6027-1727 [E] The administrator of the
6027-1717 [E] Error initializing the configuration cluster named clusterName
server, err value does not require authentication.
Explanation: Unregister the clusters key using
The configuration server module could not be "mmremotecluster update".
initialized due to lack of system resources. Explanation:
User response: The administrator of the cluster does not require
Check system memory. authentication.
6027-1718 [E] Could not run command name, err User response:
value Unregister the clusters key using: mmremotecluster
update.
Explanation:
The GPFS daemon failed to run the specified 6027-1728 [E] Remote mounts are not
command. enabled within the cluster
named clusterName. Contact the
User response:
administrator and request that
Verify correct installation.
they enable remote mounts.
6027-1724 [E] The key used by the cluster named Explanation:
clusterName has changed. Contact The administrator of the cluster has not enabled
the administrator to obtain the remote mounts.
new key and register it using
"mmremotecluster update". User response:
Contact the administrator and request remote mount
Explanation: access.
The administrator of the cluster has changed the key
used for authentication. 6027-1729 [E] The cluster named clusterName
has not authorized this cluster to
Explanation:
The administrator of the cluster has not authorized this cluster to mount file systems.
User response:
Contact the administrator and request access.

6027-1730 [E] Unsupported cipherList cipherList requested.
Explanation:
The target cluster requested a cipherList not supported by the installed version of OpenSSL.
User response:
Install a version of OpenSSL that supports the required cipherList or contact the administrator of the target cluster and request that a supported cipherList be assigned to this remote cluster.

6027-1731 [E] Unsupported cipherList cipherList requested.
Explanation:
The target cluster requested a cipherList that is not supported by the installed version of OpenSSL.
User response:
Either install a version of OpenSSL that supports the required cipherList or contact the administrator of the target cluster and request that a supported cipherList be assigned to this remote cluster.

6027-1732 [X] Remote mounts are not enabled within this cluster.
Explanation:
Remote mounts cannot be performed in this cluster.
User response:
See the IBM Storage Scale: Administration Guide for instructions about enabling remote mounts. In particular, make sure the keys have been generated and a cipherlist has been set.
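One hedged example of the enablement steps this message points to, run on the cluster that owns the file system (the cluster name, key file path, device name, and the AUTHONLY cipher value are placeholders; follow the full procedure in the Administration Guide for your release):

   mmauth genkey new                                                  # generate the cluster key pair
   mmauth update . -l AUTHONLY                                        # set a cipher list for the local cluster
   mmauth add accessingCluster.example.com -k /tmp/accessingCluster_id_rsa.pub
   mmauth grant accessingCluster.example.com -f /dev/fs1              # authorize the remote cluster to mount fs1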
6027-1733 OpenSSL dynamic lock support could not be loaded.
Explanation:
One of the functions required for dynamic lock support was not included in the version of the OpenSSL library that GPFS is configured to use.
User response:
If this functionality is required, shut down the daemon, install a version of OpenSSL with the desired functionality, and configure GPFS to use it. Then restart the daemon.

6027-1734 [E] OpenSSL engine support could not be loaded.
Explanation:
One of the functions required for engine support was not included in the version of the OpenSSL library that GPFS is configured to use.
User response:
If this functionality is required, shut down the daemon, install a version of OpenSSL with the desired functionality, and configure GPFS to use it. Then restart the daemon.

6027-1735 [E] Close connection to ipAddress. Attempting reconnect.
Explanation:
Connection socket closed. The GPFS daemon will attempt to reestablish the connection.
User response:
None. Informational message only.

6027-1736 [N] Reconnected to ipAddress
Explanation:
The local mmfsd daemon has successfully reconnected to a remote daemon following an unexpected connection break.
User response:
None. Informational message only.

6027-1737 [N] Close connection to ipAddress (errorString).
Explanation:
Connection socket closed.
User response:
None. Informational message only.

6027-1738 [E] Close connection to ipAddress (errorString). Attempting reconnect.
Explanation:
Connection socket closed.
User response:
None. Informational message only.

6027-1739 [X] Accept socket connection failed: err value.
Explanation:
The Accept socket connection received an unexpected error.
User response:
None. Informational message only.

6027-1740 [E] Timed out waiting for a reply from node ipAddress.
Explanation:
A message that was sent to the specified node did not receive a response within the expected time limit.

User response:
None. Informational message only.

6027-1741 [E] Error code value received from node ipAddress.
Explanation:
When a message was sent to the specified node to check its status, an error occurred and the node could not handle the message.
User response:
None. Informational message only.

6027-1742 [E] Message ID value was lost by node ipAddress.
Explanation:
During a periodic check of outstanding messages, a problem was detected where the destination node no longer has any knowledge of a particular message.
User response:
None. Informational message only.

6027-1743 [W] Failed to load GSKit library path: (dlerror) errorMessage
Explanation:
The GPFS daemon could not load the library required to secure the node-to-node communications.
User response:
Verify that the gpfs.gskit package was properly installed.

6027-1744 [I] GSKit library loaded and initialized.
Explanation:
The GPFS daemon successfully loaded the library required to secure the node-to-node communications.
User response:
None. Informational message only.

6027-1745 [E] Unable to resolve symbol for routine: functionName (dlerror) errorMessage
Explanation:
An error occurred while resolving a symbol required for transport-level security.
User response:
Verify that the gpfs.gskit package was properly installed.

6027-1746 [E] Failed to load or initialize GSKit library: error value
Explanation:
An error occurred during the initialization of the transport-security code.
User response:
Verify that the gpfs.gskit package was properly installed.
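As a quick sanity check before deeper debugging, you can confirm that the GSKit package is actually present on the affected node (the commands are illustrative; package names can vary by platform and release):

   rpm -qa | grep gpfs.gskit       # Linux
   lslpp -l | grep gpfs.gskit      # AIX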
6027-1747 [W] The TLS handshake with node ipAddress failed with error value (handshakeType).
Explanation:
An error occurred while trying to establish a secure connection with another GPFS node.
User response:
Examine the error messages to obtain information about the error. Under normal circumstances, the retry logic will ensure that the connection is re-established. If this error persists, record the error code and contact the IBM Support Center.

6027-1748 [W] A secure receive from node ipAddress failed with error value.
Explanation:
An error occurred while receiving an encrypted message from another GPFS node.
User response:
Examine the error messages to obtain information about the error. Under normal circumstances, the retry logic will ensure that the connection is re-established and the message is received. If this error persists, record the error code and contact the IBM Support Center.

6027-1749 [W] A secure send to node ipAddress failed with error value.
Explanation:
An error occurred while sending an encrypted message to another GPFS node.
User response:
Examine the error messages to obtain information about the error. Under normal circumstances, the retry logic will ensure that the connection is re-established and the message is sent. If this error persists, record the error code and contact the IBM Support Center.

6027-1750 [N] The handshakeType TLS handshake with node ipAddress was cancelled: connection reset by peer (return code value).
Explanation:
A secure connection could not be established because the remote GPFS node closed the connection.
User response:
None. Informational message only.

6027-1751 [N] A secure send to node ipAddress was cancelled: connection reset by peer (return code value).
Explanation:

Securely sending a message failed because the remote GPFS node closed the connection.
User response:
None. Informational message only.

6027-1752 [N] A secure receive from node ipAddress was cancelled: Connection reset by peer (return code value).
Explanation:
Securely receiving a message failed because the remote GPFS node closed the connection.
User response:
None.

6027-1753 [E] The crypto library with FIPS support is not available for this architecture. Disable FIPS mode and reattempt the operation.
Explanation:
GPFS is operating in FIPS mode, but the initialization of the cryptographic library failed because FIPS mode is not yet supported on this architecture.
User response:
Disable FIPS mode and attempt the operation again.

6027-1754 [E] Failed to initialize the crypto library in FIPS mode. Ensure that the crypto library package was correctly installed.
Explanation:
GPFS is operating in FIPS mode, but the initialization of the cryptographic library failed.
User response:
Ensure that the packages required for encryption are properly installed on each node in the cluster.

6027-1755 [W] The certificate for 'canonicalName' is expired. Validity period is from begDate to endDate.
Explanation:
The validity period of the certificate used by a remote node is expired.
User response:
Contact the administrator of the remote cluster and instruct them to use the mmauth command to generate a new certificate.

6027-1756 [E] The TCP connection to IP address ipAddress (socket socketNum) state is unexpected: ca_state=caStateValue unacked=unackedCount rto=rtoValue.
Explanation:
An unexpected TCP socket state has been detected, which may lead to data no longer flowing over the connection. The socket state includes a nonzero tcpi_ca_state value, a larger than default retransmission timeout (tcpi_rto) and a nonzero number of currently unacknowledged segments (tcpi_unacked), or a larger than default tcpi_unacked value. All these cases indicate an unexpected TCP socket state, possibly triggered by an outage in the network.
User response:
Check network connectivity and whether network packets may have been lost or delayed. Check network interface packet drop statistics.
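A hedged illustration of how those checks might be run on a Linux node (the peer address and interface name are placeholders):

   ss -ti dst ipAddress         # kernel TCP information for connections to the peer, including retransmission timers
   ip -s link show eth0         # interface packet, error, and drop counters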
6027-1757 [E] The TLS handshake with node ipAddress failed with error value: Certificate Signature Algorithm is not supported by the SSL/TLS Handshake (handshakeType).
Explanation:
A secure connection could not be established because the signature algorithm of one of the certificates used in the TLS handshake was not supported.
User response:
Use the mmauth command to generate a new certificate.

6027-1758 [W] The TCP connection to IP address ipAddress (socket socketNum) state is unexpected: state=stateValue ca_state=caStateValue snd_cwnd=cwndValue snd_ssthresh= ssthreshValu unacked=unackedCount probes=probesValue backoff=backoffValue retransmits=retransmitsValue rto=rtoValue rcv_ssthresh=rcv_ssthreshValue rtt=rttValue rttvar=rttvarValue sacked=sackedValue retrans="retransValue reordering=reorderingValue lost=lostValue
Explanation:
An unexpected TCP socket state has been detected, which may lead to data no longer flowing over the connection. The socket state includes a nonzero tcpi_ca_state value, a larger than default retransmission timeout (tcpi_rto) and the other important tcp_info fields. All these cases indicate an unexpected TCP socket state, possibly triggered by an outage in the network.
User response:

Check network connectivity and whether network packets may have been lost or delayed. Check network interface packet drop statistics.

6027-1759 [I] The TCP connection to IP address ipAddress (socket socketNum) state is unexpected: state=stateValue ca_state=caStateValue snd_cwnd=cwndValue snd_ssthresh= ssthreshValu unacked=unackedCount probes=probesValue backoff=backoffValue retransmits=retransmitsValue rto=rtoValue rcv_ssthresh=rcv_ssthreshValue rtt=rttValue rttvar=rttvarValue sacked=sackedValue retrans="retransValue reordering=reorderingValue lost=lostValue
Explanation:
A TCP socket state has been displayed because of some internal checks. The socket state includes most of the important tcp_info fields.
User response:
If the corresponding node gets expelled, examine whether tcpi_ca_state has a nonzero value and whether the retransmission timeout (tcpi_rto) has a high value. These may indicate possible network problems.

6027-1760 [E] Bad TCP state detected. Initiating proactive reconnect to node ipAddress.
Explanation:
Bad TCP state detected on the connection to the given node. Data appears not to be flowing through the connection, which may result in the node being expelled soon. The GPFS daemon will attempt to close the connection and then re-establish it.
User response:
No response needed if this message occurs sporadically, but if the message occurs often then it may indicate a potential problem in the network infrastructure.

6027-1761 [W] The number of sockets (nMonitoredSocks) monitored by a receiver thread exceeds the threshold (nSocksThreshold). Set maxReceiverThreads parameter to increase the number of receiver threads. (Max:nMaxReceivers).
Explanation:
The number of sockets monitored by a receiver thread exceeds the threshold, which might be indicating that the receiver thread is overloaded.
User response:
Run the mmchconfig maxReceiverThreads=nMaxReceivers command to increase the number of receiver threads.
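For example (a sketch only; nMaxReceivers stands for the maximum value reported in the message):

   mmchconfig maxReceiverThreads=nMaxReceivers
   mmlsconfig maxReceiverThreads      # assumed way to confirm the new setting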
6027-1803 [E] Global NSD disk, name, not found.
Explanation:
A client tried to open a globally-attached NSD disk, but a scan of all disks failed to find that NSD.
User response:
Ensure that the globally-attached disk is available on every node that references it.

6027-1804 [E] I/O to NSD disk, name, fails. No such NSD locally found.
Explanation:
A server tried to perform I/O on an NSD disk, but a scan of all disks failed to find that NSD.
User response:
Make sure that the NSD disk is accessible to the client. If necessary, break a reservation.

6027-1805 [N] Rediscovered nsd server access to name.
Explanation:
A server rediscovered access to the specified disk.
User response:
None.

6027-1806 [X] A Persistent Reserve could not be established on device name (deviceName): errorLine.
Explanation:
GPFS is using Persistent Reserve on this disk, but was unable to establish a reserve for this node.
User response:
Perform disk diagnostics.

6027-1807 [E] NSD nsdName is using Persistent Reserve, this will require an NSD server on an osName node.
Explanation:
A client tried to open a globally-attached NSD disk, but the disk is using Persistent Reserve. An osName NSD server is needed. GPFS only supports Persistent Reserve on certain operating systems.
User response:
Use the mmchnsd command to add an osName NSD server for the NSD.
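A possible form of that change, using a stanza file (the NSD and server names are placeholders, and the stanza syntax should be checked against the mmchnsd documentation for your release):

   # nsd.stanza
   %nsd: nsd=nsdName servers=server1,server2

   mmchnsd -F nsd.stanza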
6027-1808 [A] Unable to reserve space for NSD buffers. Increase pagepool size to at least requiredPagePoolSize MB. Refer to the IBM Storage Scale: Administration Guide for more information on selecting an appropriate pagepool size.

Explanation:
The pagepool usage for an NSD buffer (4*maxblocksize) is limited by factor nsdBufSpace. The value of nsdBufSpace can be in the range of 10-70. The default value is 30.
User response:
Use the mmchconfig command to decrease the value of maxblocksize or to increase the value of pagepool or nsdBufSpace.
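For instance, either of the following adjustments could be made on the affected NSD server (the values and node name are placeholders, and a pagepool change may not take effect until the GPFS daemon on that node is restarted):

   mmchconfig pagepool=8G -N nsdServer1
   mmchconfig nsdBufSpace=40 -N nsdServer1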
6027-1809 [E] The defined server serverName for NSD NsdName couldn't be resolved.
Explanation:
The host name of the NSD server could not be resolved by gethostbyName().
User response:
Fix the host name resolution.

6027-1810 [I] Vdisk server recovery: delay number sec. for safe recovery.
Explanation:
Wait for the existing disk lease to expire before performing vdisk server recovery.
User response:
None.

6027-1811 [I] Vdisk server recovery: delay complete.
Explanation:
Done waiting for existing disk lease to expire before performing vdisk server recovery.
User response:
None.

6027-1812 [E] Rediscovery failed for name.
Explanation:
A server failed to rediscover access to the specified disk.
User response:
Check the disk access issues and run the command again.

6027-1813 [A] Error reading volume identifier (for objectName name) from configuration file.
Explanation:
The volume identifier for the named recovery group or vdisk could not be read from the mmsdrfs file. This should never occur.
User response:
Check for damage to the mmsdrfs file.

6027-1814 [E] Vdisk vdiskName cannot be associated with its recovery group recoveryGroupName. This vdisk will be ignored.
Explanation:
The named vdisk cannot be associated with its recovery group.
User response:
Check for damage to the mmsdrfs file.

6027-1815 [A] Error reading volume identifier (for NSD name) from configuration file.
Explanation:
The volume identifier for the named NSD could not be read from the mmsdrfs file. This should never occur.
User response:
Check for damage to the mmsdrfs file.

6027-1816 [E] The defined server serverName for recovery group recoveryGroupName could not be resolved.
Explanation:
The hostname of the NSD server could not be resolved by gethostbyName().
User response:
Fix hostname resolution.

6027-1817 [E] Vdisks are defined, but no recovery groups are defined.
Explanation:
There are vdisks defined in the mmsdrfs file, but no recovery groups are defined. This should never occur.
User response:
Check for damage to the mmsdrfs file.

6027-1818 [I] Relinquished recovery group recoveryGroupName (err errorCode).
Explanation:
This node has relinquished serving the named recovery group.
User response:
None.

6027-1819 Disk descriptor for name refers to an existing pdisk.
Explanation:
The mmcrrecoverygroup command or mmaddpdisk command found an existing pdisk.
User response:
Correct the input file, or use the -v option.

6027-1820 Disk descriptor for name refers to an existing NSD.
Explanation:
The mmcrrecoverygroup command or mmaddpdisk command found an existing NSD.
User response:
Correct the input file, or use the -v option.

6027-1821 Error errno writing disk descriptor on name.
Explanation:
The mmcrrecoverygroup command or mmaddpdisk command got an error writing the disk descriptor.
User response:
Perform disk diagnostics.

6027-1822 Error errno reading disk descriptor on name.
Explanation:
The tspreparedpdisk command got an error reading the disk descriptor.
User response:
Perform disk diagnostics.

6027-1823 Path error, name and name are the same disk.
Explanation:
The tspreparedpdisk command got an error during path verification. The pdisk descriptor file is miscoded.
User response:
Correct the pdisk descriptor file and reissue the command.

6027-1824 [X] An unexpected Device Mapper path dmDevice (nsdId) has been detected. The new path does not have a Persistent Reserve set up. Server disk diskName will be put offline
Explanation:
A new device mapper path is detected or a previously failed path is activated after the local device discovery has finished. This path lacks a Persistent Reserve, and cannot be used. All device paths must be active at mount time.
User response:
Check the paths to all disks making up the file system. Repair any paths to disks which have failed. Rediscover the paths for the NSD.

6027-1825 [A] Unrecoverable NSD checksum error on I/O to NSD disk nsdName, using server serverName. Exceeds retry limit number.
Explanation:
The allowed number of retries was exceeded when encountering an NSD checksum error on I/O to the indicated disk, using the indicated server.
User response:
There may be network issues that require investigation.

6027-1826 [W] The host name of the server serverName that is defined for NSD local cache NsdName could not be resolved.
Explanation:
The host name of NSD server could not be resolved by gethostbyName().
User response:
Fix host name resolution.

6027-1827 [W] A path (uuid: DiskUniqueID MajorMinor: MajorMinorNumber) is reinstated or newly added for NSD nsdName. Run tsprreadkeys command to check whether all paths have the PR key for this disk.
Explanation:
A new device mapper path has been detected, or a previously failed path has been reactivated again.
User response:
Check all the paths to this NSD and remount the file system or restart GPFS on this node if a Persistent Reserve(PR) key is absent for any path.

6027-1828 [E] NSD disk NsdName cannot be locally found.
Explanation:
The NSD server cannot locate the local device for this NSD.
User response:
Check the local device mapping for the NSD, make sure the local device is accessible.

6027-1829 [E] No RDMA connection to NSD server node nodeName available for GPU Direct Storage I/O.
Explanation:
For processing a GPU Direct Storage I/O request an RDMA connection to the specified NSD server node is required.
User response:
Verify the RDMA configuration and make sure that an RDMA connection to the specified NSD server node can be established. Use mmdiag --network to display all currently established RDMA connections.
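For example, to confirm whether an RDMA connection to that server currently exists (the grep filter is only illustrative):

   mmdiag --network | grep -i rdma
   mmlsconfig verbsRdma               # assumed attribute name; checks whether RDMA is enabled in the configuration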

6027-1830 [E] GPU Direct Storage READ not supported on NSD server.
Explanation:
GDS read commands are not supported on a GPFS NSD server.
User response:
None.

6027-1831 [W] Cannot become an NSD server for NSD NsdName because the dynamic pagepool is enabled.
Explanation:
Dynamic pagepool is not supported on NSD servers.
User response:
Restart the mmfsd daemon. When initialization finds that it is an NSD server, dynamic page pool will not be enabled.

6027-1832 Unable to perform an online NSD server change for NsdName because the dynamic pagepool is enabled on the node nodeName.
Explanation:
The dynamic pagepool is not supported on NSD servers.
User response:
Restart the mmfsd daemon. When initialization finds that it is an NSD server, the dynamic pagepool will be enabled.

6027-1900 Failed to stat pathName.
Explanation:
A stat() call failed for the specified object.
User response:
Correct the problem and reissue the command.

6027-1901 pathName is not a GPFS file system object.
Explanation:
The specified path name does not resolve to an object within a mounted GPFS file system.
User response:
Correct the problem and reissue the command.

6027-1902 The policy file cannot be determined.
Explanation:
The command was not able to retrieve the policy rules associated with the file system.
User response:
Examine the preceding messages and correct the reported problems. Establish a valid policy file with the mmchpolicy command or specify a valid policy file on the command line.
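A brief illustration of re-establishing a valid policy, as the user response suggests (the file system and policy file names are placeholders):

   mmchpolicy fs1 /tmp/policy.rules     # install the policy rules for the file system
   mmlspolicy fs1 -L                    # display the policy now in effect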
6027-1903 path must be an absolute path name.
Explanation:
The path name did not begin with a /.
User response:
Specify the absolute path name for the object.

6027-1904 Device with major/minor numbers number and number already exists.
Explanation:
A device with the cited major and minor numbers already exists.
User response:
Check the preceding messages for detailed information.

6027-1905 name was not created by GPFS or could not be refreshed.
Explanation:
The attributes (device type, major/minor number) of the specified file system device name are not as expected.
User response:
Check the preceding messages for detailed information on the current and expected values. These errors are most frequently caused by the presence of /dev entries that were created outside the GPFS environment. Resolve the conflict by renaming or deleting the offending entries. Reissue the command letting GPFS create the /dev entry with the appropriate parameters.

6027-1906 There is no file system with drive letter driveLetter.
Explanation:
No file system in the GPFS cluster has the specified drive letter.
User response:
Reissue the command with a valid file system.

6027-1908 The option option is not allowed for remote file systems.
Explanation:
The specified option can be used only for locally-owned file systems.
User response:
Correct the command line and reissue the command.

6027-1909 There are no available free disks. Disks must be prepared prior to invoking command. Define the disks using the command command.

Explanation:
The currently executing command (mmcrfs, mmadddisk, mmrpldisk) requires disks to be defined for use by GPFS using one of the GPFS disk creation commands: mmcrnsd, mmcrvsd.
User response:
Create disks and reissue the failing command.

6027-1910 Node nodeName is not a quorum node.
Explanation:
The mmchmgr command was asked to move the cluster manager to a nonquorum node. Only one of the quorum nodes can be a cluster manager.
User response:
Designate the node to be a quorum node, specify a different node on the command line, or allow GPFS to choose the new cluster manager node.

6027-1911 File system fileSystem belongs to cluster clusterName. The option option is not allowed for remote file systems.
Explanation:
The specified option can be used only for locally-owned file systems.
User response:
Correct the command line and reissue the command.

6027-1927 The requested disks are not known to GPFS.
Explanation:
GPFS could not find the requested NSDs in the cluster.
User response:
Reissue the command, specifying known disks.

6027-1929 cipherlist is not a valid cipher list.
Explanation:
The cipher list must be set to a value supported by GPFS. All nodes in the cluster must support a common cipher.
User response:
Use mmauth show ciphers to display a list of the supported ciphers.
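For example (the chosen value is illustrative; AUTHONLY is one commonly supported setting):

   mmauth show ciphers              # list the ciphers this installation supports
   mmauth update . -l AUTHONLY      # then set a supported value for the local cluster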
6027-1930 Disk diskName belongs to file system fileSystem.
Explanation:
A GPFS administration command (mm...) found that the requested disk to be deleted still belongs to a file system.
User response:
Check that the correct disk was requested. If so, delete the disk from the file system before proceeding.

6027-1931 The following disks are not known to GPFS: diskNames.
Explanation:
A GPFS administration command (mm...) found that the specified disks are not known to GPFS.
User response:
Verify that the correct disks were requested.

6027-1932 No disks were specified that could be deleted.
Explanation:
A GPFS administration command (mm...) determined that no disks were specified that could be deleted.
User response:
Examine the preceding messages, correct the problems, and reissue the command.

6027-1933 Disk diskName has been removed from the GPFS cluster configuration data but the NSD volume id was not erased from the disk. To remove the NSD volume id, issue mmdelnsd -p NSDvolumeid.
Explanation:
A GPFS administration command (mm...) successfully removed the specified disk from the GPFS cluster configuration data, but was unable to erase the NSD volume id from the disk.
User response:
Issue the specified command to remove the NSD volume id from the disk.

6027-1934 Disk diskName has been removed from the GPFS cluster configuration data but the NSD volume id was not erased from the disk. To remove the NSD volume id, issue: mmdelnsd -p NSDvolumeid -N nodeList.
Explanation:
A GPFS administration command (mm...) successfully removed the specified disk from the GPFS cluster configuration data but was unable to erase the NSD volume id from the disk.
User response:
Issue the specified command to remove the NSD volume id from the disk.
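A sketch of the cleanup named in messages 6027-1933 and 6027-1934 (the volume id and node list are placeholders taken from the message text):

   mmdelnsd -p NSDvolumeid
   mmdelnsd -p NSDvolumeid -N node1,node2     # when a specific node list must be used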
6027-1936 Node nodeName cannot support Persistent Reserve on disk diskName because it is not an AIX node. The disk will be used as a non-PR disk.
Explanation:

A non-AIX node was specified as an NSD server for the Explanation:
disk. The disk will be used as a non-PR disk. A GPFS administration command (mm...) received
unexpected output from the host -t a command for
User response:
the given host.
None. Informational message only.
User response:
6027-1937 A node was specified more than
Issue the host -t a command interactively and
once as an NSD server in disk
carefully review the output, as well as any error
descriptor descriptor.
messages.
Explanation:
6027-1943 Host name not found.
A node was specified more than once as an NSD server
in the disk descriptor shown. Explanation:
A GPFS administration command (mm...) could not
User response:
resolve a host from /etc/hosts or by using the host
Change the disk descriptor to eliminate any
command.
redundancies in the list of NSD servers.
User response:
6027-1938 configParameter is an incorrect
Make corrections to /etc/hosts and reissue the
parameter. Line in error:
command.
configLine. The line is ignored;
processing continues. 6027-1945 Disk name diskName is not
allowed. Names beginning with
Explanation:
gpfs are reserved for use by GPFS.
The specified parameter is not valid and will be
ignored. Explanation:
The cited disk name is not allowed because it begins
User response:
with gpfs.
None. Informational message only.
User response:
6027-1939 Line in error: line.
Specify a disk name that does not begin with gpfs and
Explanation: reissue the command.
The specified line from a user-provided input file
6027-1947 Use mmauth genkey to recover the
contains errors.
file fileName, or to generate and
User response: commit a new key.
Check the preceding messages for more information.
Explanation:
Correct the problems and reissue the command.
The specified file was not found.
6027-1940 Unable to set reserve policy
User response:
policy on disk diskName on node
Recover the file, or generate a new key by running:
nodeName.
mmauth genkey propagate or generate a new key
Explanation: by running mmauth genkey new, followed by the
The specified disk should be able to support Persistent mmauth genkey commit command.
Reserve, but an attempt to set up the registration key
6027-1948 Disk diskName is too large.
failed.
Explanation:
User response:
The specified disk is too large.
Correct the problem and reissue the command.
User response:
6027-1941 Cannot handle multiple interfaces
Specify a smaller disk and reissue the command.
for host hostName.
6027-1949 Propagating the cluster
Explanation:
configuration data to all affected
Multiple entries were found for the given hostname
nodes.
or IP address either in /etc/hosts or by the host
command. Explanation:
The cluster configuration data is being sent to the rest
User response:
of the nodes in the cluster.
Make corrections to /etc/hosts and reissue the
command. User response:
This is an informational message.
6027-1942 Unexpected output from the 'host
-t a name' command: 6027-1950 Local update lock is busy.

Explanation: The specified disk cannot be initialized for use as a
More than one process is attempting to update the tiebreaker disk. Possible reasons are suggested in the
GPFS environment at the same time. message text.
User response: User response:
Repeat the command. If the problem persists, verify Use the mmlsfs and mmlsdisk commands to
that there are no blocked processes. determine what action is needed to correct the
problem.
6027-1951 Failed to obtain the local
environment update lock. 6027-1968 Failed while processing disk
diskName.
Explanation:
GPFS was unable to obtain the local environment Explanation:
update lock for more than 30 seconds. An error was detected while processing the specified
disk.
User response:
Examine previous error messages, if any. Correct any User response:
problems and reissue the command. If the problem Examine prior messages to determine the reason
persists, perform problem determination and contact for the failure. Correct the problem and reissue the
the IBM Support Center. command.
6027-1962 Permission denied for disk 6027-1969 Device device already exists on
diskName node nodeName
Explanation: Explanation:
The user does not have permission to access disk This device already exists on the specified node.
diskName.
User response:
User response: None.
Correct the permissions and reissue the command.
6027-1970 Disk diskName has no space
6027-1963 Disk diskName was not found. for the quorum data structures.
Specify a different disk as
Explanation:
tiebreaker disk.
The specified disk was not found.
Explanation:
User response:
There is not enough free space in the file system
Specify an existing disk and reissue the command.
descriptor for the tiebreaker disk data structures.
6027-1964 I/O error on diskName
User response:
Explanation: Specify a different disk as a tiebreaker disk.
An I/O error occurred on the specified disk.
6027-1974 None of the quorum nodes can be
User response: reached.
Check for additional error messages. Check the error
Explanation:
log for disk hardware problems.
Ensure that the quorum nodes in the cluster can be
6027-1965 [E] Disk diskName is too small. reached. At least one of these nodes is required for the
command to succeed.
Explanation:
The specified disk is too small. User response:
Ensure that the quorum nodes are available and
User response:
reissue the command.
Specify a disk larger than 128 MiB and reissue the
command. 6027-1975 The descriptor file contains more
than one descriptor.
6027-1967 Disk diskName belongs to back-
level file system fileSystem or the Explanation:
state of the disk is not ready. The descriptor file must contain only one descriptor.
Use mmchfs -V to convert the file
User response:
system to the latest format. Use
Correct the descriptor file.
mmchdisk to change the state of a
disk. 6027-1976 The descriptor file contains no
descriptor.
Explanation:
Explanation:

The descriptor file must contain only one descriptor. User response:
Ensure that the file system is mounted and reissue the
User response:
command.
Correct the descriptor file.
6027-1993 File fileName either does not exist
6027-1977 Failed validating disk diskName.
or has an incorrect format.
Error code errorCode.
Explanation:
Explanation:
The specified file does not exist or has an incorrect
GPFS control structures are not as expected.
format.
User response:
User response:
Contact the IBM Support Center.
Check whether the input file specified actually exists.
6027-1984 Name name is not allowed. It
6027-1994 Did not find any match with the
is longer than the maximum
input disk address.
allowable length (length).
Explanation:
Explanation:
The mmfileid command returned without finding any
The cited name is not allowed because it is longer than
disk addresses that match the given input.
the cited maximum allowable length.
User response:
User response:
None. Informational message only.
Specify a name whose length does not exceed the
maximum allowable length, and reissue the command. 6027-1995 Device deviceName is not mounted
on node nodeName.
6027-1985 mmfskxload: The format of the
GPFS kernel extension is not Explanation:
correct for this version of AIX. The specified device is not mounted on the specified
node.
Explanation:
This version of AIX is incompatible with the current User response:
format of the GPFS kernel extension. Mount the specified device on the specified node and
reissue the command.
User response:
Contact your system administrator to check the AIX 6027-1996 Command was unable to
version and GPFS kernel extension. determine whether file system
fileSystem is mounted.
6027-1986 junctionName does not resolve
to a directory in deviceName. Explanation:
The junction must be within the The command was unable to determine whether the
specified file system. cited file system is mounted.
Explanation: User response:
The cited junction path name does not belong to the Examine any prior error messages to determine why
specified file system. the command could not determine whether the file
system was mounted, resolve the problem if possible,
User response:
and then reissue the command. If you cannot resolve
Correct the junction path name and reissue the
the problem, reissue the command with the daemon
command.
down on all nodes of the cluster. This will ensure that
6027-1987 Name name is not allowed. the file system is not mounted, which may allow the
command to proceed.
Explanation:
The cited name is not allowed because it is a reserved 6027-1998 Line lineNumber of file fileName is
word or a prohibited character. incorrect:
User response: Explanation:
Specify a different name and reissue the command. A line in the specified file passed to the command had
incorrect syntax. The line with the incorrect syntax is
6027-1988 File system fileSystem is not
displayed next, followed by a description of the correct
mounted.
syntax for the line.
Explanation:
User response:
The cited file system is not currently mounted on this
Correct the syntax of the line and reissue the
node.
command.

6027-1999 Syntax error. The correct syntax is: 6027-2009 logicalVolume is not a valid logical
string. volume.
Explanation: Explanation:
The specified input passed to the command has logicalVolume does not exist in the ODM, implying that
incorrect syntax. logical name does not exist.
User response: User response:
Correct the syntax and reissue the command. Run the command on a valid logical volume.
6027-2000 Could not clear fencing for disk 6027-2010 vgName is not a valid volume
physicalDiskName. group name.
Explanation: Explanation:
The fencing information on the disk could not be vgName passed to the command is not found in the
cleared. ODM, implying that vgName does not exist.
User response: User response:
Make sure the disk is accessible by this node and retry. Run the command on a valid volume group name.
6027-2002 Disk physicalDiskName of type 6027-2011 For the hdisk specification -h
diskType is not supported for physicalDiskName to be valid
fencing. physicalDiskName must be the
only disk in the volume group.
Explanation:
However, volume group vgName
This disk is not a type that supports fencing.
contains disks.
User response:
Explanation:
None.
The hdisk specified belongs to a volume group that
6027-2004 None of the specified nodes belong contains other disks.
to this GPFS cluster.
User response:
Explanation: Pass an hdisk that belongs to a volume group that
The nodes specified do not belong to the GPFS cluster. contains only this disk.
User response: 6027-2012 physicalDiskName is not a valid
Choose nodes that belong to the cluster and try the physical volume name.
command again.
Explanation:
6027-2007 Unable to display fencing for disk The specified name is not a valid physical disk name.
physicalDiskName.
User response:
Explanation: Choose a correct physical disk name and retry the
Cannot retrieve fencing information for this disk. command.
User response: 6027-2013 pvid is not a valid physical volume
Make sure that this node has access to the disk before id.
retrying.
Explanation:
6027-2008 For the logical volume The specified value is not a valid physical volume ID.
specification -l lvName to be valid
User response:
lvName must be the only logical
Choose a correct physical volume ID and retry the
volume in the volume group.
command.
However, volume group vgName
contains logical volumes. 6027-2014 Node node does not have access to
disk physicalDiskName.
Explanation:
The command is being run on a logical volume that Explanation:
belongs to a volume group that has more than one The specified node is not able to access the specified
logical volume. disk.
User response: User response:
Run this command only on a logical volume where it Choose a different node or disk (or both), and retry the
is the only logical volume in the corresponding volume command. If both the node and disk name are correct,
group. make sure that the node has access to the disk.

6027-2015 Node node does not hold Make sure the disk is accessible by this node before
a reservation for disk retrying.
physicalDiskName. 6027-2022 Could not open disk
Explanation: physicalDiskName, errno value.
The node on which this command is run does not have Explanation:
access to the disk.
The specified disk cannot be opened.
User response: User response:
Run this command from another node that has access Examine the errno value and other messages to
to the disk. determine the reason for the failure. Correct the
6027-2016 SSA fencing support is not present problem and reissue the command.
on this node. 6027-2023 retVal = value, errno = value for
Explanation: key value.
This node does not support SSA fencing. Explanation:
User response: An ioctl call failed with stated return code, errno
None. value, and related values.
6027-2017 Node ID nodeId is not a valid SSA User response:
node ID. SSA node IDs must be a Check the reported errno and correct the problem if
number in the range of 1 to 128. possible. Otherwise, contact the IBM Support Center.
Explanation: 6027-2024 ioctl failed with rc=returnCode,
You specified a node ID outside of the acceptable errno=errnoValue. Related values
range. are scsi_status=scsiStatusValue,
sense_key=senseKeyValue,
User response:
scsi_asc=scsiAscValue,
Choose a correct node ID and retry the command.
scsi_ascq=scsiAscqValue.
6027-2018 The SSA node id is not set. Explanation:
Explanation: An ioctl call failed with stated return code, errno
The SSA node ID has not been set. value, and related values.
User response: User response:
Set the SSA node ID. Check the reported errno and correct the problem if
possible. Otherwise, contact the IBM Support Center.
6027-2019 Unable to retrieve the SSA node id.
6027-2025 READ_KEYS ioctl failed
Explanation:
with errno=returnCode, tried
A failure occurred while trying to retrieve the SSA node
timesTried times. Related values
ID.
are scsi_status=scsiStatusValue,
User response: sense_key=senseKeyValue,
None. scsi_asc=scsiAscValue,
scsi_ascq=scsiAscqValue.
6027-2020 Unable to set fencing for disk
physicalDiskName. Explanation:
A READ_KEYS ioctl call failed with stated errno
Explanation: value, and related values.
A failure occurred while trying to set fencing for the
specified disk. User response:
Check the reported errno and correct the problem if
User response: possible. Otherwise, contact the IBM Support Center.
None.
6027-2026 READRES ioctl failed with
6027-2021 Unable to clear PR reservations for errno=returnCode, tried timesTried
disk physicalDiskNam. times. Related values
Explanation: are: scsi_status=scsiStatusValue,
Failed to clear Persistent Reserve information on the sense_key=senseKeyValue,
disk. scsi_asc=scsiAscValue,
scsi_ascq=scsiAscqValue.
User response:
Explanation:

A REGISTER ioctl call failed with stated errno A GPFS page pool cannot be pinned into memory on
value, and related values. this machine.
User response: User response:
Check the reported errno and correct the problem if Increase the physical memory size of the machine.
possible. Otherwise, contact the IBM Support Center.
6027-2050 [W] Pagepool has size actualValueK
6027-2027 READRES ioctl failed with bytes instead of the requested
errno=returnCode, tried timesTried requestedValueK bytes.
times. Related values
Explanation:
are: scsi_status=scsiStatusValue,
The configured GPFS page pool is either A) too large to
sense_key=senseKeyValue,
be allocated or pinned into memory on this machine,
scsi_asc=scsiAscValue,
or B) rounded down to align on page size boundary.
scsi_ascq=scsiAscqValue.
GPFS will work properly, but with reduced capacity for
Explanation: caching user data.
A READRES ioctl call failed with stated errno
User response:
value, and related values.
To prevent this message from being generated if
User response: pagepool size is too large, reduce pagepool size using
Check the reported errno and correct the problem if mmchconfig command.
possible. Otherwise, contact the IBM Support Center.
6027-2051 [W] This node has LROC devices
6027-2028 could not open disk device with total capacity of
diskDeviceName lrocCapacity GB. Optimal LROC
performance requires setting the
Explanation:
maxBufferDescs config option.
A problem occurred on a disk open.
Based on an assumed 4MB data
User response: block size, the recommended
Ensure the disk is accessible and not fenced out, and value for maxBufferDescs is
then reissue the command. maxBufferDescs on this node.
6027-2029 could not close disk device Explanation:
diskDeviceName When LROC devices are present on the node,
additional buffer descriptors are needed to get desired
Explanation:
performance benefits.
A problem occurred on a disk close.
User response:
User response:
Set maxBufferDescs config parameter as
None.
recommended using mmchconfig command for this
6027-2030 ioctl failed with DSB=value and node.
result=value reason: explanation
6027-2052 [I] [For optimal LROC tuning,
Explanation: based on average cached
An ioctl call failed with stated return code, errno data block sizes of
value, and related values. averageCachedDataBlockSize in
previously observed workloads,
User response:
value of maxBufferDescs for
Check the reported errno and correct the problem, if
maxBufferDescs for this node is
possible. Otherwise, contact the IBM Support Center.
recommended.
6027-2031 ioctl failed with non-zero return
Explanation:
code
Based on average cached data block sizes,
Explanation: recommend maxBufferDescs value for optimal LROC
An ioctl failed with a non-zero return code. tuning.
User response: User response:
Correct the problem, if possible. Otherwise, contact Use this message as a hint to set maxBufferDescs
the IBM Support Center. config parameter as suggested above using
mmchconfig command for this node.
6027-2049 [X] Cannot pin a page pool of size
value bytes. 6027-2053 [W] The PagePoolMinPhysMemPct
parameter
Explanation:

(PagePoolMinPhysMemPct) is 6027-2057 [W] Cannot increase pagepool by
set larger than growPages pages for growForWhat
the PagePoolMaxPhysMemPct because then the page pool
parameter size would exceed the maximum
(PagePoolMaxPhysMemPct). page pool size (current
currPagepoolSizePages pages,
Explanation:
maximum maxPagepoolSizePages
The configuration parameter
pages).
PagePoolMinPhysMemory is set to a larger
value than the configuration parameter Explanation:
PagePoolMaxPhysMemory. The current pagepool size is close to the maximum
pagepool size. If the pagepool size were increased,
User response:
then the current pagepool size would become larger
Review the configuration settings
than the maximum page pool size.
of PagePoolMinPhysMemory and
PagePoolMaxPhysMemory and set the User response:
PagePoolMinPhysMemory value less than the Set a larger value for the pagepoolMaxPhysMemPct
PagePoolMaxPhysMemory value. parameter to allow more memory for the pagepool.
If the total physical memory is not enough, add more
6027-2054 [W] The available physical memory
physical memory.
(initMaxAvailPhyMemBytes) is less
than the minimal size of pagepool 6027-2058 [W] Cannot shrink the pagepool
(minPagePoolSize). Trying to for shrinkForWhat because
allocate the minimal size. the pagepool size is
Explanation: already less than the
The available physical memory is too low, and is less minimum pagepool size (current
than the minimal pagepool size. The GPFS daemon will currPagepoolSizePages pages,
try to start and allocate the minimal size of pagepool, minimum minPagepoolSizePages
which may fail due to low physical memory. pages).
Explanation:
User response:
The system is running out of memory and is requesting
If the GPFS daemon cannot start, try to free
to shrink the pagepool. However, the current page pool
some physical memory or stop applications that use
size is already less than the minimum value.
significant amount of memory.
User response:
6027-2055 [I] The available physical memory
Add more physical memory.
(initMaxAvailPhyMemBytes) is less
than the initial size of pagepool 6027-2059 [W] Attempt to shrink pagepool less
(initPagePoolSize). Using the than the minimum pagepool
available physical memory as the size. The shrink request is
initial size. adjusted to shrink the pagepool
Explanation: by shrinkPages pages for
The available physical memory is less than the initial shrinkForWhat. The pagepool
size of pagepool. The initial size of pagepool has been size now reaches to the
adjusted to the current available physical memory. minimum pagepool size (current
currPagepoolSizePages pages,
User response: minimum minPagepoolSizePages
N/A pages).
6027-2056 [X] Cannot map RDMA to a pagepool Explanation:
buffer. The shrink pages must be adjusted to avoid the size
to be less than the minimum pagepool size. After
Explanation:
shrinking, the current page pool size reaches to the
A GPFS page pool buffer cannot be mapped to RDMA
minimum pagepool size. It seems other applications
on this machine.
are requesting more memory, and the system is
User response: running out memory.
Reduce RDMA usage by other applications on this
User response:
machine, or reboot the machine to recover leaked
Add more physical memory.
RDMA mappings, or disable RDMA, or disable the
dynamic pagepool.

6027-2060 [W] Insufficient system memory static pagepool and set the pagepool parameter
for enabling the dynamic accordingly.
pagepool, and a static pagepool 6027-2100 Incorrect range value-value
is used instead. A dynamic specified.
pagepool requires minimum
requiredMemoryForDpp bytes of Explanation:
system memory. The range specified to the command is incorrect. The
first parameter value must be less than or equal to the
Explanation: second parameter value.
Dynamic pagepool is enabled, but the system has less
than the minimum memory that is required to run the User response:
dynamic pagepool. Correct the address range and reissue the command.
User response: 6027-2101 Insufficient free space in
Disable the dynamic pagepool, or add more memory to fileSystem (storage minimum
the system. required).
6027-2061 [W] The upper page pool boundary Explanation:
the PagePoolMaxPhysMemPct There is not enough free space in the specified file
parameter is reduced to system or directory for the command to successfully
adjustedPercentage percent to complete.
avoid running out of memory. User response:
Explanation: Correct the problem and reissue the command.
The specified setting of the
6027-2102 Node nodeName is not
PagePoolMaxPhysMemPct parameter reduces the
mmremotefs to run the command.
system memory less than 4GiB of memory
outside the pagepool. The effective value of the Explanation:
PagePoolMaxPhysMemPct parameter has been The specified node is not available to run a command.
reduced to have 4GiB of memory outside the page Depending on the command, a different node may be
pool. tried.
User response: User response:
Reduce the PagePoolMaxPhysMemPct parameter Determine why the specified node is not available and
value or add more memory to the system. correct the problem.
6027-2062 [I] Overriding and disabling 6027-2103 Directory dirName does not exist
the dynamicPagepoolEnabled Explanation:
parameter because the node is an The specified directory does not exist.
NSD server.
User response:
Explanation: Reissue the command specifying an existing directory.
Dynamic pagepool is not supported on NSD servers.
6027-2104 The GPFS release level could not
User response:
be determined on nodes: nodeList.
If a node is an NSD server, set the
dynamicPagepoolEnabled=no parameter for this Explanation:
node. The command was not able to determine the level of
the installed GPFS code on the specified nodes.
6027-2063 [X] The pagepool size is too small.
Disable the dynamic pagepool or User response:
set the pagepool size to minimum Reissue the command after correcting the problem.
minPagepoolSize bytes. 6027-2105 The following nodes must be
Explanation: upgraded to GPFS release
Cannot switch from a dynamic pagepool to a static productVersion or higher: nodeList
pagepool because the current pagepool size is too Explanation:
small. The command requires that all nodes be at the
User response: specified GPFS release level.
Set the dynamicPagepoolEnabled=no parameter User response:
on this node because the dynamic pagepool needs Correct the problem and reissue the command.
to be disabled. Also, review the desired size of the

6027-2106 Ensure the nodes are available The specified command option requires root authority.
and run: command. User response:
Explanation: Log on as root and reissue the command.
The command could not complete normally. 6027-2113 Not able to associate diskName on
User response: node nodeName with any known
Check the preceding messages, correct the problems, GPFS disk.
and issue the specified command until it completes Explanation:
successfully. A command could not find a GPFS disk that matched
6027-2107 Upgrade the lower release level the specified disk and node values passed as input.
nodes and run: command. User response:
Explanation: Correct the disk and node values passed as input and
The command could not complete normally. reissue the command.
User response: 6027-2114 The subsystem subsystem is
Check the preceding messages, correct the problems, already active.
and issue the specified command until it completes Explanation:
successfully. The user attempted to start a subsystem that was
6027-2108 Error found while processing already active.
stanza User response:
Explanation: None. Informational message only.
A stanza was found to be unsatisfactory in some way. 6027-2115 Unable to resolve address range
User response: for disk diskName on node
Check the preceding messages, if any, and correct the nodeName.
condition that caused the stanza to be rejected. Explanation:
6027-2109 Failed while processing disk A command could not perform address range
stanza on node nodeName. resolution for the specified disk and node values
passed as input.
Explanation:
A disk stanza was found to be unsatisfactory in some User response:
way. Correct the disk and node values passed as input and
reissue the command.
User response:
Check the preceding messages, if any, and correct the 6027-2116 [E] The GPFS daemon must be active
condition that caused the stanza to be rejected. on the recovery group server
nodes.
6027-2110 Missing required parameter
parameter Explanation:
The command requires that the GPFS daemon be
Explanation: active on the recovery group server nodes.
The specified parameter is required for this command.
User response:
User response: Ensure GPFS is running on the recovery group server
Specify the missing information and reissue the nodes and reissue the command.
command.
6027-2117 [E] object name already exists.
6027-2111 The following disks were not
deleted: diskList Explanation:
The user attempted to create an object with a name
Explanation: that already exists.
The command could not delete the specified disks.
Check the preceding messages for error information. User response:
Correct the name and reissue the command.
User response:
Correct the problems and reissue the command. 6027-2118 [E] The parameter is invalid or missing
in the pdisk descriptor.
6027-2112 Permission failure. Option option
requires root authority to run. Explanation:
The pdisk descriptor is not valid. The bad descriptor is
Explanation: displayed following this message.

User response: Vdisk-based NSDs cannot be specified as tiebreaker
Correct the input and reissue the command. disks.
6027-2119 [E] Recovery group name not found. User response:
Correct the input and reissue the command.
Explanation:
The specified recovery group was not found. 6027-2126 [I] No recovery groups were found.
User response: Explanation:
Correct the input and reissue the command. A command searched for recovery groups but found
none.
6027-2120 [E] Unable to delete recovery group
name on nodes nodeNames. User response:
None. Informational message only.
Explanation:
The recovery group could not be deleted on the 6027-2127 [E] Disk descriptor descriptor refers to
specified nodes. an existing pdisk.
User response: Explanation:
Perform problem determination. The specified disk descriptor refers to an existing
pdisk.
6027-2121 [I] Recovery group name deleted on
node nodeName. User response:
Specify another disk that is not an existing pdisk.
Explanation:
The recovery group has been deleted. 6027-2128 [E] The attribute attribute must be
configured to use hostname as a
User response:
recovery group server.
This is an informational message.
Explanation:
6027-2122 [E] The number of spares
The specified GPFS configuration attributes must be
(numberOfSpares) must be less
configured to use the node as a recovery group server.
than the number of pdisks
(numberOfpdisks) being created. User response:
Use the mmchconfig command to set the attributes,
Explanation:
then reissue the command.
The number of spares specified must be less than the
number of pdisks that are being created. 6027-2129 [E] Vdisk block size (blockSize) must
match the file system block size
User response:
(blockSize).
Correct the input and reissue the command.
Explanation:
6027-2123 [E] The GPFS daemon is down on the
The specified NSD is a vdisk with a block size that does
vdiskName servers.
not match the block size of the file system.
Explanation:
User response:
The GPFS daemon was down on the vdisk servers
Reissue the command using block sizes that match.
when mmdelvdisk was issued.
6027-2130 [E] Could not find an active server for
User response:
recovery group name.
Start the GPFS daemon on the specified nodes and
issue the specified mmdelvdisk command. Explanation:
A command was issued that acts on a recovery group,
6027-2124 [E] Vdisk vdiskName is still NSD
but no active server was found for the specified
nsdName. Use the mmdelnsd
recovery group.
command.
User response:
Explanation:
Perform problem determination.
The specified vdisk is still an NSD.
6027-2131 [E] Cannot create an NSD on a log
User response:
vdisk.
Use the mmdelnsd command.
Explanation:
6027-2125 [E] nsdName is a vdisk-based NSD
The specified disk is a log vdisk; it cannot be used for
and cannot be used as a tiebreaker
an NSD.
disk.
User response:
Explanation:

Specify another disk that is not a log vdisk.

6027-2132 [E] Log vdisk vdiskName cannot be deleted while there are other vdisks in recovery group name.
Explanation:
The specified disk is a log vdisk; it must be the last vdisk deleted from the recovery group.
User response:
Delete the other vdisks first.

6027-2133 [E] Unable to delete recovery group name; vdisks are still defined.
Explanation:
Cannot delete a recovery group while there are still vdisks defined.
User response:
Delete all the vdisks first.
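A hedged sketch of that cleanup sequence follows; the vdisk and recovery group names are placeholders, and the exact options should be confirmed in the command reference for your release.
    mmlsvdisk                      # list the vdisks still defined in the recovery group
    mmdelvdisk vdisk01             # delete each remaining vdisk
    mmdelrecoverygroup rg01        # then delete the now-empty recovery group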
6027-2134 Node nodeName cannot be used as an NSD server for Persistent Reserve disk diskName because it is not a Linux node.
Explanation:
There was an attempt to enable Persistent Reserve for a disk, but not all of the NSD server nodes are running Linux.
User response:
Correct the configuration and enter the command again.

6027-2135 All nodes in the cluster must be running AIX to enable Persistent Reserve for SAN attached disk diskName.
Explanation:
There was an attempt to enable Persistent Reserve for a SAN-attached disk, but not all nodes in the cluster are running AIX.
User response:
Correct the configuration and run the command again.

6027-2136 All NSD server nodes must be running AIX to enable Persistent Reserve for disk diskName.
Explanation:
There was an attempt to enable Persistent Reserve for the specified disk, but not all NSD servers are running AIX.
User response:
Correct the configuration and enter the command again.

6027-2137 An attempt to clear the Persistent Reserve reservations on disk diskName failed.
Explanation:
You are importing a disk into a cluster in which Persistent Reserve is disabled. An attempt to clear the Persistent Reserve reservations on the disk failed.
User response:
Correct the configuration and enter the command again.

6027-2138 The cluster must be running either all AIX or all Linux nodes to change Persistent Reserve disk diskName to a SAN-attached disk.
Explanation:
There was an attempt to redefine a Persistent Reserve disk as a SAN attached disk, but not all nodes in the cluster were running either all AIX or all Linux nodes.
User response:
Correct the configuration and enter the command again.

6027-2139 NSD server nodes must be running either all AIX or all Linux to enable Persistent Reserve for disk diskName.
Explanation:
There was an attempt to enable Persistent Reserve for a disk, but not all NSD server nodes were running all AIX or all Linux nodes.
User response:
Correct the configuration and enter the command again.

6027-2140 All NSD server nodes must be running AIX or all running Linux to enable Persistent Reserve for disk diskName.
Explanation:
Attempt to enable Persistent Reserve for a disk while not all NSD server nodes are running AIX or all running Linux.
User response:
Correct the configuration first.

6027-2141 Disk diskName is not configured as a regular hdisk.
Explanation:
In an AIX only cluster, Persistent Reserve is supported for regular hdisks only.
User response:
Correct the configuration and enter the command again.

6027-2142 Disk diskName is not configured as a regular generic disk.
Explanation:

In a Linux only cluster, Persistent Reserve is supported for regular generic or device mapper virtual disks only.
User response:
Correct the configuration and enter the command again.

6027-2143 Mount point mountPoint cannot be part of automount directory automountDir.
Explanation:
The mount point cannot be the parent directory of the automount directory.
User response:
Specify a mount point that is not the parent of the automount directory.

6027-2144 [E] The lockName lock for file system fileSystem is busy.
Explanation:
More than one process is attempting to obtain the specified lock.
User response:
Repeat the command. If the problem persists, verify that there are no blocked processes.

6027-2145 [E] Internal remote command 'mmremote command' no longer supported.
Explanation:
A GPFS administration command invoked an internal remote command which is no longer supported. Backward compatibility for remote commands is only supported for release 3.4 and newer.
User response:
All nodes within the cluster must be at release 3.4 or newer. If all the cluster nodes meet this requirement, contact the IBM Support Center.
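To check the code levels involved, one possible approach is to query the cluster minimum release level and the daemon version on each node; the file system and node names in any follow-on actions are your own.
    mmlsconfig minReleaseLevel     # cluster-wide minimum release level in effect
    mmdiag --version               # daemon build level on the local node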
6027-2147 [E] BlockSize must be specified in disk descriptor.
Explanation:
The blockSize positional parameter in a vdisk descriptor was empty. The bad disk descriptor is displayed following this message.
User response:
Correct the input and reissue the command.

6027-2148 [E] nodeName is not a valid recovery group server for recoveryGroupName.
Explanation:
The server name specified is not one of the defined recovery group servers.
User response:
Correct the input and reissue the command.

6027-2149 [E] Could not get recovery group information from an active server.
Explanation:
A command that needed recovery group information failed; the GPFS daemons may have become inactive or the recovery group is temporarily unavailable.
User response:
Reissue the command.

6027-2150 The archive system client backupProgram could not be found or is not executable.
Explanation:
TSM dsmc or other specified backup or archive system client could not be found.
User response:
Verify that TSM is installed, dsmc can be found in the installation location, or that the archiver client specified is executable.

6027-2151 The path directoryPath is not contained in the snapshot snapshotName.
Explanation:
The directory path supplied is not contained in the snapshot named with the -S parameter.
User response:
Correct the directory path or snapshot name supplied, or omit -S and the snapshot name in the command.

6027-2152 The path directoryPath containing image archives was not found.
Explanation:
The directory path supplied does not contain the expected image files to archive into TSM.
User response:
Correct the directory path name supplied.

6027-2153 The archiving system backupProgram exited with status return code. Image backup files have been preserved in globalWorkDir.
Explanation:
Archiving system executed and returned a non-zero exit status due to some error.
User response:
Examine archiver log files to discern the cause of the archiver's failure. Archive the preserved image files from the indicated path.

6027-2154 Unable to create a policy file for image backup in policyFilePath.
Explanation:

A temporary file could not be created in the global shared directory path.
User response:
Check or correct the directory path name supplied.

6027-2155 File system fileSystem must be mounted read only for restore.
Explanation:
The empty file system targeted for restoration must be mounted in read only mode during restoration.
User response:
Unmount the file system on all nodes and remount it read only, then try the command again.

6027-2156 The image archive index ImagePath could not be found.
Explanation:
The archive image index could not be found in the specified path.
User response:
Check command arguments for correct specification of image path, then try the command again.

6027-2157 The image archive index ImagePath is corrupt or incomplete.
Explanation:
The archive image index specified is damaged.
User response:
Check the archive image index file for corruption and remedy.

6027-2158 Disk usage must be dataOnly, metadataOnly, descOnly, dataAndMetadata, vdiskLog, vdiskLogTip, vdiskLogTipBackup, or vdiskLogReserved.
Explanation:
The disk usage positional parameter in a vdisk descriptor has a value that is not valid. The bad disk descriptor is displayed following this message.
User response:
Correct the input and reissue the command.
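For orientation, a hypothetical vdisk stanza carrying one of the accepted disk usage values might look like the sketch below. The stanza field names are recalled from memory and the values are placeholders; confirm the exact syntax against the mmcrvdisk documentation for your release.
    %vdisk: vdiskName=rg01LogTip
      rg=rg01
      da=DA1
      blocksize=2m
      size=48m
      raidCode=2WayReplication
      diskUsage=vdiskLogTip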
6027-2159 [E] parameter is not valid or missing in the vdisk descriptor.
Explanation:
The vdisk descriptor is not valid. The bad descriptor is displayed following this message.
User response:
Correct the input and reissue the command.

6027-2160 [E] Vdisk vdiskName is already mapped to NSD nsdName.
Explanation:
The command cannot create the specified NSD because the underlying vdisk is already mapped to a different NSD.
User response:
Correct the input and reissue the command.

6027-2161 [E] NSD servers cannot be specified when creating an NSD on a vdisk.
Explanation:
The command cannot create the specified NSD because servers were specified and the underlying disk is a vdisk.
User response:
Correct the input and reissue the command.

6027-2162 [E] Cannot set nsdRAIDTracks to zero; nodeName is a recovery group server.
Explanation:
nsdRAIDTracks cannot be set to zero while the node is still a recovery group server.
User response:
Modify or delete the recovery group and reissue the command.

6027-2163 [E] Vdisk name not found in the daemon. Recovery may be occurring. The disk will not be deleted.
Explanation:
GPFS cannot find the specified vdisk. This can happen if recovery is taking place and the recovery group is temporarily inactive.
User response:
Reissue the command. If the recovery group is damaged, specify the -p option.

6027-2164 [E] Disk descriptor for name refers to an existing pdisk.
Explanation:
The specified pdisk already exists.
User response:
Correct the command invocation and try again.

6027-2165 [E] Node nodeName cannot be used as a server of both vdisks and non-vdisk NSDs.
Explanation:
The command specified an action that would have caused vdisks and non-vdisk NSDs to be defined on the same server. This is not a supported configuration.
User response:
Correct the command invocation and try again.

6027-2166 [E] IBM Storage Scale RAID is not configured.
Explanation:
IBM Storage Scale RAID is not configured on this node.
User response:
Reissue the command on the appropriate node.

6027-2167 [E] Device deviceName does not exist or is not active on this node.
Explanation:
The specified device does not exist or is not active on the node.
User response:
Reissue the command on the appropriate node.

6027-2168 [E] The GPFS cluster must be shut down before downloading firmware to port cards.
Explanation:
The GPFS daemon must be down on all nodes in the cluster before attempting to download firmware to a port card.
User response:
Stop GPFS on all nodes and reissue the command.

6027-2169 Unable to disable Persistent Reserve on the following disks: diskList
Explanation:
The command was unable to disable Persistent Reserve on the specified disks.
User response:
Examine the disks and additional error information to determine if the disks should support Persistent Reserve. Correct the problem and reissue the command.

6027-2170 [E] Recovery group recoveryGroupName does not exist or is not active.
Explanation:
A command was issued to a recovery group that does not exist or is not in the active state.
User response:
Reissue the command with a valid recovery group name or wait for the recovery group to become active.

6027-2171 [E] objectType objectName already exists in the cluster.
Explanation:
The file system being imported contains an object with a name that conflicts with the name of an existing object in the cluster.
User response:
If possible, remove the object with the conflicting name.

6027-2172 [E] Errors encountered while importing IBM Storage Scale RAID objects.
Explanation:
Errors were encountered while trying to import an IBM Storage Scale RAID based file system. No file systems will be imported.
User response:
Check the previous error messages and if possible, correct the problems.

6027-2173 [I] Use mmchrecoverygroup to assign and activate servers for the following recovery groups (automatically assigns NSD servers as well): recoveryGroupList
Explanation:
The mmimportfs command imported the specified recovery groups. These must have servers assigned and activated.
User response:
After the mmimportfs command finishes, use the mmchrecoverygroup command to assign NSD server nodes as needed.
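A sketch of that follow-up, with placeholder recovery group and node names; the option names are recalled from the IBM Storage Scale RAID command set and should be verified against the mmchrecoverygroup man page before use.
    mmchrecoverygroup rg01 --servers serverA,serverB   # assign primary and backup servers
    mmchrecoverygroup rg01 --active serverA            # make serverA the active server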
6027-2174 Option option can be specified only in conjunction with option.
Explanation:
The cited option cannot be specified by itself.
User response:
Correct the input and reissue the command.

6027-2175 [E] Exported path exportPath does not exist
Explanation:
The directory or one of the components in the directory path to be exported does not exist.
User response:
Correct the input and reissue the command.

6027-2176 [E] mmchattr for fileName failed.
Explanation:
The command to change the attributes of the file failed.
User response:
Check the previous error messages and correct the problems.

6027-2177 [E] Cannot create file fileName.
Explanation:
The command to create the specified file failed.
User response:

Check the previous error messages and correct the problems.

6027-2178 File fileName does not contain any NSD descriptors or stanzas.
Explanation:
The input file should contain at least one NSD descriptor or stanza.
User response:
Correct the input file and reissue the command.
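For reference, an NSD stanza in the input file generally follows the pattern below; the device path, NSD name, server names, and failure group are illustrative placeholders for your own environment.
    %nsd: device=/dev/sdb
      nsd=nsd1
      servers=nsdserver1,nsdserver2
      usage=dataAndMetadata
      failureGroup=1
      pool=system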
6027-2181 [E] Failover is allowed only for single-writer, independent-writer filesets.
Explanation:
The fileset AFM mode is not compatible with the requested operation.
User response:
Check the previous error messages and correct the problems.

6027-2182 [E] Resync is allowed only for single-writer filesets.
Explanation:
The fileset AFM mode is not compatible with the requested operation.
User response:
Check the previous error messages and correct the problems.

6027-2183 [E] Peer snapshots using mmpsnap are allowed only for single-writer or primary filesets.
Explanation:
The fileset AFM mode is not compatible with the requested operation.
User response:
Check the previous error messages and correct the problems.

6027-2184 [E] If the recovery group is damaged, issue mmdelrecoverygroup name -p.
Explanation:
No active servers were found for the recovery group that is being deleted. If the recovery group is damaged, the -p option is needed.
User response:
Perform diagnosis and reissue the command.

6027-2185 [E] There are no pdisk stanzas in the input file fileName.
Explanation:
The mmcrrecoverygroup input stanza file has no pdisk stanzas.
User response:
Correct the input file and reissue the command.

6027-2186 [E] There were no valid vdisk stanzas in the input file fileName.
Explanation:
The mmcrvdisk input stanza file has no valid vdisk stanzas.
User response:
Correct the input file and reissue the command.

6027-2187 [E] Could not get pdisk information for the following recovery groups: recoveryGroupList
Explanation:
An mmlspdisk all command could not query all of the recovery groups because some nodes could not be reached.
User response:
None.

6027-2188 Unable to determine the local node identity.
Explanation:
The command is not able to determine the identity of the local node. This can be the result of a disruption in the network over which the GPFS daemons communicate.
User response:
Ensure the GPFS daemon network (as identified in the output of the mmlscluster command on a good node) is fully operational and reissue the command.

6027-2189 [E] Action action is allowed only for read-only filesets.
Explanation:
The specified action is only allowed for read-only filesets.
User response:
None.

6027-2190 [E] Cannot prefetch file fileName. The file does not belong to fileset fileset.
Explanation:
The requested file does not belong to the fileset.
User response:
None.

6027-2191 [E] Vdisk vdiskName not found in recovery group recoveryGroupName.
Explanation:
The mmdelvdisk command was invoked with the --recovery-group option to delete one or more vdisks

from a specific recovery group. The specified vdisk does not exist in this recovery group.
User response:
Correct the input and reissue the command.

6027-2193 [E] Recovery group recoveryGroupName must be active on the primary server serverName.
Explanation:
The recovery group must be active on the specified node.
User response:
Use the mmchrecoverygroup command to activate the group and reissue the command.

6027-2194 [E] The state of fileset filesetName is Expired; prefetch cannot be performed.
Explanation:
The prefetch operation cannot be performed on filesets that are in the Expired state.
User response:
None.

6027-2195 [E] Error getting snapshot ID for snapshotName.
Explanation:
The command was unable to obtain the resync snapshot ID.
User response:
Examine the preceding messages, correct the problem, and reissue the command. If the problem persists, perform problem determination and contact the IBM Support Center.

6027-2196 [E] Resync is allowed only when the fileset queue is in active state.
Explanation:
This operation is allowed only when the fileset queue is in active state.
User response:
None.

6027-2197 [E] Empty file encountered when running the mmafmctl flushPending command.
Explanation:
The mmafmctl flushPending command did not find any entries in the file specified with the --list-file option.
User response:
Correct the input file and reissue the command.

6027-2198 [E] Cannot run the mmafmctl flushPending command on directory dirName.
Explanation:
The mmafmctl flushPending command cannot be issued on this directory.
User response:
Correct the input and reissue the command.

6027-2199 [E] No enclosures were found.
Explanation:
A command searched for disk enclosures but none were found.
User response:
None.

6027-2200 [E] Cannot have multiple nodes updating firmware for the same enclosure. Enclosure serialNumber is already being updated by node nodeName.
Explanation:
The mmchenclosure command was called with multiple nodes updating the same firmware.
User response:
Correct the node list and reissue the command.

6027-2201 [E] The mmafmctl flushPending command completed with errors.
Explanation:
An error occurred while flushing the queue.
User response:
Examine the GPFS log to identify the cause.

6027-2202 [E] There is a SCSI-3 PR reservation on disk diskname. mmcrnsd cannot format the disk because the cluster is not configured as PR enabled.
Explanation:
The specified disk has a SCSI-3 PR reservation, which prevents the mmcrnsd command from formatting it.
User response:
Clear the PR reservation by following the instructions in “Clearing a leftover Persistent Reserve reservation” on page 426.

6027-2203 Node nodeName is not a gateway node.
Explanation:
The specified node is not a gateway node.
User response:
Designate the node as a gateway node or specify a different node on the command line.
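A minimal sketch of designating a gateway node with mmchnode; the node name is a placeholder for a node in your cluster.
    mmchnode --gateway -N node5    # designate node5 as an AFM gateway node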

6027-2204 AFM target map mapName is already defined.
Explanation:
A request was made to create an AFM target map with the cited name, but that map name is already defined.
User response:
Specify a different name for the new AFM target map or first delete the current map definition and then recreate it.

6027-2205 There are no AFM target map definitions.
Explanation:
A command searched for AFM target map definitions but found none.
User response:
None. Informational message only.

6027-2206 AFM target map mapName is not defined.
Explanation:
The cited AFM target map name is not known to GPFS.
User response:
Specify an AFM target map known to GPFS.

6027-2207 Node nodeName is being used as a gateway node for the AFM cluster clusterName.
Explanation:
The specified node is defined as a gateway node for the specified AFM cluster.
User response:
If you are trying to delete the node from the GPFS cluster or delete the gateway node role, you must remove it from the export server map.

6027-2208 [E] commandName is already running in the cluster.
Explanation:
Only one instance of the specified command is allowed to run.
User response:
None.

6027-2209 [E] Unable to list objectName on node nodeName.
Explanation:
A command was unable to list the specific object that was requested.
User response:
None.

6027-2210 [E] Unable to build a storage enclosure inventory file on node nodeName.
Explanation:
A command was unable to build a storage enclosure inventory file. This is a temporary file that is required to complete the requested command.
User response:
None.

6027-2211 [E] Error collecting firmware information on node nodeName.
Explanation:
A command was unable to gather firmware information from the specified node.
User response:
Ensure the node is active and retry the command.

6027-2212 [E] Firmware update file updateFile was not found.
Explanation:
The mmchfirmware command could not find the specified firmware update file to load.
User response:
Locate the firmware update file and retry the command.

6027-2213 [E] Pdisk path redundancy was lost while updating enclosure firmware.
Explanation:
The mmchfirmware command lost paths after loading firmware and rebooting the Enclosure Services Module.
User response:
Wait a few minutes and then retry the command. GPFS might need to be shut down to finish updating the enclosure firmware.

6027-2214 [E] Timeout waiting for firmware to load.
Explanation:
A storage enclosure firmware update was in progress, but the update did not complete within the expected time frame.
User response:
Wait a few minutes, and then use the mmlsfirmware command to ensure the operation completed.
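For example, the current firmware levels can be reviewed after the wait; this assumes the node has the IBM Storage Scale RAID commands installed.
    mmlsfirmware    # list reported firmware levels to confirm the update completed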
6027-2215 [E] Storage enclosure serialNumber not found.
Explanation:
The specified storage enclosure was not found.
User response:

None.

6027-2216 Quota management is disabled for file system fileSystem.
Explanation:
Quota management is disabled for the specified file system.
User response:
Enable quota management for the file system.
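One possible way to act on that response is sketched below; the file system name is a placeholder.
    mmchfs fs1 -Q yes     # enable quota enforcement on file system fs1
    mmcheckquota fs1      # optionally re-count usage after enabling quotas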
6027-2217 [E] Error errno updating firmware for drives driveList.
Explanation:
The firmware load failed for the specified drives. Some of the drives may have been updated.
User response:
None.

6027-2218 [E] Storage enclosure serialNumber component componentType component ID componentId not found.
Explanation:
The mmchenclosure command could not find the component specified for replacement.
User response:
Use the mmlsenclosure command to determine valid input and then retry the command.

6027-2219 [E] Storage enclosure serialNumber component componentType component ID componentId did not fail. Service is not required.
Explanation:
The component specified for the mmchenclosure command does not need service.
User response:
Use the mmlsenclosure command to determine valid input and then retry the command.

6027-2220 [E] Recovery group name has pdisks with missing paths. Consider using the -v no option of the mmchrecoverygroup command.
Explanation:
The mmchrecoverygroup command failed because all the servers could not see all the disks, and the primary server is missing paths to disks.
User response:
If the disks are cabled correctly, use the -v no option of the mmchrecoverygroup command.

6027-2221 [E] Error determining redundancy of enclosure serialNumber ESM esmName.
Explanation:
The mmchrecoverygroup command failed. Check the following error messages.
User response:
Correct the problem and retry the command.

6027-2222 [E] Storage enclosure serialNumber already has a newer firmware version: firmwareLevel.
Explanation:
The mmchfirmware command found a newer level of firmware on the specified storage enclosure.
User response:
If the intent is to force on the older firmware version, use the -v no option.

6027-2223 [E] Storage enclosure serialNumber is not redundant. Shutdown GPFS in the cluster and retry the mmchfirmware command.
Explanation:
The mmchfirmware command found a non-redundant storage enclosure. Proceeding could cause loss of data access.
User response:
Shut down GPFS in the cluster and retry the mmchfirmware command.

6027-2224 [E] Peer snapshot creation failed. Error code errorCode.
Explanation:
For an active fileset, check the AFM target configuration for peer snapshots. Ensure there is at least one gateway node configured for the cluster. Examine the preceding messages and the GPFS log for additional details.
User response:
Correct the problems and reissue the command.

6027-2225 [E] Peer snapshot successfully deleted at cache. The delete snapshot operation failed at home. Error code errorCode.
Explanation:
For an active fileset, check the AFM target configuration for peer snapshots. Ensure there is at least one gateway node configured for the cluster. Examine the preceding messages and the GPFS log for additional details.
User response:
Correct the problems and reissue the command.

6027-2226 [E] Invalid firmware update file.
Explanation:

An invalid firmware update file was specified for the mmchfirmware command.
User response:
Reissue the command with a valid update file.

6027-2227 [E] Failback is allowed only for independent-writer filesets.
Explanation:
Failback operation is allowed only for independent-writer filesets.
User response:
Check the fileset mode.

6027-2228 [E] The daemon version (daemonVersion) on node nodeName is lower than the daemon version (daemonVersion) on node nodeName.
Explanation:
A command was issued that requires nodes to be at specific levels, but the affected GPFS servers are not at compatible levels to support this operation.
User response:
Update the GPFS code on the specified servers and retry the command.

6027-2229 [E] Cache Eviction/Prefetch is not allowed for Primary and Secondary mode filesets.
Explanation:
Cache eviction/prefetch is not allowed for primary and secondary mode filesets.
User response:
None.

6027-2230 [E] afmTarget=newTargetString is not allowed. To change the AFM target, use mmafmctl failover with the --target-only option. For primary filesets, use mmafmctl changeSecondary.
Explanation:
The mmchfileset command cannot be used to change the NFS server or IP address of the home cluster.
User response:
To change the AFM target, use the mmafmctl failover command and specify the --target-only option. To change the AFM target for primary filesets, use the mmafmctl changeSecondary command.

6027-2231 [E] The specified block size blockSize is smaller than the system page size pageSize.
Explanation:
The file system block size cannot be smaller than the system memory page size.
User response:
Specify a block size greater than or equal to the system memory page size.
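As a hedged illustration of that check, the page size can be read first and a block size at least that large chosen when the file system is created; the file system name, stanza file, and block size below are placeholders.
    getconf PAGESIZE                    # system memory page size in bytes
    mmcrfs fs1 -F nsdstanza.txt -B 4M   # choose a block size >= the page size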
6027-2232 [E] Peer snapshots are allowed only for targets using the NFS protocol.
Explanation:
The mmpsnap command can be used to create snapshots only for filesets that are configured to use the NFS protocol.
User response:
Specify a valid fileset target.

6027-2233 [E] Fileset filesetName in file system filesystemName does not contain peer snapshot snapshotName. The delete snapshot operation failed at cache. Error code errorCode.
Explanation:
The specified snapshot name was not found. The command expects the name of an existing peer snapshot of the active fileset in the specified file system.
User response:
Reissue the command with a valid peer snapshot name.

6027-2234 [E] Use the mmafmctl converttoprimary command for converting to primary fileset.
Explanation:
Converting to a primary fileset is not allowed directly.
User response:
Check the previous error messages and correct the problems.

6027-2235 [E] Only independent filesets can be converted to secondary filesets.
Explanation:
Converting to secondary filesets is allowed only for independent filesets.
User response:
None.

6027-2236 [E] The CPU architecture on this node does not support tracing in traceMode mode. Switching to traceMode mode.
Explanation:
The CPU does not have constant time stamp counter capability, which is required for overwrite trace mode. The trace has been enabled in blocking mode.

User response: Explanation:
Update the configuration parameters to use the trace The mmnfs export load loadCfgFile command
facility in blocking mode or replace this node with found an error in the NFS configuration files.
modern CPU architecture.
User response:
6027-2237 [W] An image backup made from the Correct the configuration file error.
live file system may not be usable
6027-2245 [E] To change the AFM target, use
for image restore. Specify a valid
mmafmctl changeSecondary for
global snapshot for image backup.
the primary.
Explanation:
Explanation:
The mmimgbackup command should always be used
Failover with the targetonly option can be run on a
with a global snapshot to make a consistent image
primary fileset.
backup of the file system.
User response:
User response:
None.
Correct the command invocation to include the -S
option to specify either a global snapshot name or 6027-2246 [E] Timeout executing function:
a directory path that includes the snapshot root functionName (return
directory for the file system and a valid global code=returnCode).
snapshot name.
Explanation:
6027-2238 [E] Use the mmafmctl The executeCommandWithTimeout function was
convertToSecondary command for called but it timed out.
converting to secondary.
User response:
Explanation: Correct the problem and issue the command again.
Converting to secondary is allowed by using the
6027-2247 [E] Creation of exchangeDir failed.
mmafmctl convertToSecondary command.
Explanation:
User response:
A Cluster Export Service command was unable to
None.
create the CCR exchange directory.
6027-2239 [E] Drive serialNumber serialNumber
User response:
is being managed by
Correct the problem and issue the command again.
server nodeName. Reissue the
mmchfirmware command for 6027-2248 [E] CCR command failed: command
server nodeName.
Explanation:
Explanation: A CCR update command failed.
The mmchfirmware command was issued to update
User response:
a specific disk drive which is not currently being
Correct the problem and issue the command again.
managed by this node.
6027-2249 [E] Error getting next nextName from
User response:
CCR.
Reissue the command specifying the active server.
Explanation:
6027-2240 [E] Option is not supported for a
An expected value from CCR was not obtained.
secondary fileset.
User response:
Explanation:
Issue the command again.
This option cannot be set for a secondary fileset.
6027-2250 [E] Error putting next nextName to
User response:
CCR, new ID: newExpid version:
None.
version
6027-2241 [E] Node nodeName is not a CES node.
Explanation:
Explanation: A CCR value update failed.
A Cluster Export Service command specified a node
User response:
that is not defined as a CES node.
Issue the command again.
User response:
6027-2251 [E] Error retrieving configuration file:
Reissue the command specifying a CES node.
configFile
6027-2242 [E] Error in configuration file.

Explanation: User response:
Error retrieving configuration file from CCR. Correct the problem and issue the command again.
User response: 6027-2258 [E] Error writing export configuration
Issue the command again. file to CCR (return code:
returnCode).
6027-2252 [E] Error reading export configuration
file (return code: returnCode). Explanation:
A CES command was unable to write configuration file
Explanation:
to CCR.
A CES command was unable to read the export
configuration file. User response:
Correct the problem and issue the command again.
User response:
Correct the problem and issue the command again. 6027-2259 [E] The path exportPath to create
the export does not exist (return
6027-2253 [E] Error creating the internal
code:returnCode).
export data objects (return code
returnCode). Explanation:
A CES command was unable to create an export
Explanation:
because the path does not exist.
A CES command was unable to create an export data
object. User response:
Correct the problem and issue the command again.
User response:
Correct the problem and issue the command again. 6027-2260 [E] The path exportPath to create
the export is invalid (return code:
6027-2254 [E] Error creating single export
returnCode).
output, export exportPath not
found (return code returnCode). Explanation:
A CES command was unable to create an export
Explanation:
because the path is invalid.
A CES command was unable to create a single export
print output. User response:
Correct the problem and issue the command again.
User response:
Correct the problem and reissue the command. 6027-2261 [E] Error creating new export object,
invalid data entered (return code:
6027-2255 [E] Error creating export output
returnCode).
(return code: returnCode).
Explanation:
Explanation:
A CES command was unable to add an export because
A CES command was unable to create the export print
the input data is invalid.
output.
User response:
User response:
Correct the problem and issue the command again.
Correct the problem and issue the command again.
6027-2262 [E] Error creating new export object;
6027-2256 [E] Error creating the internal export
getting new export ID (return
output file string array (return
code: returnCode).
code: returnCode).
Explanation:
Explanation:
A CES command was unable to add an export. A new
A CES command was unable to create the array for
export ID was not obtained.
print output.
User response:
User response:
Correct the problem and issue the command again.
Correct the problem and issue the command again.
6027-2263 [E] Error adding export; new export
6027-2257 [E] Error deleting export, export
path exportPath already exists.
exportPath not found (return code:
returnCode). Explanation:
A CES command was unable to add an export because
Explanation:
the path already exists.
A CES command was unable to delete an export. The
exportPath was not found. User response:
Correct the problem and issue the command again.

6027-2264 [E] The --servers option is only used 6027-2270 [E] Error errno updating firmware for
to provide names for primary adapter adapterIdentifier.
and backup server configurations.
Explanation:
Provide a maximum of two server
The firmware load failed for the specified adapter.
names.
User response:
Explanation:
None.
An input node list has too many nodes specified.
6027-2271 [E] Error locating the reference client
User response:
in inputStringContainingClient
Verify the list of nodes and shorten the list to the
(return code: returnCode).
supported number.
Explanation:
6027-2265 [E] Cannot convert fileset to
The reference client for reordering a client could not
secondary fileset.
be found for the given export path.
Explanation:
User response:
Fileset cannot be converted to a secondary fileset.
Correct the problem and try again.
User response:
6027-2272 [E] Error removing the
None.
requested client in
6027-2266 [E] The snapshot names that start inputStringContainingClient from a
with psnap-rpo or psnap0-rpo are client declaration, return code:
reserved for RPO. returnCode
Explanation: Explanation:
The specified snapshot name starts with psnap- One of the specified clients to remove could not be
rpo or psnap0-rpo, which are reserved for RPO found in any client declaration for the given export
snapshots. path.
User response: User response:
Use a different snapshot name for the mmcrsnapshot Correct the problem and try again.
command.
6027-2273 [E] Error adding the requested client
6027-2267 [I] Fileset filesetName in file system in inputStringContainingClient to
fileSystem is either unlinked a client declaration, return code:
or being deleted. Home delete- returnCode
snapshot operation was not
Explanation:
queued.
One of the specified clients to add could not be
Explanation: applied for the given export path.
The command expects that the peer snapshot at home
User response:
is not deleted because the fileset at cache is either
Correct the problem and try again.
unlinked or being deleted.
6027-2274 [E] Error locating the reference client
User response:
in inputStringContainingClient
Delete the snapshot at home manually.
(return code: returnCode).
6027-2268 [E] This is already a secondary fileset.
Explanation:
Explanation: The reference client for reordering a client could not
The fileset is already a secondary fileset. be applied for the given export path.
User response: User response:
None. Correct the problem and try again.
6027-2269 [E] Adapter adapterIdentifier was not 6027-2275 [E] Unable to determine the status of
found. DASD device dasdDevice
Explanation: Explanation:
The specified adapter was not found. The dasdview command failed.
User response: User response:
Specify an existing adapter and reissue the command. Examine the preceding messages, correct the
problem, and reissue the command.

6027-2276 [E] The specified DASD device Explanation:
dasdDevice is not properly The specified action is not allowed for secondary
formatted. It is not an ECKD-type filesets.
device, or it has a format other User response:
then CDL or LDL, or it has a block None.
size other then 4096.
6027-2283 [E] Node nodeName is already a CES
Explanation: node.
The specified device is not properly formatted.
Explanation:
User response: An mmchnode command attempted to enable CES
Correct the problem and reissue the command. services on a node that is already part of the CES
6027-2277 [E] Unable to determine if DASD cluster.
device dasdDevice is partitioned. User response:
Explanation: Reissue the command specifying a node that is not a
The fdasd command failed. CES node.
User response: 6027-2284 [E] The fileset
Examine the preceding messages, correct the afmshowhomesnapshot value is
problem, and reissue the command. 'yes'. The fileset mode cannot be
changed.
6027-2278 [E] Cannot partition DASD device
dasdDevice; it is already Explanation:
partitioned. The fileset afmshowhomesnapshot attribute value is
yes. The fileset mode change is not allowed.
Explanation:
The specified DASD device is already partitioned. User response:
First change the attribute afmshowhomesnapshot
User response: value to no, and then issue the command again to
Remove the existing partitions, or reissue the change the mode.
command using the desired partition name.
6027-2285 [E] Deletion of initial snapshot
6027-2279 [E] Unable to partition DASD device snapshotName of fileset
dasdDevice filesetName in file system
Explanation: fileSystem failed. The delete fileset
The fdasd command failed. operation failed at cache. Error
code errorCode.
User response:
Examine the preceding messages, correct the Explanation:
problem, and reissue the command. The deletion of the initial snapshot psnap0 of
filesetName failed. The primary and secondary
6027-2280 [E] The DASD device with bus ID filesets cannot be deleted without deleting the initial
busID cannot be found or it is in snapshot.
use.
User response:
Explanation: None.
The chccwdev command failed.
6027-2286 [E] RPO peer snapshots using
User response:
mmpsnap are allowed only for
Examine the preceding messages, correct the
primary filesets.
problem, and reissue the command.
Explanation:
6027-2281 [E] Error errno updating firmware for RPO snapshots can be created only for primary
enclosure enclosureIdentifier. filesets.
Explanation: User response:
The firmware load failed for the specified enclosure. Reissue the command with a valid primary fileset or
User response: without the --rpo option.
None. 6027-2287 The fileset needs to be linked to
6027-2282 [E] Action action is not allowed for change afmShowHomeSnapshot
secondary filesets. to 'no'.

Explanation: preceding messages and the GPFS log for additional
The afmShowHomeSnapshot value cannot be details.
changed to no if the fileset is unlinked.
User response:
User response: Verify that the fileset is a primary fileset and that it has
Link the fileset and reissue the command. psnap0 created and try again.
6027-2288 [E] Option optionName is not 6027-2293 [E] The peer snapshot creation failed
supported for AFM filesets. because fileset filesetName is in
filesetState state.
Explanation:
IAM modes are not supported for AFM filesets. Explanation:
For an active fileset, check the AFM target
User response:
configuration for peer snapshots. Ensure there is at
None.
least one gateway node configured for the cluster.
6027-2289 [E] Peer snapshot creation failed Examine the preceding messages and the GPFS log for
while running subCommand. Error additional details.
code errorCode
User response:
Explanation: None. The fileset needs to be in active or dirty state.
For an active fileset, check the AFM target
6027-2294 [E] Removing older peer snapshots
configuration for peer snapshots. Ensure there is at
failed while obtaining snap IDs.
least one gateway node configured for the cluster.
Error code errorCode
Examine the preceding messages and the GPFS log for
additional details. Explanation:
Ensure the fileset exists. Examine the preceding
User response:
messages and the GPFS log for additional details.
Correct the problems and reissue the command.
User response:
6027-2290 [E] The comment string should be less
Verify that snapshots exist for the given fileset.
than 50 characters long.
6027-2295 [E] Removing older peer snapshots
Explanation:
failed while obtaining old snap
The comment/prefix string of the snapshot is longer
IDs. Error code errorCode
than 50 characters.
Explanation:
User response:
Ensure the fileset exists. Examine the preceding
Reduce the comment string size and reissue the
messages and the GPFS log for additional details.
command.
User response:
6027-2291 [E] Peer snapshot creation failed
Verify that snapshots exist for the given fileset.
while generating snapshot name.
Error code errorCode 6027-2296 [E] Need a target to convert to the
primary fileset.
Explanation:
For an active fileset, check the AFM target Explanation:
configuration for peer snapshots. Ensure there is at Need a target to convert to the primary fileset.
least one gateway node configured for the cluster.
User response:
Examine the preceding messages and the GPFS log for
Specify a target to convert to the primary fileset.
additional details.
6027-2297 [E] The check-metadata and nocheck-
User response:
metadata options are not
Correct the problems and reissue the command.
supported for a non-AFM fileset.
6027-2292 [E] The initial snapshot psnap0Name
Explanation:
does not exist. The peer
The check-metadata and nocheck-metadata
snapshot creation failed. Error
options are not supported for a non-AFM fileset.
code errorCode
User response:
Explanation:
None.
For an active fileset, check the AFM target
configuration for peer snapshots. Ensure the initial 6027-2298 [E] Only independent filesets can be
peer snapshot exists for the fileset. Examine the converted to primary or secondary.
Explanation:

Only independent filesets can be converted to primary None.
or secondary.
6027-2305 [E] The mmafmctl convertToPrimary
User response: command is not allowed for this
Specify an independent fileset. primary fileset.
6027-2299 [E] Issue the mmafmctl getstate Explanation:
command to check fileset state The mmafmctl convertToPrimary command is not
and if required issue mmafmctl allowed for the primary fileset because it is not in
convertToPrimary. PrimInitFail state.
Explanation: User response:
Issue the mmafmctl getstate command to None.
check fileset state and if required issue mmafmctl
6027-2306 [E] Failed to check for cached files
convertToPrimary.
while doing primary conversion
User response: from filesetMode mode.
Issue the mmafmctl getstate command to
Explanation:
check fileset state and if required issue mmafmctl
Failed to check for cached files while doing primary
convertToPrimary.
conversion.
6027-2300 [E] The check-metadata and nocheck-
User response:
metadata options are not
None.
supported for the primary fileset.
6027-2307 [E] Uncached files present, run
Explanation:
prefetch first.
The check-metadata and nocheck-metadata
options are not supported for the primary fileset. Explanation:
Uncached files present.
User response:
None. User response:
Run prefetch and then do the conversion.
6027-2301 [E] The inband option is not supported
for the primary fileset. 6027-2308 [E] Uncached files present, run
prefetch first using policy output:
Explanation:
nodeDirFileOut.
The inband option is not supported for the primary
fileset. Explanation:
Uncached files present.
User response:
None. User response:
Run prefetch first using policy output.
6027-2302 [E] AFM target cannot be changed for
the primary fileset. 6027-2309 [E] Conversion to primary not allowed
for filesetMode mode.
Explanation:
AFM target cannot be changed for the primary fileset. Explanation:
Conversion to primary not allowed for this mode.
User response:
None. User response:
None.
6027-2303 [E] The inband option is not supported
for an AFM fileset. 6027-2310 [E] This option is available only for a
primary fileset.
Explanation:
The inband option is not supported for an AFM fileset. Explanation:
This option is available only for a primary fileset.
User response:
None. User response:
None.
6027-2304 [E] Target cannot be changed for an
AFM fileset. 6027-2311 [E] The target-only option is not
allowed for a promoted primary
Explanation:
without a target.
Target cannot be changed for an AFM fileset.
Explanation:
User response:

The target-only option is not allowed for a Investigate the failure in the CCR and fix the problem.
promoted primary without a target.
6027-2318 [E] Could not put localFilePath into the
User response: CCR as ccrName
None.
Explanation:
6027-2312 [E] Need a target to setup the new There was an error when trying to do an fput of a file
secondary. into the CCR.
Explanation: User response:
Target is required to setup the new secondary. Investigate the error and fix the problem.
User response: 6027-2319 [I] Version mismatch during upload of
None. fileName (version). Retrying.
6027-2313 [E] The target-only and inband Explanation:
options are not allowed together. The file could not be uploaded to the CCR because
another process updated it in the meantime. The file
Explanation:
will be downloaded, modified, and uploaded again.
The target-only and inband options are not
allowed together. User response:
None. The upload will automatically be tried again.
User response:
None. 6027-2320 directoryName does not resolve
to a directory in deviceName.
6027-2314 [E] Could not run commandName.
The directory must be within the
Verify that the Object protocol was
specified file system.
installed.
Explanation:
Explanation:
The cited directory does not belong to the specified
The mmcesobjlscfg command cannot find a
file system.
prerequisite command on the system.
User response:
User response:
Correct the directory name and reissue the command.
Install the missing command and try again.
6027-2321 [E] AFM primary or secondary filesets
6027-2315 [E] Could not determine CCR file for
cannot be created for file system
service serviceName
fileSystem because version is less
Explanation: than supportedVersion.
For the given service name, there is not a
Explanation:
corresponding file in the CCR.
The AFM primary or secondary filesets are not
User response: supported for a file system version that is less than
None. 14.20.
6027-2316 [E] Unable to retrieve file fileName User response:
from CCR using command Upgrade the file system and reissue the command.
command. Verify the file name is
6027-2322 [E] The OBJ service cannot be
correct and exists in CCR.
enabled because it is not installed.
Explanation: The file fileName was not found.
There was an error downloading a file from the CCR
Explanation:
repository.
The node could not enable the CES OBJ service
User response: because of a missing binary or configuration file.
Correct the error and try again.
User response:
6027-2317 [E] Unable to parse version number of Install the required software and retry the command.
file fileName from mmccr output
6027-2323 [E] The OBJ service cannot be
Explanation: enabled because the number of
The current version should be printed by mmccr when CES IPs below the minimum of
a file is extracted. The command could not read the minValue expected.
version number from the output and failed.
Explanation:
User response: The value of CES IPs was below the minimum.

User response: User response:
Add at least minValue CES IPs to the cluster. None.
6027-2324 [E] The object store for serviceName 6027-2331 [E] CCR value ccrValue not defined.
is either not a GPFS type or The OBJ service cannot be
mountPoint does not exist. enabled if identity authentication
is not configured.
Explanation:
The object store is not available at this time. Explanation:
Object authentication type was not found.
User response:
Verify that serviceName is a GPFS type. Verify that the User response:
mountPoint exists, the file system is mounted, or the Configure identity authentication and try again.
fileset is linked.
6027-2332 [E] Only regular independent filesets
6027-2325 [E] File fileName does not exist are converted to secondary
in CCR. Verify that the Object filesets.
protocol is correctly installed.
Explanation:
Explanation: Only regular independent filesets can be converted to
There was an error verifying Object config and ring files secondary filesets.
in the CCR repository.
User response:
User response: Specify a regular independent fileset and run the
Correct the error and try again. command again.
6027-2326 [E] The OBJ service cannot be 6027-2333 [E] Failed to disable serviceName
enabled because attribute service. Ensure authType
attributeName for a CES IP has authentication is removed.
not been defined. Verify that
Explanation:
the Object protocol is correctly
Disable CES service failed because authentication was
installed.
not removed.
Explanation:
User response:
There was an error verifying attributeName on CES IPs.
Remove authentication and retry.
User response:
6027-2334 [E] Fileset indFileset cannot be
Correct the error and try again.
changed because it has a
6027-2327 The snapshot snapshotName is the dependent fileset depFileset
wrong scope for use in targetType
Explanation:
backup
Filesets with dependent filesets cannot be converted
Explanation: to primary or secondary.
The snapshot specified is the wrong scope.
User response:
User response: This operation cannot proceed until all the dependent
Please provide a valid snapshot name for this backup filesets are unlinked.
type.
6027-2335 [E] Failed to convert fileset, because
6027-2329 [E] The fileset attributes cannot be the policy to detect special files is
set for the primary fileset with failing.
caching disabled.
Explanation:
Explanation: The policy to detect special files is failing.
The fileset attributes cannot be set for the primary
User response:
fileset with caching disabled.
Retry the command later.
User response:
6027-2336 [E] Immutable/append-only files or
None.
clones copied from a snapshot
6027-2330 [E] The outband option is not are present, hence conversion is
supported for AFM filesets. disallowed
Explanation: Explanation:
The outband option is not supported for AFM filesets.

Conversion is disallowed if immutable/append-only Explanation:
files or clones copied from a snapshot are present. Setup cannot be done on a fileset with this mode.
User response: User response:
Files should not be immutable/append-only. None.
6027-2337 [E] Conversion to primary is not 6027-2344 [E] The GPFS daemon must be active
allowed at this time. Retry the on the node from which the
command later. mmcmd is executed with option --
inode-criteria or -o.
Explanation:
Conversion to primary is not allowed at this time. Explanation:
The GPFS daemon needs to be active on the
User response:
node where the command is issued with --inode-
Retry the command later.
criteria or -o options.
6027-2338 [E] Conversion to primary is not
User response:
allowed because the state of the
Run the command where the daemon is active.
fileset is filesetState.
6027-2345 [E] The provided snapshot name must
Explanation:
be unique to list filesets in a
Conversion to primary is not allowed with the current
specific snapshot
state of the fileset.
Explanation:
User response:
The mmlsfileset command received a snapshot
Retry the command later.
name that is not unique.
6027-2339 [E] Orphans are present, run prefetch
User response:
first.
Correct the command invocation or remove the
Explanation: duplicate named snapshots and try again.
Orphans are present.
6027-2346 [E] The local node is not a CES node.
User response:
Explanation:
Run prefetch on the fileset and then do the conversion.
A local Cluster Export Service command was invoked
6027-2340 [E] Fileset was left in PrimInitFail on a node that is not defined as a Cluster Export
state. Take the necessary actions. Service node.
Explanation: User response:
The fileset was left in PrimInitFail state. Reissue the command on a CES node.
User response: 6027-2347 [E] Error changing export, export
Take the necessary actions. exportPath not found.
6027-2341 [E] This operation can be done only on Explanation:
a primary fileset A CES command was unable to change an export. The
exportPath was not found.
Explanation:
This is not a primary fileset. User response:
Correct problem and issue the command again.
User response:
None. 6027-2348 [E] A device for directoryName does
not exist or is not active on this
6027-2342 [E] Failover/resync is currently
node.
running so conversion is not
allowed Explanation:
The device containing the specified directory does not
Explanation:
exist or is not active on the node.
Failover/resync is currently running so conversion is
not allowed. User response:
Reissue the command with a correct directory or on an
User response:
appropriate node.
Retry the command later after failover/resync
completes. 6027-2349 [E] The fileset for junctionName
does not exist in the targetType
6027-2343 [E] DR Setup cannot be done on a
specified.
fileset with mode filesetMode.

Explanation: 6027-2355 [E] Unable to reload moduleName.
The fileset to back up cannot be found in the file Node hostname should be
system or snapshot specified. rebooted.
User response: Explanation:
Reissue the command with a correct name for the Host adapter firmware was updated so the specified
fileset, snapshot, or file system. module needs to be unloaded and reloaded. Linux
does not display the new firmware level until the
6027-2350 [E] The fileset for junctionName is not
module is reloaded.
linked in the targetType specified.
User response:
Explanation:
Reboot the node.
The fileset to back up is not linked in the file system or
snapshot specified. 6027-2356 [E] Node nodeName is being used as a
recovery group server.
User response:
Relink the fileset in the file system. Optionally create Explanation:
a snapshot and reissue the command with a correct The specified node is defined as a server node for
name for the fileset, snapshot, and file system. some disk.
6027-2351 [E] One or more unlinked filesets User response:
(filesetNames) exist in the If you are trying to delete the node from the GPFS
targetType specified. Check your cluster, you must either delete the disk or define
filesets and try again. another node as its server.
Explanation: 6027-2357 [E] Root fileset cannot be converted to
The file system to back up contains one or more primary fileset.
filesets that are unlinked in the file system or snapshot
Explanation:
specified.
Root fileset cannot be converted to the primary fileset.
User response:
User response:
Relink the fileset in the file system. Optionally create
None.
a snapshot and reissue the command with a correct
name for the fileset, snapshot, and file system. 6027-2358 [E] Root fileset cannot be converted to
6027-2352 secondary fileset.
The snapshot snapshotName could
not be found for use by Explanation:
commandName Root fileset cannot be converted to the secondary
fileset.
Explanation:
The snapshot specified could not be located. User response:
None.
User response:
Please provide a valid snapshot name. 6027-2359 [I] Attention: command is now
6027-2353 [E] enabled. This attribute can no
The snapshot name cannot be
longer be modified.
generated.
Explanation:
Explanation:
Indefinite retention protection is enabled. This value
The snapshot name cannot be generated.
cannot be changed in the future.
User response:
User response:
None.
None.
6027-2354 Node nodeName must be disabled
6027-2360 [E] The current value of command is
as a CES node before trying to
attrName. This value cannot be
remove it from the GPFS cluster.
changed.
Explanation:
Explanation:
The specified node is defined as a CES node.
Indefinite retention protection is enabled for this
User response: cluster and this attribute cannot be changed.
Disable the CES node and try again.
User response:
None.

6027-2361 [E] command is enabled. File systems cannot be deleted.
Explanation:
When indefinite retention protection is enabled the file systems cannot be deleted.
User response:
None.

6027-2362 [E] The current value of command is attrName. No changes made.
Explanation:
The current value and the request value are the same. No changes made.
User response:
None.

6027-2363 [E] Operation is not permitted as state of the fileset is filesetState.
Explanation:
This operation is not allowed with the current state of the fileset.
User response:
Retry the command later.

6027-2364 [E] Fileset name is missing.
Explanation:
This operation needs to be run for a particular fileset.
User response:
Retry the command with a fileset name.

6027-2365 [E] Firmware loader filename not executable.
Explanation:
The listed firmware loader is not executable.
User response:
Make the firmware loader executable and retry the command.

6027-2366 Node nodeName is being used as an NSD server. This may include Local Read Only Cache (LROC) storage. Review these details and determine the NSD type by running the mmlsnsd command. For standard NSDs, you must either delete the disk or define another node as its server. For nodes that include LROC NSDs (local cache) must have all the LROC NSDs removed before the node can be deleted. Fully review the mmdelnsd command documentation before making any changes.
Explanation:
The specified node is defined as a server node for some disk.
User response:
If you are trying to delete the node from the GPFS cluster, you must either delete the disk or define another node as its server.

6027-2367 [E] Fileset having iammode mode cannot be converted to primary fileset.
Explanation:
Fileset with Integrated Archive Manager (IAM) mode cannot be converted to primary fileset.
User response:
None.

6027-2368 [E] Unable to find information for Hypervisor.
Explanation:
The lscpu command failed.
User response:
Examine the preceding messages, correct the problem, and reissue the command.

6027-2369 [E] Unable to list DASD devices
Explanation:
The lsdasd command failed.
User response:
Examine the preceding messages, correct the problem, and reissue the command.

6027-2370 [E] Unable to flush buffer for DASD device name1
Explanation:
The blockdev --flushbufs command failed.
User response:
Examine the preceding messages, correct the problem, and reissue the command.

6027-2371 [E] Unable to read the partition table for DASD device dasdDevice.
Explanation:
The blockdev --rereadpt command failed.
User response:
Examine the preceding messages, correct the problem, and reissue the command.
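As an illustration of the review suggested in message 6027-2366 above, the NSD-to-server mapping can be checked and, for a standard NSD, the server list changed before the node is deleted. The NSD and node names are examples only, and the exact mmchnsd descriptor syntax should be confirmed in the command reference:

   # Show the NSD servers and local device names
   mmlsnsd -m
   # Define different server nodes for an NSD that currently lists the node to be deleted
   mmchnsd "nsd17:c1node4,c1node5"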



6027-2372 [E] Unable to find information to DASD device dasdDevice.
Explanation:
The dasdinfo command failed.
User response:
Examine the preceding messages, correct the problem, and reissue the command.

6027-2373 feature is only available in the IBM Storage Scale Advanced Edition.
Explanation:
The specified function or feature is only part of the IBM Storage Scale Advanced Edition.
User response:
Install the IBM Storage Scale Advanced Edition on all nodes in the cluster, and then reissue the command.

6027-2374 [E] Unable to delete recovery group name; as the associated VDisk sets are still defined.
Explanation:
Cannot delete a recovery group when vdisk sets are still associated with it.
User response:
Delete all the associated vdisk sets before deleting the recovery group.

6027-2376 [E] Node class nodeclass cannot be action. It is marked for use by Transparent Cloud Tiering. To remove this node class, first disable all the nodes with mmchnode --cloud-gateway-disable.
Explanation:
Cannot delete a node class that has cloud gateway enabled.
User response:
Disable the nodes first with mmchnode --cloud-gateway-disable.

6027-2377 [E] Node nodeclass cannot be deleted. It is marked for use by Transparent Cloud Tiering. To remove this node, first disable it with mmchnode --cloud-gateway-disable.
Explanation:
Cannot delete a node that has cloud gateway enabled.
User response:
Disable the node first with mmchnode --cloud-gateway-disable.

6027-2378 [E] To enable Transparent Cloud Tiering nodes, you must first enable the Transparent Cloud Tiering feature. This feature provides a new level of storage tiering capability to the IBM Storage Scale customer. Please contact your IBM Client Technical Specialist (or send an email to [email protected]) to review your use case of the Transparent Cloud Tiering feature and to obtain the instructions to enable the feature in your environment.
Explanation:
The Transparent Cloud Tiering feature must be enabled with assistance from IBM.
User response:
Contact IBM support for more information.

6027-2379 [E] The FBA-type DASD device dasdDevice is not a partition.
Explanation:
The FBA-type DASD device has to be a partition.
User response:
Reissue the command using the desired partition name.

6027-2380 [E] Support for FBA-type DASD device is not enabled. Run mmchconfig release=LATEST to activate the new function.
Explanation:
FBA-type DASD must be supported in the entire cluster.
User response:
Verify the IBM Storage Scale level on all nodes, update to the required level to support FBA by using the mmchconfig release=LATEST command, and reissue the command.

6027-2381 [E] Missing argument missingArg
Explanation:
An IBM Storage Scale administration command received an insufficient number of arguments.
User response:
Correct the command line and reissue the command.

6027-2382 [E] Conversion is not allowed for filesets with active clone files.
Explanation:
Conversion is disallowed if clones are present.
User response:
Remove the clones and try again.

6027-2383 [E] Conversion to secondary fileset has failed.
Explanation:
Fileset could not be converted to secondary.
User response:
Run the mmafmctl convertToSecondary command again.
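For message 6027-2383 above, the conversion can be retried once the underlying problem is corrected. A sketch, assuming a file system fs1 and an AFM DR fileset named drFileset; check the mmafmctl reference for the complete option list:

   # Retry converting the fileset to an AFM DR secondary
   mmafmctl fs1 convertToSecondary -j drFileset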



6027-2384 [E] No object storage policy found.
Explanation:
Error while retrieving object storage policies.
User response:
Verify if object protocol is enabled on all nodes, and reissue the command.

6027-2385 [E] Failed to create soft link between directories: directoryName1, directoryName2.
Explanation:
Error while creating soft link between provided fileset path and container path.
User response:
Examine the command output to determine the root cause.

6027-2386 [E] Provided fileset path filesetPath is already enabled for objectization.
Explanation:
The provided fileset path is already enabled for objectization.
User response:
Retry using a different fileset path.

6027-2387 [E] Provided container containerName is already enabled for objectization.
Explanation:
The provided container is already enabled for objectization.
User response:
Retry using a different container name.

6027-2388 [E] Given fileset: filesetName is not part of object file system: fileSystemName.
Explanation:
Provided fileset is derived from a non object file system.
User response:
Retry using a fileset that is derived from an object file system.

6027-2389 [E] Fileset path is already used by object protocol. It cannot be selected for objectization.
Explanation:
The provided fileset path is already in use by the object protocol.
User response:
Retry using a different fileset path.

6027-2390 [E] SELinux needs to be in either disabled or permissive mode.
Explanation:
The command validates SELinux state.
User response:
Retry with SELinux in disabled mode.

6027-2391 [E] The configuration of SED based encryption for the drive 'name1' is failed.
Explanation:
The enrollment of the SED drive for SED-based encryption has failed.
User response:
Rerun the command after fixing the drive.

6027-2392 [E] Found pdisk serialNumber in recovery group recoverygroupName has pdiskName paths.
Explanation:
The mmchfirmware command found a non-redundant pdisk. Proceeding could cause loss of data access.
User response:
Shut down GPFS in the cluster and retry the mmchfirmware command.

6027-2393 [E] Use the -N parameter to specify the nodes that have access to the hardware to be updated.
Explanation:
The mmchfirmware command was issued to update firmware, but no devices were found on the specified nodes.
User response:
Reissue the command with the -N parameter.

6027-2394 [E] No drive serial number was found for driveName.
Explanation:
The mmchcarrier command was unable to determine the drive serial number for the replacement drive.
User response:
Contact the IBM Support Center.
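For the SELinux requirement in message 6027-2390 above, the standard Linux tools can be used to verify and adjust the mode before the command is retried; editing /etc/selinux/config makes the change persistent across reboots:

   # Display the current SELinux mode
   getenforce
   # Switch to permissive mode for the current boot
   setenforce 0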



6027-2395 The feature is only available in the IBM Storage Scale Advanced Edition or the Data Management Edition.
Explanation:
The specified function or feature is only part of the IBM Storage Scale Advanced Edition or the Data Management Edition.
User response:
Install the IBM Storage Scale Advanced Edition or the Data Management Edition on all nodes in the cluster, and then reissue the command.

6027-2396 [E] Failed to verify the presence of uncached files while disabling AFM.
Explanation:
Failed to verify uncached files because the mmapplypolicy command that is run internally as part of disabling AFM is failing.
User response:
Rerun the command.

6027-2397 [E] Orphans are present, run prefetch first by using the policy output: nodeDirFileOut
Explanation:
Orphans are present.
User response:
Run prefetch first by using the policy output.

6027-2398 [E] Fileset is unlinked. Link the fileset and then rerun the command, so that the uncached files and orphans can be verified.
Explanation:
Fileset is unlinked and policy cannot be run on it to verify the uncached files and orphans.
User response:
Link the fileset first and then retry the command.

6027-2399 [E] The mmchfirmware command cannot be completed due to mixed code levels on the recovery group servers.
Explanation:
The mmchfirmware command discovered incompatible code levels on the recovery group servers.
User response:
Update the code levels on the recovery group servers and try again.

6027-2400 At least one node in the cluster must be defined as an admin node.
Explanation:
All nodes were explicitly designated or allowed to default to non-admin. At least one node must be designated as an admin node.
User response:
Specify which of the nodes must be considered as an admin node and then reissue the command.

6027-2401 [E] This cluster node is not designated as an admin node. Commands are allowed to only run on the admin nodes.
Explanation:
Only nodes that are designated as admin nodes are allowed to run commands. The node where the command was attempted to run is not an admin node.
User response:
Use the mmlscluster command to identify the admin nodes where commands can run. Use the mmchnode command to designate admin nodes in the cluster.

6027-2402 Missing option: MissingOption.
Explanation:
A GPFS administrative command is missing a required option.
User response:
Correct the command line and reissue the command.

6027-2403 Invalid argument for MissingOption1 option: MissingOption2
Explanation:
A GPFS administrative command option is invalid.
User response:
Correct the command line and reissue the command.

6027-2404 [E] No NVMe devices are in use by GNR.
Explanation:
Either no NVMe devices exist or GNR is not using any NVMe devices.
User response:
Verify whether recovery groups exist and are configured to use NVMe devices.

6027-2405 [E] The NVMe device NVMeDeviceName is not in use by GNR.
Explanation:
Either the specified NVMe device does not exist or GNR is not configured to use the specified NVMe device.
User response:
Verify whether recovery groups exist and are configured to use the specified NVMe device.

6027-2406 [E] The recovery group servers are not found.
Explanation:
There are no recovery groups configured.
User response:
Use the -N option to specify node names or a node class.
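When mixed code levels are reported, as in message 6027-2399 above, the installed level can be compared on each recovery group server and, after all nodes are upgraded, the cluster can be moved to the latest function level (this also addresses the condition in message 6027-2380). A sketch of the typical checks:

   # Run on each recovery group server to compare the installed daemon level
   mmdiag --version
   # Show the current cluster-wide function level
   mmlsconfig minReleaseLevel
   # After all nodes are at the required level, activate the new functions
   mmchconfig release=LATEST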



6027-2407 [E] Vdisk (vdiskName) with block size (blockSize) cannot be allocated in storage pool (storagePoolName) with block size (blockSize).
Explanation:
The mmcrfs command specified an NSD but its underlying vdisk block size does not match the storage pool block size.
User response:
Reissue the command using block sizes that match.

6027-2408 [E] Node nodeName cannot be deleted. It is an active message queue server.
Explanation:
Cannot delete a node that is an active message queue server.
User response:
Remove the node from active message queue server.

6027-2409 [E] Incorrect node specified for command.
Explanation:
Specified node is invalid.
User response:
Specify a valid node.

6027-2410 [E] Error found while checking disk descriptor.
Explanation:
An error is found in the disk descriptor.
User response:
Check the preceding messages, if any, and correct the condition that caused the error in the disk descriptor.

6027-2411 [E] The specified name is not allowed because it contains an invalid special character.
Explanation:
The cited name is not allowed because it contains an invalid special character.
User response:
Specify a name that does not contain an invalid special character, and reissue the command.

6027-2412 [E] To use this new CLI, you must first enable the QOS for Filesets feature. Contact your IBM Client Technical Specialist or send an email to [email protected] to obtain the instructions to enable the feature in your environment.
Explanation:
The QOS for Filesets feature must be enabled with assistance from IBM.
User response:
Contact IBM support for more information.

6027-2413 [E] Node nodeName cannot be deleted. This node is configured as a message queue Zookeeper server.
Explanation:
Cannot delete a node that is configured as a message queue Zookeeper server.
User response:
Remove the node from message queue Zookeeper server.

6027-2414 [E] The value of encryptionKeyCacheExpiration parameter provided is smaller than the minimum value allowed of 60 seconds.
Explanation:
The value provided is smaller than the minimum allowed.
User response:
Provide a value greater than or equal to 60.

6027-2415 Configuration propagation is asynchronous. Run the CLI command mmuserauth service check to query the status across all nodes.
Explanation:
Authentication configuration propagation is asynchronous in nature. In a large cluster, configuration update and enablement of necessary services may take a while. It is vital to confirm that the expected settings are applied successfully across all the nodes.
User response:
Confirm that the authentication configuration has been applied on all the protocol nodes. The check can be performed by executing the validation CLI command. The status of all the checks must be OK for all the protocol nodes.

6027-2416 File system fileSystem is mounted but inaccessible. Try remounting the file system.
Explanation:
The command tried to access a mounted file system but received an ESTALE error. This is usually because a file system has crashed or become unexpectedly unavailable.
User response:
Recover and remount the file system and retry the command.
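For the stale mount condition in message 6027-2416 above, remounting the file system normally clears the ESTALE error. A minimal sketch, assuming the file system is named fs1:

   # Unmount and remount the file system on all nodes
   mmumount fs1 -a
   mmmount fs1 -a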



6027-2417 Invalid current working directory detected: directory
Explanation:
The current working directory does not exist and the command cannot run.
User response:
Change to a valid directory, then retry the command.

6027-2418 Cannot convert AFM or root fileset to manual-updates mode.
Explanation:
Existing AFM or root fileset cannot be converted to manual-updates mode.
User response:
Correct the command invocation and try again.

6027-2419 [E] AFM manual-updates filesets cannot be created for file system fileSystem because version is less than supportedVersion.
Explanation:
The AFM manual-updates filesets are not supported for a file system version that is less than 26.00.
User response:
Upgrade the file system and reissue the command.

6027-2420 [E] The SED based unlock for the drive driveName is failed.
Explanation:
The unlock of the locked SED drive has failed.
User response:
Rerun the command after verifying the SED drive locked keys.

6027-2421 [E] Failed to get the pdisk paths list of a recovery group recoveryGroupName.
Explanation:
An error occurred when trying to get the pdisk paths list of the given recovery group.
User response:
Examine the error messages and logs to determine the problem, correct it, and reissue the command.

6027-2422 [E] Unable to cryptoerase recoveryGroupName.
Explanation:
An error occurred when trying to get the pdisk paths list of the given recovery group for cryptoerase.
User response:
Examine the error messages and logs to determine and correct the problem.

6027-2423 [E] Unable to sanitize recovery group name.
Explanation:
An error occurred when trying to get the pdisk paths list of the given recovery group for sanitize.
User response:
Examine the error messages and logs to determine and correct the problem.

6027-2424 [E] The sanitize of SED drive failed for recovery group Recovery Group Name.
Explanation:
The sanitize of the SED drives of the given recovery group failed.
User response:
Check the error message and try to sanitize the drives individually by using the mmsed command.

6027-2425 [E] Changing the gateway designation with active I/O to AFM filesets could cause loss of data access.
Explanation:
The mmchnode command to add/remove a gateway node role must be run after ensuring that the cluster has no active I/Os on AFM filesets.
User response:
Ensure that no I/O is on AFM filesets in the cluster before running the mmchnode command.

6027-2426 [E] Unable to sanitize or crypto erase a disk diskSanitizeOrCryptoerase at a given drive path or user location drivePathOrUserLocation due to the failure reason failureReason.
Explanation:
Due to some reason, such as no paths to the drive, the sanitize or cryptoerase operation cannot be performed on the drive.
User response:
Check the disk at the given user location or path, determine and correct the problem, and manually perform the sanitize or cryptoerase operation.
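For message 6027-2425 above, the gateway designation should be changed only after AFM activity is quiesced. A sketch, assuming file system fs1 and a node named c1node6:

   # Confirm that the AFM queues are empty for the filesets in the file system
   mmafmctl fs1 getstate
   # Then add (or, with --nogateway, remove) the gateway role
   mmchnode --gateway -N c1node6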



6027-2427 [E] Some of the disks are not sanitized or cryptoerased diskSanitizeOrCryptoerase successfully. Examine previous messages to determine the cause.
Explanation:
During recoverygroup deletion, the sanitize or cryptoerase operation failed on some drives.
User response:
Check the previous error messages of the failed sanitize or crypto erase operation on the disks, correct the problem, and manually perform the sanitize or cryptoerase operation on the disks.

6027-2500 mmsanrepairfs already in progress for "name"
Explanation:
This is an output from mmsanrepairfs when another mmsanrepairfs command is already running.
User response:
Wait for the currently running command to complete and reissue the command.

6027-2501 Could not allocate storage.
Explanation:
Sufficient memory could not be allocated to run the mmsanrepairfs command.
User response:
Increase the amount of memory available.

6027-2503 "name" is not SANergy enabled.
Explanation:
The file system is not SANergy enabled; there is nothing to repair on this file system.
User response:
None. mmsanrepairfs cannot be run against this file system.

6027-2504 Waiting number seconds for SANergy data structure cleanup.
Explanation:
This is an output from mmsanrepairfs reporting a delay in the command completion because it must wait until internal SANergy cleanup occurs.
User response:
None. Information message only.

6027-2576 [E] Error: Daemon value kernel value PAGE_SIZE mismatch.
Explanation:
The GPFS kernel extension loaded in memory does not have the same PAGE_SIZE value as the GPFS daemon PAGE_SIZE value that was returned from the POSIX sysconf API.
User response:
Verify that the kernel header files used to build the GPFS portability layer are the same kernel header files used to build the running kernel.

6027-2600 Cannot create a new snapshot until an existing one is deleted. File system fileSystem has a limit of number online snapshots.
Explanation:
The file system has reached its limit of online snapshots.
User response:
Delete an existing snapshot, then issue the create snapshot command again.

6027-2601 Snapshot name dirName already exists.
Explanation:
This message is issued by the tscrsnapshot command.
User response:
Delete existing file/directory and reissue the command.

6027-2602 Unable to delete snapshot snapshotName from file system fileSystem. rc=returnCode.
Explanation:
This message is issued by the tscrsnapshot command.
User response:
Delete the snapshot using the tsdelsnapshot command.

6027-2603 Unable to get permission to create snapshot, rc=returnCode.
Explanation:
This message is issued by the tscrsnapshot command.
User response:
Reissue the command.

6027-2604 Unable to quiesce all nodes, rc=returnCode.
Explanation:
This message is issued by the tscrsnapshot command.
User response:
Restart failing nodes or switches and reissue the command.

6027-2605 Unable to resume all nodes, rc=returnCode.
Explanation:
This message is issued by the tscrsnapshot command.
User response:
Restart failing nodes or switches.

6027-2606 Unable to sync all nodes, rc=returnCode.
Explanation:
This message is issued by the tscrsnapshot command.
User response:
Restart failing nodes or switches and reissue the command.
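When the snapshot limit in message 6027-2600 above is reached, an existing snapshot has to be removed before a new one can be created. A sketch, assuming file system fs1 and illustrative snapshot names:

   # List the existing snapshots, delete an old one, then create the new one
   mmlssnapshot fs1
   mmdelsnapshot fs1 snap_old
   mmcrsnapshot fs1 snap_new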



6027-2607 Cannot create new snapshot until an existing one is deleted. Fileset filesetName has a limit of number snapshots.
Explanation:
The fileset has reached its limit of snapshots.
User response:
Delete an existing snapshot, then issue the create snapshot command again.

6027-2608 Cannot create new snapshot: state of fileset filesetName is inconsistent (badState).
Explanation:
An operation on the cited fileset is incomplete.
User response:
Complete pending fileset actions, then issue the create snapshot command again.

6027-2609 Fileset named filesetName does not exist.
Explanation:
One of the filesets listed does not exist.
User response:
Specify only existing fileset names.

6027-2610 File system fileSystem does not contain snapshot snapshotName err = number.
Explanation:
An incorrect snapshot name was specified.
User response:
Select a valid snapshot and issue the command again.

6027-2611 Cannot delete snapshot snapshotName which is in state snapshotState.
Explanation:
The snapshot cannot be deleted while it is in the cited transition state because of an in-progress snapshot operation.
User response:
Wait for the in-progress operation to complete and then reissue the command.

6027-2612 Snapshot named snapshotName does not exist.
Explanation:
A snapshot to be listed does not exist.
User response:
Specify only existing snapshot names.

6027-2613 Cannot restore snapshot. fileSystem is mounted on number node(s) and in use on number node(s).
Explanation:
This message is issued by the tsressnapshot command.
User response:
Unmount the file system and reissue the restore command.

6027-2614 File system fileSystem does not contain snapshot snapshotName err = number.
Explanation:
An incorrect snapshot name was specified.
User response:
Specify a valid snapshot and issue the command again.

6027-2615 Cannot restore snapshot snapshotName which is snapshotState, err = number.
Explanation:
The specified snapshot is not in a valid state.
User response:
Specify a snapshot that is in a valid state and issue the command again.

6027-2616 Restoring snapshot snapshotName requires quotaTypes quotas to be enabled.
Explanation:
The snapshot being restored requires quotas to be enabled, since they were enabled when the snapshot was created.
User response:
Issue the recommended mmchfs command to enable quotas.

6027-2617 You must run: mmchfs fileSystem -Q yes.
Explanation:
The snapshot being restored requires quotas to be enabled, since they were enabled when the snapshot was created.
User response:
Issue the cited mmchfs command to enable quotas.
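As message 6027-2617 above states, quotas have to be enabled before the snapshot restore can proceed. For example, for a file system named fs1:

   # Enable quota enforcement and bring the usage counts up to date
   mmchfs fs1 -Q yes
   mmcheckquota fs1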



6027-2618 [N] Restoring snapshot snapshotName in file system fileSystem requires quotaTypes quotas to be enabled.
Explanation:
The snapshot being restored in the cited file system requires quotas to be enabled, since they were enabled when the snapshot was created.
User response:
Issue the mmchfs command to enable quotas.

6027-2619 Restoring snapshot snapshotName requires quotaTypes quotas to be disabled.
Explanation:
The snapshot being restored requires quotas to be disabled, since they were not enabled when the snapshot was created.
User response:
Issue the cited mmchfs command to disable quotas.

6027-2620 You must run: mmchfs fileSystem -Q no.
Explanation:
The snapshot being restored requires quotas to be disabled, since they were not enabled when the snapshot was created.
User response:
Issue the cited mmchfs command to disable quotas.

6027-2621 [N] Restoring snapshot snapshotName in file system fileSystem requires quotaTypes quotas to be disabled.
Explanation:
The snapshot being restored in the cited file system requires quotas to be disabled, since they were disabled when the snapshot was created.
User response:
Issue the mmchfs command to disable quotas.

6027-2623 [E] Error deleting snapshot snapshotName in file system fileSystem err number
Explanation:
The cited snapshot could not be deleted during file system recovery.
User response:
Run the mmfsck command to recover any lost data blocks.

6027-2624 Previous snapshot snapshotName is not valid and must be deleted before a new snapshot may be created.
Explanation:
The cited previous snapshot is not valid and must be deleted before a new snapshot may be created.
User response:
Delete the previous snapshot using the mmdelsnapshot command, and then reissue the original snapshot command.

6027-2625 Previous snapshot snapshotName must be restored before a new snapshot may be created.
Explanation:
The cited previous snapshot must be restored before a new snapshot may be created.
User response:
Run mmrestorefs on the previous snapshot, and then reissue the original snapshot command.

6027-2626 Previous snapshot snapshotName is not valid and must be deleted before another snapshot may be deleted.
Explanation:
The cited previous snapshot is not valid and must be deleted before another snapshot may be deleted.
User response:
Delete the previous snapshot using the mmdelsnapshot command, and then reissue the original snapshot command.

6027-2627 Previous snapshot snapshotName is not valid and must be deleted before another snapshot may be restored.
Explanation:
The cited previous snapshot is not valid and must be deleted before another snapshot may be restored.
User response:
Delete the previous snapshot using the mmdelsnapshot command, and then reissue the original snapshot command.

6027-2628 More than one snapshot is marked for restore.
Explanation:
More than one snapshot is marked for restore.
User response:
Restore the previous snapshot and then reissue the original snapshot command.

6027-2629 Offline snapshot being restored.
Explanation:
An offline snapshot is being restored.
User response:
When the restore of the offline snapshot completes, reissue the original snapshot command.

6027-2630 Program failed, error number.
Explanation:



The tssnaplatest command encountered an error The disk descriptors for this command must include
and printErrnoMsg failed. one and only one storage pool that is allowed to
contain metadata.
User response:
Correct the problem shown and reissue the command. User response:
Modify the command's disk descriptors and reissue
6027-2631 Attention: Snapshot
the command.
snapshotName was being restored
to fileSystem. 6027-2638 Maximum of number storage pools
allowed.
Explanation:
A file system in the process of a snapshot restore Explanation:
cannot be mounted except under a restricted mount. The cited limit on the number of storage pools that
may be defined has been exceeded.
User response:
None. Informational message only. User response:
Modify the command's disk descriptors and reissue
6027-2633 Attention: Disk configuration for
the command.
fileSystem has changed while tsdf
was running. 6027-2639 Incorrect fileset name
filesetName.
Explanation:
The disk configuration for the cited file system Explanation:
changed while the tsdf command was running. The fileset name provided in the command invocation
is incorrect.
User response:
Reissue the mmdf command. User response:
Correct the fileset name and reissue the command.
6027-2634 Attention: number of number
regions in fileSystem were 6027-2640 Incorrect path to fileset junction
unavailable for free space. filesetJunction.
Explanation: Explanation:
Some regions could not be accessed during the tsdf The path to the cited fileset junction is incorrect.
run. Typically, this is due to utilities such mmdefragfs
User response:
or mmfsck running concurrently.
Correct the junction path and reissue the command.
User response:
6027-2641 Incorrect fileset junction name
Reissue the mmdf command.
filesetJunction.
6027-2635 The free space data is not
Explanation:
available. Reissue the command
The cited junction name is incorrect.
without the -q option to collect it.
User response:
Explanation:
Correct the junction name and reissue the command.
The existing free space information for the file system
is currently unavailable. 6027-2642 Specify one and only one of
FilesetName or -J JunctionPath.
User response:
Reissue the mmdf command. Explanation:
The change fileset and unlink fileset commands accept
6027-2636 Disks in storage pool storagePool
either a fileset name or the fileset's junction path to
must have disk usage type
uniquely identify the fileset. The user failed to provide
dataOnly.
either of these, or has tried to provide both.
Explanation:
User response:
A non-system storage pool cannot hold metadata or
Correct the command invocation and reissue the
descriptors.
command.
User response:
6027-2643 Cannot create a new fileset until
Modify the command's disk descriptors and reissue
an existing one is deleted. File
the command.
system fileSystem has a limit of
6027-2637 The file system must contain at maxNumber filesets.
least one disk for metadata.
Explanation:
Explanation:



An attempt to create a fileset for the cited file system Remove all files and directories from the fileset,
failed because it would exceed the cited limit. or specify the -f option to the mmdelfileset
command.
User response:
Remove unneeded filesets and reissue the command. 6027-2650 Fileset information is not
available.
6027-2644 Comment exceeds maximum
length of maxNumber characters. Explanation:
A fileset command failed to read file system metadata
Explanation:
file. The file system may be corrupted.
The user-provided comment for the new fileset
exceeds the maximum allowed length. User response:
Run the mmfsck command to recover the file system.
User response:
Shorten the comment and reissue the command. 6027-2651 Fileset filesetName cannot be
unlinked.
6027-2645 Fileset filesetName already exists.
Explanation:
Explanation:
The user tried to unlink the root fileset, or is not
An attempt to create a fileset failed because the
authorized to unlink the selected fileset.
specified fileset name already exists.
User response:
User response:
None. The fileset cannot be unlinked.
Select a unique name for the fileset and reissue the
command. 6027-2652 Fileset at junctionPath cannot be
unlinked.
6027-2646 Unable to sync all nodes while
quiesced, rc=returnCode Explanation:
The user tried to unlink the root fileset, or is not
Explanation:
authorized to unlink the selected fileset.
This message is issued by the tscrsnapshot
command. User response:
None. The fileset cannot be unlinked.
User response:
Restart failing nodes or switches and reissue the 6027-2653 Failed to unlink fileset filesetName
command. from filesetName.
6027-2647 Fileset filesetName must be Explanation:
unlinked to be deleted. An attempt was made to unlink a fileset that is linked
to a parent fileset that is being deleted.
Explanation:
The cited fileset must be unlinked before it can be User response:
deleted. Delete or unlink the children, and then delete the
parent fileset.
User response:
Unlink the fileset, and then reissue the delete 6027-2654 Fileset filesetName cannot be
command. deleted while other filesets are
linked to it.
6027-2648 Filesets have not been enabled for
file system fileSystem. Explanation:
The fileset to be deleted has other filesets linked to
Explanation:
it, and cannot be deleted without using the -f flag, or
The current file system format version does not
unlinking the child filesets.
support filesets.
User response:
User response:
Delete or unlink the children, and then delete the
Change the file system format version by issuing
parent fileset.
mmchfs -V.
6027-2655 Fileset filesetName cannot be
6027-2649 Fileset filesetName contains user
deleted.
files and cannot be deleted unless
the -f option is specified. Explanation:
The user is not allowed to delete the root fileset.
Explanation:
An attempt was made to delete a non-empty fileset. User response:
None. The fileset cannot be deleted.
User response:



6027-2656 Unable to quiesce fileset at all Explanation:
nodes. The directory specified for the junction has too many
links.
Explanation:
An attempt to quiesce the fileset at all nodes failed. User response:
Select a new directory for the link and reissue the
User response: command.
Check communication hardware and reissue the
command. 6027-2663 Fileset filesetName cannot be
changed.
6027-2657 Fileset filesetName has open files.
Specify -f to force unlink. Explanation:
The user specified a fileset to tschfileset that
Explanation: cannot be changed.
An attempt was made to unlink a fileset that has open
files. User response:
None. You cannot change the attributes of the root
User response: fileset.
Close the open files and then reissue command, or
use the -f option on the unlink command to force the 6027-2664 Fileset at pathName cannot be
open files to close. changed.
6027-2658 Fileset filesetName cannot be Explanation:
linked into a snapshot at The user specified a fileset to tschfileset that
pathName. cannot be changed.
Explanation: User response:
The user specified a directory within a snapshot for the None. You cannot change the attributes of the root
junction to a fileset, but snapshots cannot be modified. fileset.
User response: 6027-2665 mmfileid already in progress for
Select a directory within the active file system, and name.
reissue the command. Explanation:
6027-2659 Fileset filesetName is already An mmfileid command is already running.
linked. User response:
Explanation: Wait for the currently running command to complete,
The user specified a fileset that was already linked. and issue the new command again.
User response: 6027-2666 mmfileid can only handle a
Unlink the fileset and then reissue the link command. maximum of diskAddresses disk
addresses.
6027-2660 Fileset filesetName cannot be
linked. Explanation:
Too many disk addresses specified.
Explanation:
The fileset could not be linked. This typically happens User response:
when the fileset is in the process of being deleted. Provide less than 256 disk addresses to the command.
User response: 6027-2667 [I] Allowing block allocation for file
None. system fileSystem that makes a file
ill-replicated due to insufficient
6027-2661 Fileset junction pathName already resource and puts data at risk.
exists.
Explanation:
Explanation: The partialReplicaAllocation file system option
A file or directory already exists at the specified allows allocation to succeed even when all replica
junction. blocks cannot be allocated. The file was marked as not
User response: replicated correctly and the data may be at risk if one
Select a new junction name or a new directory for the of the remaining disks fails.
link and reissue the link command. User response:
6027-2662 Directory pathName for junction None. Informational message only.
has too many links.



6027-2670 Fileset name filesetName not User response:
found. Unmount the file system and run the kwdmmfsck
command to repair the file system.
Explanation:
The fileset name that was specified with the command 6027-2675 Only file systems with NFSv4
invocation was not found. ACL semantics enabled can be
mounted on this platform.
User response:
Correct the fileset name and reissue the command. Explanation:
A user is trying to mount a file system on Microsoft
6027-2671 Fileset command on fileSystem Windows, but the ACL semantics disallow NFSv4 ACLs.
failed; snapshot snapshotName
must be restored first. User response:
Enable NFSv4 ACL semantics using the mmchfs
Explanation: command (-k option)
The file system is being restored either from an offline
backup or a snapshot, and the restore operation has 6027-2676 Only file systems with NFSv4
not finished. Fileset commands cannot be run. locking semantics enabled can be
mounted on this platform.
User response:
Run the mmrestorefs command to complete the Explanation:
snapshot restore operation or to finish the offline A user is trying to mount a file system on Microsoft
restore, then reissue the fileset command. Windows, but the POSIX locking semantics are in
effect.
6027-2672 Junction parent directory inode
number inodeNumber is not valid. User response:
Enable NFSv4 locking semantics using the mmchfs
Explanation: command (-D option).
An inode number passed to tslinkfileset is not
valid. 6027-2677 Fileset filesetName has pending
changes that need to be synced.
User response:
Check the mmlinkfileset command arguments for Explanation:
correctness. If a valid junction path was provided, A user is trying to change a caching option for a fileset
contact the IBM Support Center. while it has local changes that are not yet synced with
the home server.
6027-2673 [X] Duplicate owners of an allocation
region (index indexNumber, region User response:
regionNumber, pool poolNumber) Perform AFM recovery before reissuing the command.
were detected for file system 6027-2678 File system fileSystem is mounted
fileSystem: nodes nodeName and on nodes nodes or fileset
nodeName. filesetName is not unlinked.
Explanation: Explanation:
The allocation region should not have duplicate A user is trying to change a caching feature for a fileset
owners. while the file system is still mounted or the fileset is
User response: still linked.
Contact the IBM Support Center. User response:
6027-2674 [X] The owner of an allocation Unmount the file system from all nodes or unlink the
region (index indexNumber, region fileset before reissuing the command.
regionNumber, pool poolNumber) 6027-2679 Mount of fileSystem failed because
that was detected for file system mount event not handled by any
fileSystem: node nodeName is not data management application.
valid.
Explanation:
Explanation: The mount failed because the file system is enabled
The file system had detected a problem with the for DMAPI events (-z yes), but there was no data
ownership of an allocation region. This may result management application running to handle the event.
in a corrupted file system and loss of data. One or
more nodes may be terminated to prevent any further User response:
damage to the file system. Make sure the DM application (for example HSM or
HPSS) is running before the file system is mounted.



6027-2680 AFM filesets cannot be created for fileset. Try reducing the maximum
file system fileSystem. inode limits for some of the inode
spaces in fileSystem.
Explanation:
The current file system format version does not Explanation:
support AFM-enabled filesets; the -p option cannot be The number of inodes available is too small to create a
used. new inode space.
User response: User response:
Change the file system format version by issuing Reduce the maximum inode limits and issue the
mmchfs -V. command again.
6027-2681 Snapshot snapshotName has 6027-2688 Only independent filesets can
linked independent filesets be configured as AFM filesets.
The --inode-space=new option is
Explanation: required.
The specified snapshot is not in a valid state.
Explanation:
User response: Only independent filesets can be configured for
Correct the problem and reissue the command. caching.
6027-2682 [E] Set quota file attribute error User response:
(reasonCode)explanation Specify the --inode-space=new option.
Explanation: 6027-2689 The value for --block-size must
While mounting a file system a new quota file failed
be the keyword auto or the value
to be created due to inconsistency with the current
must be of the form [n]K, [n]M,
degree of replication or the number of failure groups.
[n]G or [n]T, where n is an optional
User response: integer in the range 1 to 1023.
Disable quotas. Check and correct the degree of
Explanation:
replication and the number of failure groups. Re-
An invalid value was specified with the --block-
enable quotas.
size option.
6027-2683 Fileset filesetName in file system User response:
fileSystem does not contain Reissue the command with a valid option.
snapshot snapshotName, err =
number 6027-2690 Fileset filesetName can only be
linked within its own inode space.
Explanation:
An incorrect snapshot name was specified. Explanation:
A dependent fileset can only be linked within its own
User response: inode space.
Select a valid snapshot and issue the command again.
User response:
6027-2684 File system fileSystem does Correct the junction path and reissue the command.
not contain global snapshot
snapshotName, err = number 6027-2691 The fastea feature needs to be
enabled for file system fileSystem
Explanation:
before creating AFM filesets.
An incorrect snapshot name was specified.
Explanation:
User response: The current file system on-disk format does not
Select a valid snapshot and issue the command again. support storing of extended attributes in the file's
6027-2685 Total file system capacity inode. This is required for AFM-enabled filesets.
allows minMaxInodes inodes in User response:
fileSystem. Currently the total Use the mmmigratefs command to enable the fast
inode limits used by all the extended-attributes feature.
inode spaces in inodeSpace is
inodeSpaceLimit. There must be at 6027-2692 Error encountered while
least number inodes available to processing the input file.
create a new inode space. Use the Explanation:
mmlsfileset -L command to show The tscrsnapshot command encountered an error
the maximum inode limits of each while processing the input file.

864 IBM Storage Scale 5.1.9: Problem Determination Guide


User response: Explanation:
Check and validate the fileset names listed in the input A user is trying to change a caching feature for a fileset
file. while the file system is still mounted or the fileset is
still linked.
6027-2693 Fileset junction name
junctionName conflicts with the User response:
current setting of mmsnapdir. Unmount the file system from all nodes or unlink the
fileset before reissuing the command.
Explanation:
The fileset junction name conflicts with the current 6027-2699 Cannot create a new independent
setting of mmsnapdir. fileset until an existing one is
deleted. File system fileSystem has
User response:
a limit of maxNumber independent
Select a new junction name or a new directory for the
filesets.
link and reissue the mmlinkfileset command.
Explanation:
6027-2694 [I] The requested maximum number
An attempt to create an independent fileset for the
of inodes is already at number.
cited file system failed because it would exceed the
Explanation: cited limit.
The specified number of nodes is already in effect.
User response:
User response: Remove unneeded independent filesets and reissue
This is an informational message. the command.
6027-2695 [E] The number of inodes to 6027-2700 [E] A node join was rejected. This
preallocate cannot be higher than could be due to incompatible
the maximum number of inodes. daemon versions, failure to find
the node in the configuration
Explanation:
database, or no configuration
The specified number of nodes to preallocate is not
manager found.
valid.
Explanation:
User response:
A request to join nodes was explicitly rejected.
Correct the --inode-limit argument then retry the
command. User response:
Verify that compatible versions of GPFS are installed
6027-2696 [E] The number of inodes to
on all nodes. Also, verify that the joining node is in the
preallocate cannot be lower
configuration database.
than the number inodes already
allocated. 6027-2701 The mmpmon command file is
empty.
Explanation:
The specified number of nodes to preallocate is not Explanation:
valid. The mmpmon command file is empty.
User response: User response:
Correct the --inode-limit argument then retry the Check file size, existence, and access permissions.
command.
6027-2702 Unexpected mmpmon response
6027-2697 Fileset at junctionPath has from file system daemon.
pending changes that need to be
Explanation:
synced.
An unexpected response was received to an mmpmon
Explanation: request.
A user is trying to change a caching option for a fileset
User response:
while it has local changes that are not yet synced with
Ensure that the mmfsd daemon is running. Check the
the home server.
error log. Ensure that all GPFS software components
User response: are at the same version.
Perform AFM recovery before reissuing the command.
6027-2703 Unknown mmpmon command
6027-2698 File system fileSystem is mounted command.
on nodes nodes or fileset at
Explanation:
junctionPath is not unlinked.



An unknown mmpmon command was read from the User response:
input file. Add the specified node to the node list and reissue the
command.
User response:
Correct the command and rerun. 6027-2710 [E] Node nodeName is being expelled
due to expired lease.
6027-2704 Permission failure. The command
requires root authority to execute. Explanation:
The nodes listed did not renew their lease in a timely
Explanation:
fashion and will be expelled from the cluster.
The mmpmon command was issued with a nonzero UID.
User response:
User response:
Check the network connection between this node and
Log on as root and reissue the command.
the node specified above.
6027-2705 Could not establish connection to
6027-2711 [E] File system table full.
file system daemon.
Explanation:
Explanation:
The mmfsd daemon cannot add any more file systems
The connection between a GPFS command and
to the table because it is full.
the mmfsd daemon could not be established. The
daemon may have crashed, or never been started, User response:
or (for mmpmon) the allowed number of simultaneous None. Informational message only.
connections has been exceeded.
6027-2712 Option 'optionName' has been
User response: deprecated.
Ensure that the mmfsd daemon is running. Check the
Explanation:
error log. For mmpmon, ensure that the allowed number
The option that was specified with the command is no
of simultaneous connections has not been exceeded.
longer supported. A warning message is generated to
6027-2706 [I] Recovered number nodes. indicate that the option has no effect.
Explanation: User response:
The asynchronous part (phase 2) of node failure Correct the command line and then reissue the
recovery has completed. command.
User response: 6027-2713 Permission failure. The command
None. Informational message only. requires SuperuserName authority
to execute.
6027-2707 [I] Node join protocol waiting value
seconds for node recovery Explanation:
The command, or the specified command option,
Explanation:
requires administrative authority.
Node join protocol is delayed until phase 2 of previous
node failure recovery protocol is complete. User response:
Log on as a user with administrative privileges and
User response:
reissue the command.
None. Informational message only.
6027-2714 Could not appoint node nodeName
6027-2708 [E] Rejected node join protocol. Phase
as cluster manager. errorString
two of node failure recovery
appears to still be in progress. Explanation:
The mmchmgr -c command generates this message
Explanation:
if the specified node cannot be appointed as a new
Node join protocol is rejected after a number of
cluster manager.
internal delays and phase two node failure protocol is
still in progress. User response:
Make sure that the specified node is a quorum node
User response:
and that GPFS is running on that node.
None. Informational message only.
6027-2715 Could not appoint a new cluster
6027-2709 Configuration manager node
manager. errorString
nodeName not found in the node
list. Explanation:
The mmchmgr -c command generates this message
Explanation:
when a node is not available as a cluster manager.
The specified node was not found in the node list.



User response: Explanation:
Make sure that GPFS is running on a sufficient number This is an informational message when a new cluster
of quorum nodes. manager takes over.
6027-2716 [I] Challenge response received; User response:
canceling disk election. None. Informational message only.
Explanation: 6027-2724 [I] reasonString. Probing cluster
The node has challenged another node, which won clusterName
the previous election, and detected a response to the
Explanation:
challenge.
This is an informational message when a lease request
User response: has not been renewed.
None. Informational message only.
User response:
6027-2717 Node nodeName is already a None. Informational message only.
cluster manager or another node
6027-2725 [N] Node nodeName lease renewal is
is taking over as the cluster
overdue. Pinging to check if it is
manager.
alive
Explanation:
Explanation:
The mmchmgr -c command generates this message if
This is an informational message on the cluster
the specified node is already the cluster manager.
manager when a lease request has not been renewed.
User response:
User response:
None. Informational message only.
None. Informational message only.
6027-2718 Incorrect port range:
6027-2726 [I] Recovered number nodes for file
GPFSCMDPORTRANGE='range'.
system fileSystem.
Using default.
Explanation:
Explanation:
The asynchronous part (phase 2) of node failure
The GPFS command port range format is lllll[-hhhhh],
recovery has completed.
where lllll is the low port value and hhhhh is the high
port value. The valid range is 1 to 65535. User response:
None. Informational message only.
User response:
None. Informational message only. 6027-2727 fileSystem: quota manager is not
available.
6027-2719 The files provided do not contain
valid quota entries. Explanation:
An attempt was made to perform a quota command
Explanation:
without a quota manager running. This could be
The quota file provided does not have valid quota
caused by a conflicting offline mmfsck command.
entries.
User response:
User response:
Reissue the command once the conflicting program
Check that the file being restored is a valid GPFS quota
has ended.
file.
6027-2728 [N] Connection from node rejected
6027-2722 [E] Node limit of number has been
because it does not support IPv6
reached. Ignoring nodeName.
Explanation:
Explanation:
A connection request was received from a node that
The number of nodes that have been added to the
does not support Internet Protocol Version 6 (IPv6),
cluster is greater than some cluster members can
and at least one node in the cluster is configured
handle.
with an IPv6 address (not an IPv4-mapped one) as
User response: its primary address. Since the connecting node will not
Delete some nodes from the cluster using the be able to communicate with the IPv6 node, it is not
mmdelnode command, or shut down GPFS on nodes permitted to join the cluster.
that are running older versions of the code with lower
User response:
limits.
Upgrade the connecting node to a version of GPFS
6027-2723 [N] This node (nodeName) is now that supports IPv6, or delete all nodes with IPv6-only
Cluster Manager for clusterName. addresses from the cluster.



6027-2729 Value value for option optionName This could be caused by the change of manager in the
is out of range. Valid values are middle of the operation.
value through value. User response:
Explanation: Retry the operation.
An out of range value was specified for the specified 6027-2736 The value for --block-size must
option.
be the keyword auto or the value
User response: must be of the form nK, nM, nG or
Correct the command line. nT, where n is an optional integer
in the range 1 to 1023.
6027-2730 [E] Node nodeName failed to take over
as cluster manager. Explanation:
An invalid value was specified with the --block-
Explanation: size option.
An attempt to takeover as cluster manager failed.
User response:
User response: Reissue the command with a valid option.
Make sure that GPFS is running on a sufficient number
of quorum nodes. 6027-2737 Editing quota limits for root fileset
is not permitted.
6027-2731 Failed to locate a working cluster
manager. Explanation:
The root fileset was specified for quota limits editing in
Explanation: the mmedquota command.
The cluster manager has failed or changed. The new
cluster manager has not been appointed. User response:
Specify a non-root fileset in the mmedquota
User response: command. Editing quota limits for the root fileset is
Check the internode communication configuration and prohibited.
ensure enough GPFS nodes are up to make a quorum.
6027-2738 Editing quota limits for the root
6027-2732 Attention: No data disks remain user is not permitted
in the system pool. Use
mmapplypolicy to migrate all Explanation:
data left in the system pool to The root user was specified for quota limits editing in
other storage pool. the mmedquota command.
Explanation: User response:
The mmchdisk command has been issued but no data Specify a valid user or group in the mmedquota
disks remain in the system pool. Warn user to use command. Editing quota limits for the root user or
mmapplypolicy to move data to other storage pool. system group is prohibited.
User response: 6027-2739 Editing quota limits for groupName
None. Informational message only. group not permitted.
6027-2734 [E] Disk failure from node nodeName Explanation:
Volume name. Physical volume The system group was specified for quota limits
name. editing in the mmedquota command.
Explanation: User response:
An I/O request to a disk or a request to fence a disk Specify a valid user or group in the mmedquota
has failed in such a manner that GPFS can no longer command. Editing quota limits for the root user or
use the disk. system group is prohibited.
User response: 6027-2740 [I] Starting new election as previous
Check the disk hardware and the software subsystems clmgr is expelled
in the path to the disk. Explanation:
6027-2735 [E] Not a manager This node is taking over as clmgr without challenge as
the old clmgr is being expelled.
Explanation:
This node is not a manager or no longer a manager User response:
of the type required to proceed with the operation. None. Informational message only.



6027-2741 [W] This node cannot continue to be 6027-2747 [E] Inconsistency detected between
cluster manager the local node number retrieved
from 'mmsdrfs' (nodeNumber) and
Explanation:
the node number retrieved from
This node invoked the user-specified callback handler
'mmfs.cfg' (nodeNumber).
for event tiebreakerCheck and it returned a non-
zero value. This node cannot continue to be the cluster Explanation:
manager. The node number retrieved by obtaining the list of
nodes in the mmsdrfs file did not match the node
User response:
number contained in mmfs.cfg. There may have been
None. Informational message only.
a recent change in the IP addresses being used by
6027-2742 [I] CallExitScript: exit script exitScript network interfaces configured at the node.
on event eventName returned code
User response:
returnCode, quorumloss.
Stop and restart GPFS daemon.
Explanation:
6027-2748 Terminating because a conflicting
This node invoked the user-specified callback handler
program on the same inode space
for the tiebreakerCheck event and it returned a
inodeSpace is running.
non-zero value. The user-specified action with the
error is quorumloss. Explanation:
A program detected that it must terminate because a
User response:
conflicting program is running.
None. Informational message only.
User response:
6027-2743 Permission denied.
Reissue the command after the conflicting program
Explanation: ends.
The command is invoked by an unauthorized user.
6027-2749 Specified locality group 'number'
User response: does not match disk 'name'
Retry the command with an authorized user. locality group 'number'. To
change locality groups in an
6027-2744 [D] Invoking tiebreaker callback script
SNC environment, please use
Explanation: the mmdeldisk and mmadddisk
The node is invoking the callback script due to change commands.
in quorum membership.
Explanation:
User response: The locality group specified on the mmchdisk
None. Informational message only. command does not match the current locality group
of the disk.
6027-2745 [E] File system is not mounted.
User response:
Explanation:
To change locality groups in an SNC environment, use
A command was issued, which requires that the file
the mmdeldisk and mmadddisk commands.
system be mounted.
6027-2750 [I] Node NodeName is now the Group
User response:
Leader.
Mount the file system and reissue the command.
Explanation:
6027-2746 [E] Too many disks unavailable for
A new cluster Group Leader has been assigned.
this server to continue serving a
RecoveryGroup. User response:
None. Informational message only.
Explanation:
RecoveryGroup panic: Too many disks unavailable to 6027-2751 [I] Starting new election: Last
continue serving this RecoveryGroup. This server will elected: NodeNumber Sequence:
resign, and failover to an alternate server will be SequenceNumber
attempted.
Explanation:
User response: A new disk election will be started. The disk challenge
Ensure the alternate server took over. Determine what will be skipped since the last elected node was either
caused this event and address the situation. Prior none or the local node.
messages may help determine the cause of the event.
User response:



None. Informational message only. Explanation
6027-2752 [I] This node got elected. Sequence: Users set too low quota limits. It will cause
SequenceNumber unexpected quota behavior. MinQuotaLimit is
computed through:
Explanation:
Local node got elected in the disk election. This node 1. for block: QUOTA_THRESHOLD *
will become the cluster manager. MIN_SHARE_BLOCKS * subblocksize
2. for inode: QUOTA_THRESHOLD *
User response:
MIN_SHARE_INODES
None. Informational message only.
User response:
6027-2753 [N] Responding to disk challenge:
Users should reset quota limits so that they are more
response: ResponseValue. Error
than MinQuotaLimit. It is just a warning. Quota limits
code: ErrorCode.
will be set anyway.
Explanation:
6027-2757 [E] The peer snapshot is in progress.
A disk challenge has been received, indicating that
Queue cannot be flushed now.
another node is attempting to become a Cluster
Manager. Issuing a challenge response, to confirm the Explanation:
local node is still alive and will remain the Cluster The Peer Snapshot is in progress. Queue cannot be
Manager. flushed now.
User response: User response:
None. Informational message only. Reissue the command once the peer snapshot has
ended.
6027-2754 [X] Challenge thread did not respond
to challenge in time: took 6027-2758 [E] The AFM target does not support
TimeIntervalSecs seconds. this operation. Run mmafmconfig
on the AFM target cluster.
Explanation:
Challenge thread took too long to respond to a disk Explanation:
challenge. Challenge thread will exit, which will result The .afmctl file is probably not present on the AFM
in the local node losing quorum. target cluster.
User response: User response:
None. Informational message only. Run mmafmconfig on the AFM target cluster to
configure the AFM target cluster.
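For message 6027-2758, the AFM control file is created by enabling the export path at the home (target) cluster, along the following lines; the export path is illustrative:

    mmafmconfig enable /gpfs/fs1/export    # run on the home cluster against the exported path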
6027-2755 [N] Another node committed
disk election with sequence 6027-2759 [N] Disk lease period expired in
CommittedSequenceNumber cluster ClusterName. Attempting
(our sequence was to reacquire lease.
OurSequenceNumber).
Explanation:
Explanation: The disk lease period expired, which will prevent the
Another node committed a disk election with a local node from being able to perform disk I/O. This
sequence number higher than the one used when can be caused by a temporary communication outage.
this node used to commit an election in the past.
User response:
This means that the other node has become, or is
If message is repeated then the communication
becoming, a Cluster Manager. To avoid having two
outage should be investigated.
Cluster Managers, this node will lose quorum.
6027-2760 [N] Disk lease reacquired in cluster
User response:
ClusterName.
None. Informational message only.
Explanation:
6027-2756 Attention: In file
The disk lease has been reacquired, and disk I/O will
system FileSystemName,
be resumed.
FileSetName (Default)
QuotaLimitType(QuotaLimit) User response:
for QuotaTypeUserName/ None. Informational message only.
GroupName/FilesetName is too
6027-2761 Unable to run command on
small. Suggest setting it higher
'fileSystem' while the file system is
than minQuotaLimit.
mounted in restricted mode.



Explanation: User has specified a callback script that is invoked
A command that can alter data in a file system was whenever a decision is about to be taken on what node
issued while the file system was mounted in restricted should be expelled from the active cluster. As a result
mode. of the execution of the script, GPFS will reverse its
decision on what node to expel.
User response:
Mount the file system in read-only or read-write mode User response:
or unmount the file system and then reissue the None.
command.
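For message 6027-2761, one way to move the file system out of restricted mode before reissuing the command is to unmount and remount it; the file system name is illustrative:

    mmumount fs1 -a    # unmount the file system on all nodes
    mmmount fs1 -a     # mount it again in normal read-write mode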
6027-2767 [E] Error errorNumber while accessing
6027-2762 Unable to run command on tiebreaker devices.
'fileSystem' while the file system is
Explanation:
suspended.
An error was encountered while reading from or
Explanation: writing to the tiebreaker devices. When such error
A command that can alter data in a file system was happens while the cluster manager is checking for
issued while the file system was suspended. challenges, it will cause the cluster manager to lose
cluster membership.
User response:
Resume the file system and reissue the command. User response:
Verify the health of the tiebreaker devices.
6027-2763 Unable to start command on
'fileSystem' because conflicting 6027-2770 Disk diskName belongs to a write-
program name is running. Waiting affinity enabled storage pool. Its
until it completes. failure group cannot be changed.
Explanation: Explanation:
A program detected that it cannot start because The failure group specified on the mmchdisk
a conflicting program is running. The program will command does not match the current failure group of
automatically start once the conflicting program has the disk.
ended as long as there are no other conflicting
User response:
programs running at that time.
Use the mmdeldisk and mmadddisk commands to
User response: change failure groups in a write-affinity enabled
None. Informational message only. storage pool.
6027-2764 Terminating command on 6027-2771 fileSystem: Default per-fileset
fileSystem because a conflicting quotas are disabled for quotaType.
program name is running.
Explanation:
Explanation: A command was issued to modify default fileset-level
A program detected that it must terminate because a quota, but default quotas are not enabled.
conflicting program is running.
User response:
User response: Ensure the --perfileset-quota option is in effect
Reissue the command after the conflicting program for the file system, then use the mmdefquotaon
ends. command to enable default fileset-level quotas. After
default quotas are enabled, issue the failed command
6027-2765 command on 'fileSystem' is
again.
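For message 6027-2771, a possible sequence is to confirm that per-fileset quotas are in effect and then enable default fileset-level quotas; the file system name is illustrative, and the exact mmdefquotaon options for your configuration should be checked in the command reference:

    mmchfs fs1 --perfileset-quota    # ensure per-fileset quota is in effect for the file system
    mmdefquotaon fs1                 # enable default quotas, then reissue the failed command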
finished waiting. Processing
continues ... name 6027-2772 Cannot close disk name.
Explanation: Explanation:
A program detected that it can now continue the Could not access the specified disk.
processing since a conflicting program has ended.
User response:
User response: Check the disk hardware and the path to the disk.
None. Informational message only. Refer to “Unable to access disks” on page 411.
6027-2766 [I] User script has chosen to expel 6027-2773 fileSystem:filesetName: default
node nodeName instead of node quota for quotaType is disabled.
nodeName.
Explanation:
Explanation: A command was issued to modify default quota, but
default quota is not enabled.



User response: User response:
Ensure the -Q yes option is in effect for the Check the network connection between this node and
file system, then enable default quota with the the node listed in the message.
mmdefquotaon command.
6027-2779 [E] Challenge thread stopped.
6027-2774 fileSystem: Per-fileset quotas are
Explanation:
not enabled.
A tiebreaker challenge thread stopped because of an
Explanation: error. Cluster membership will be lost.
A command was issued to modify fileset-level quota,
User response:
but per-fileset quota management is not enabled.
Check for additional error messages. File systems will
User response: be unmounted, then the node will rejoin the cluster.
Ensure that the --perfileset-quota option is in
6027-2780 [E] Not enough quorum nodes
effect for the file system and reissue the command.
reachable: reachableNodes.
6027-2775 Storage pool named poolName
Explanation:
does not exist.
The cluster manager cannot reach a sufficient number
Explanation: of quorum nodes, and therefore must resign to prevent
The mmlspool command was issued, but the cluster partitioning.
specified storage pool does not exist.
User response:
User response: Determine if there is a network outage or if too many
Correct the input and reissue the command. nodes have failed.
6027-2776 Attention: A disk being stopped 6027-2781 [E] Lease expired for
reduces the degree of system numSecs seconds
metadata replication (value) or (shutdownOnLeaseExpiry).
data replication (value) to lower
Explanation:
than tolerable.
Disk lease expired for too long, which results in the
Explanation: node losing cluster membership.
The mmchdisk stop command was issued, but the
User response:
disk cannot be stopped because of the current file
None. The node will attempt to rejoin the cluster.
system metadata and data replication factors.
6027-2782 [E] This node is being expelled from
User response:
the cluster.
Make more disks available, delete unavailable disks,
or change the file system metadata replication Explanation:
factor. Also check the current value of the This node received a message instructing it to leave
unmountOnDiskFail configuration parameter. the cluster, which might indicate communication
problems between this node and some other node in
6027-2777 [E] Node nodeName is being expelled
the cluster.
because of an expired lease. Pings
sent: pingsSent. Replies received: User response:
pingRepliesReceived. None. The node will attempt to rejoin the cluster.
Explanation: 6027-2783 [E] New leader elected with a higher
The node listed did not renew its lease in a timely ballot number.
fashion and is being expelled from the cluster.
Explanation:
User response: A new group leader was elected with a higher ballot
Check the network connection between this node and number, and this node is no longer the leader.
the node listed in the message. Therefore, this node must leave the cluster and rejoin.
6027-2778 [I] Node nodeName: ping timed out. User response:
Pings sent: pingsSent. Replies None. The node will attempt to rejoin the cluster.
received: pingRepliesReceived.
6027-2784 [E] No longer a cluster manager or
Explanation: lost quorum while running a group
Ping timed out for the node listed, which should be the protocol.
cluster manager. A new cluster manager will be chosen
Explanation:
while the current cluster manager is expelled from the
cluster.



Cluster manager no longer maintains quorum after in the node losing cluster membership and then
attempting to run a group protocol, which might attempting to rejoin the cluster.
indicate a network outage or node failures.
User response:
User response: None. The node will attempt to rejoin the cluster.
None. The node will attempt to rejoin the cluster.
6027-2790 Attention: Disk parameters were
6027-2785 [X] A severe error was encountered changed. Use the mmrestripefs
during cluster probe. command with the -r option to
relocate data and metadata.
Explanation:
A severe error was encountered while running the Explanation:
cluster probe to determine the state of the nodes in The mmchdisk command with the change option was
the cluster. issued.
User response: User response:
Examine additional error messages. The node will Issue the mmrestripefs -r command to relocate
attempt to rejoin the cluster. data and metadata.
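For message 6027-2790, the restripe that relocates data and metadata after a disk parameter change is of the following form; the file system name is illustrative:

    mmrestripefs fs1 -r    # relocate data and metadata according to the changed disk parameters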
6027-2786 [E] Unable to contact any quorum 6027-2791 Disk diskName does not belong to
nodes during cluster probe. file system deviceName.
Explanation: Explanation:
This node has been unable to contact any quorum The input disk name does not belong to the specified
nodes during cluster probe, which might indicate a file system.
network outage or too many quorum node failures.
User response:
User response: Correct the command line.
Determine whether there was a network outage or
6027-2792 The current file system version
whether quorum nodes failed.
does not support default per-
6027-2787 [E] Unable to contact enough other fileset quotas.
quorum nodes during cluster
Explanation:
probe.
The current version of the file system does not support
Explanation: default fileset-level quotas.
This node, a quorum node, was unable to contact
User response:
a sufficient number of quorum nodes during cluster
Use the mmchfs -V command to activate the new
probe, which might indicate a network outage or too
function.
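For message 6027-2792, the current file system format level can be checked and then raised, for example as follows (file system name illustrative); note that raising the format level cannot be undone:

    mmlsfs fs1 -V          # show the current file system format version
    mmchfs fs1 -V full     # enable new functions, including default per-fileset quotas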
many quorum node failures.
6027-2793 [E] Contents of local fileName file are
User response:
invalid. Node may be unable to be
Determine whether there was a network outage or
elected group leader.
whether quorum nodes failed.
Explanation:
6027-2788 [E] Attempt to run leader election
In an environment where tie-breaker disks are used,
failed with error errorNumber.
the contents of the ballot file have become invalid,
Explanation: possibly because the file has been overwritten by
This node attempted to run a group leader election another application. This node will be unable to be
but failed to get elected. This failure might indicate elected group leader.
that two or more quorum nodes attempted to run the
User response:
election at the same time. As a result, this node will
Run mmcommon resetTiebreaker, which will
lose cluster membership and then attempt to rejoin
ensure the GPFS daemon is down on all quorum nodes
the cluster.
and then remove the given file on this node. After that,
User response: restart the cluster on this and on the other nodes.
None. The node will attempt to rejoin the cluster.
6027-2794 [E] Invalid content of disk paxos
6027-2789 [E] Tiebreaker script returned a non- sector for disk diskName.
zero value.
Explanation:
Explanation: In an environment where tie-breaker disks are used,
The tiebreaker script, invoked during group leader the contents of either one of the tie-breaker disks or
election, returned a non-zero value, which results



the ballot files became invalid, possibly because the User response:
file has been overwritten by another application. Do not specify these two options together.
User response: 6027-2800 Available memory exceeded on
Examine mmfs.log file on all quorum nodes for request to allocate number bytes.
indication of a corrupted ballot file. If 6027-2793 Trace point sourceFile-tracePoint.
is found then follow instructions for that message.
Explanation:
If problem cannot be resolved, shut down GPFS
The available memory was exceeded during an
across the cluster, undefine, and then redefine the
allocation request made from the cited source file and
tiebreakerdisks configuration variable, and finally
trace point.
restart the cluster.
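For message 6027-2794, a rough outline of clearing and redefining the tiebreaker disk configuration is shown below; the NSD names are illustrative, and the exact attribute syntax should be verified against the mmchconfig description for your release:

    mmshutdown -a                                 # stop GPFS on all nodes in the cluster
    mmchconfig tiebreakerDisks=no                 # remove the current tiebreaker disk definition
    mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"   # redefine the tiebreaker disks
    mmstartup -a                                  # restart GPFS across the cluster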
User response:
6027-2795 An error occurred while executing
Try shutting down and then restarting GPFS. If the
command for fileSystem.
problem recurs, contact the IBM Support Center.
Explanation:
6027-2801 Policy set syntax version
A quota command encountered a problem on a
versionString not supported.
file system. Processing continues with the next file
system. Explanation:
The policy rules do not comply with the supported
User response:
syntax.
None. Informational message only.
User response:
6027-2796 [W] Callback event eventName is
Rewrite the policy rules, following the documented,
not supported on this node;
supported syntax and keywords.
processing continues ...
6027-2802 Object name
Explanation:
'poolName_or_filesetName' is not
informational
valid.
User response:
Explanation:
6027-2797 [I] Node nodeName: lease request The cited name is not a valid GPFS object, names an
received late. Pings sent: object that is not valid in this context, or names an
pingsSent. Maximum pings missed: object that no longer exists.
maxPingsMissed.
User response:
Explanation: Correct the input to identify a GPFS object that exists
The cluster manager reports that the lease request and is valid in this context.
from the given node was received late, possibly
6027-2803 Policy set must start with
indicating a network outage.
VERSION.
User response:
Explanation:
Check the network connection between this node and
The policy set does not begin with VERSION as
the node listed in the message.
required.
6027-2798 [E] The node nodeName does not have
User response:
gpfs.ext package installed to run
Rewrite the policy rules, following the documented,
the requested command.
supported syntax and keywords.
Explanation:
6027-2804 Unexpected SQL result code -
The file system manager node does not have gpfs.ext
sqlResultCode.
package installed properly to run ILM, AFM, or CNFS
commands. Explanation:
This could be an IBM programming error.
User response:
Make sure gpfs.ext package is installed correctly on User response:
file system manager node and try again. Check that your SQL expressions are correct and
supported by the current release of GPFS. If the error
6027-2799 Option 'option' is incompatible
recurs, contact the IBM Support Center.
with option 'option'.
6027-2805 [I] Loaded policy 'policyFileName
Explanation:
or filesystemName':
The options specified on the command are
summaryOfPolicyRules
incompatible.



Explanation: 6027-2810 [W] There are numberOfPools storage
The specified loaded policy has the specified policy pools but the policy file is missing
rules. or empty.
User response: Explanation:
None. Informational message only. The cited number of storage pools are defined, but the
policy file is missing or empty.
6027-2806 [E] Error while validating
policy 'policyFileName or User response:
filesystemName': rc=errorCode: You should probably install a policy with placement
errorDetailsString rules using the mmchpolicy command, so that at
least some of your data will be stored in your
Explanation:
nonsystem storage pools.
An error occurred while validating the specified policy.
6027-2811 Policy has no storage pool
User response:
placement rules!
Correct the policy rules, heeding the error details in
this message and other messages issued immediately Explanation:
before or after this message. Use the mmchpolicy The policy has no storage pool placement rules.
command to install a corrected policy rules file.
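For message 6027-2806, a corrected policy file can be validated before it is installed; the file system and file names are illustrative:

    mmchpolicy fs1 policy.rules -I test    # parse and validate the rules without installing them
    mmchpolicy fs1 policy.rules            # install the corrected policy rules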
User response:
6027-2807 [W] Error in evaluation of placement You should probably install a policy with placement
policy for file fileName: rules using the mmchpolicy command, so that at
errorDetailsString least some of your data will be stored in your
nonsystem storage pools.
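For messages 6027-2810 and 6027-2811, a minimal policy rules file (for example /tmp/placement.rules) might contain just the following; the rule and pool names are illustrative:

    /* place all new user files in the 'data1' storage pool */
    RULE 'default' SET POOL 'data1'

It is then installed with the mmchpolicy command shown above, for example mmchpolicy fs1 /tmp/placement.rules.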
Explanation:
An error occurred while evaluating the installed 6027-2812 Keyword 'keywordValue' begins a
placement policy for a particular new file. Although the second clauseName clause - only
policy rules appeared to be syntactically correct when one is allowed.
the policy was installed, evidently there is a problem
Explanation:
when certain values of file attributes occur at runtime.
The policy rule should only have one clause of the
User response: indicated type.
Determine which file names and attributes trigger
User response:
this error. Correct the policy rules, heeding the error
Correct the rule and reissue the policy command.
details in this message and other messages issued
immediately before or after this message. Use the 6027-2813 This 'ruleName' rule is missing a
mmchpolicy command to install a corrected policy clauseType required clause.
rules file.
Explanation:
6027-2808 In rule 'ruleName' (ruleNumber), The policy rule must have a clause of the indicated
'wouldBePoolName' is not a valid type.
pool name.
User response:
Explanation: Correct the rule and reissue the policy command.
The cited name that appeared in the cited rule is not a
6027-2814 This 'ruleName' rule is of unknown
valid pool name. This may be because the cited name
was misspelled or removed from the file system. type or not supported.

User response: Explanation:


Correct or remove the rule. The policy rule set seems to have a rule of an unknown
type or a rule that is unsupported by the current
6027-2809 Validated policy 'policyFileName release of GPFS.
or filesystemName':
User response:
summaryOfPolicyRules
Correct the rule and reissue the policy command.
Explanation:
6027-2815 The value 'value' is not supported
The specified validated policy has the specified policy
rules. in a 'clauseType' clause.
Explanation:
User response:
The policy rule clause seems to specify an
None. Informational message only.
unsupported argument or value that is not supported
by the current release of GPFS.



User response: Or:
Correct the rule and reissue the policy command.
Correct the macro definitions in your policy rules file.
6027-2816 Policy rules employ features that
If the problem persists, contact the IBM Support
would require a file system
Center.
upgrade.
6027-2819 Error opening temp file
Explanation:
temp_file_name: errorString
One or more policy rules have been written to use new
features that cannot be installed on a back-level file Explanation:
system. An error occurred while attempting to open the
specified temporary work file.
User response:
Install the latest GPFS software on all nodes and User response:
upgrade the file system or change your rules. (Note Check that the path name is defined and accessible.
that LIMIT was introduced in GPFS Release 3.2.) Check the file and then reissue the command.
6027-2817 Error on popen/pclose 6027-2820 Error reading temp file
(command_string): temp_file_name: errorString
rc=return_code_from_popen_or_pcl
Explanation:
ose
An error occurred while attempting to read the
Explanation: specified temporary work file.
The execution of the command_string by popen/
User response:
pclose resulted in an error.
Check that the path name is defined and accessible.
Check the file and then reissue the command.
User response
To correct the error, do one or more of the following: 6027-2821 Rule 'ruleName' (ruleNumber)
specifies a THRESHOLD
Check that the standard m4 macro processing for EXTERNAL POOL
command is installed on your system 'externalPoolName'. This is not
as /usr/bin/m4. supported.
Or: Explanation:
Set the MM_M4_CMD environment variable. GPFS does not support the THRESHOLD clause within a
migrate rule that names an external pool in the FROM
Or: POOL clause.
Correct the macro definitions in your policy rules file. User response:
If the problem persists, contact the IBM Support Correct or remove the rule.
Center. 6027-2822 This file system does not support
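For messages 6027-2817 and 6027-2818, it can help to confirm that the m4 macro processor is present and, if it lives elsewhere, to point the policy commands at it through the MM_M4_CMD environment variable; the path is illustrative:

    ls -l /usr/bin/m4                # confirm the m4 macro processor is installed
    export MM_M4_CMD=/usr/bin/m4     # tell the policy commands which m4 binary to use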
fast extended attributes, which
6027-2818 A problem occurred during m4
are needed for encryption.
processing of policy rules. rc =
return_code_from_popen_pclose_o Explanation:
r_m4 Fast extended attributes need to be supported by the
file system for encryption to be activated.
Explanation:
An attempt to expand the policy rules with an m4 User response:
subprocess yielded some warnings or errors or the m4 Enable the fast extended attributes feature in this file
macro wrote some output to standard error. Details or system.
related messages may follow this message.
6027-2823 [E] Encryption activated in the file
system, but node not enabled for
User response encryption.
To correct the error, do one or more of the following:
Explanation:
Check that the standard m4 macro processing The file system is enabled for encryption, but this node
command is installed on your system is not.
as /usr/bin/m4.
User response:
Or:
Set the MM_M4_CMD environment variable.



Ensure the GPFS encryption packages are installed. 6027-2951 [W] Value value for worker1Threads
Verify if encryption is supported on this node must be <= than the original
architecture. setting value
6027-2824 This file system version does not Explanation:
support encryption rules. An attempt to dynamically set worker1Threads
found the value out of range. The dynamic value must
Explanation:
be 2 <= value <= the original setting when the GPFS
This file system version does not support encryption.
daemon was started.
User response:
6027-2952 [E] Unknown assert class
Update the file system to a version which supports
'assertClass'.
encryption.
Explanation:
6027-2825 Duplicate encryption set name
The assert class is not recognized.
'setName'.
User response:
Explanation:
Specify a valid assert class.
The given set name is duplicated in the policy file.
6027-2953 [E] Non-numeric assert value 'value'
User response:
after class 'class'.
Ensure each set name appears only once in the policy
file. Explanation:
The specified assert value is not recognized.
6027-2826 The encryption set 'setName'
requested by rule 'rule' could not User response:
be found. Specify a valid assert integer value.
Explanation: 6027-2954 [E] Assert value 'value' after class
The given set name used in the rule cannot be found. 'class' must be from 0 to 127.
User response: Explanation:
Verify if the set name is correct. Add the given set if it The specified assert value is not recognized.
is missing from the policy.
User response:
6027-2827 [E] Error in evaluation of encryption Specify a valid assert integer value.
policy for file fileName: %s
6027-2955 [W] Time-of-day may have jumped
Explanation: back. Late by delaySeconds
An error occurred while evaluating the encryption rules seconds to wake certain threads.
in the given policy file.
Explanation:
User response: Time-of-day may have jumped back, which has
Examine the other error messages produced while resulted in some threads being awakened later than
evaluating the policy file. expected. It is also possible that some other factor has
caused a delay in waking up the threads.
6027-2828 [E] Encryption not supported on
Windows. Encrypted file systems User response:
are not allowed when Windows Verify if there is any problem with network time
nodes are present in the cluster. synchronization, or if time-of-day is being incorrectly
set.
Explanation:
Self-explanatory. 6027-2956 [E] Invalid crypto engine type
User response: (encryptionCryptoEngineType):
To activate encryption, ensure there are no Windows cryptoEngineType.
nodes in the cluster. Explanation:
6027-2950 [E] The specified value for
Trace value 'value' after class
encryptionCryptoEngineType is incorrect.
'class' must be from 0 to 14.
User response:
Explanation:
Specify a valid value for
The specified trace value is not recognized.
encryptionCryptoEngineType.
User response:
Specify a valid trace integer value.



6027-2957 [E] Invalid cluster manager selection 6027-2962 [X] The daemon memory configuration
choice (clusterManagerSelection): hard floor of name1 MiB is above
clusterManagerSelection. the system memory configuration
of name2 MiB. GPFS is shutting
Explanation:
down. Ensure the system has
The specified value for clusterManagerSelection
enough memory for the daemon to
is incorrect.
run.
User response:
Explanation:
Specify a valid value for
The physical or system memory available to GPFS is
clusterManagerSelection.
below the hard limit for GPFS to function. GPFS cannot
6027-2958 [E] Invalid NIST compliance continue and will shut down.
type (nistCompliance):
User response:
nistComplianceValue.
Configure more memory on this node. The FAQ
Explanation: specifies minimum memory configurations.
The specified value for nistCompliance is incorrect.
6027-2963 [E] The Daemon memory
User response: configuration warning floor of
Specify a valid value for nistCompliance. name1 MiB is above the system
memory configuration of name2
6027-2959 [E] The CPU architecture on this
MiB. GPFS may be unstable.
node does not support tracing
in traceMode mode. Switching to Explanation:
traceMode mode. The physical or system memory available to GPFS is
below the warning limit for GPFS to function well.
Explanation:
GPFS will start up but may be unstable.
The CPU does not have constant time stamp counter
capability required for overwrite trace mode. The trace User response:
has been enabled in blocking mode. Configure more memory on this node. The FAQ
specifies minimum memory configurations.
User response:
Update configuration parameters to use trace facility 6027-2964 [E] Daemon failed to allocate kernel
in blocking mode or replace this node with modern memory for fast condvar feature.
CPU architecture. Available memory is too small or
fragmented. Consider clearing the
6027-2960 [W] Unable to establish a session
pagecache.
with Active Directory server
for the domain 'domainServer'. Explanation:
ID mapping through Microsoft The physical or system memory available to GPFS was
Identity Management for Unix will too small or fragmented to allocate memory for the
be unavailable. fast condvar feature. GPFS cannot continue and will
shut down.
Explanation:
GPFS tried to establish an LDAP session with the User response:
specified Active Directory server but was unable to do Configure more memory on this node. Consider
so. clearing the pagecache.
User response: 6027-3101 Pdisk rotation rate invalid in
Ensure that the specified domain controller is option 'option'.
available.
Explanation:
6027-2961 [I] Established a session with Active When parsing disk lists, the pdisk rotation rate is not
Directory server for the domain valid.
'domainServer'.
User response:
Explanation: Specify a valid rotation rate (SSD, NVRAM, or 1025
GPFS was able to successfully establish an LDAP through 65535).
session with the specified Active Directory server.
6027-3102 Pdisk FRU number too long in
User response: option 'option', maximum length
None. length.
Explanation:



When parsing disk lists, the pdisk FRU number is too User response:
long. Remove duplicate MSG_PARSE_DUPNAME which is
not documented.
User response:
Specify a valid FRU number that is shorter than or 6027-3109 [E] NUMA failed to load library
equal to the maximum length. numaLibraryName; GPFS NUMA
awareness is not available. Install
6027-3103 Pdisk location too long in option
the 'numactl' package for your
'option', maximum length length.
platform.
Explanation:
Explanation:
When parsing disk lists, the pdisk location is too long.
The 'numactl' package is required by the GPFS NUMA
User response: features.
Specify a valid location that is shorter than or equal to
User response:
the maximum length.
Install the 'numactl' package for your platform.
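For message 6027-3109, the numactl package can be checked and installed, and the reported NUMA topology verified; the package-manager commands below assume an RPM-based distribution and are illustrative:

    rpm -q numactl        # check whether the numactl package is already installed
    dnf install numactl   # install it if it is missing (use your platform's package manager)
    numactl --hardware    # confirm that the NUMA topology is reported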
6027-3104 Pdisk failure domains too long in
6027-3110 [E] GPFS NUMA subsystem
option 'name1name2', maximum
initialization failed. NUMA
length name3.
awareness is disabled and is not
Explanation: available. See prior log messages
When parsing disk lists, the pdisk failure domains are for more information.
too long.
Explanation:
User response: The GPFS NUMA subsystem initialization failed.
Specify valid failure domains, shorter than the
User response:
maximum.
See previous log messages for more information.
6027-3105 Pdisk nPathActive invalid in option
6027-3111 [W] NUMA BIOS platform support for
'option'.
NUMA is disabled or not available.
Explanation: GPFS NUMA awareness is limited
When parsing disk lists, the nPathActive value is not or not available. This is a platform
valid. issue and is commonly seen with
VMs, LPARs, and containers.
User response:
Specify a valid nPathActive value (0 to 255). Explanation:
The platform does not support reporting NUMA
6027-3106 Pdisk nPathTotal invalid in option
information through libnuma.
'option'.
User response:
Explanation:
This is common with VMs. Ensure that your platform
When parsing disk lists, the nPathTotal value is not
firmware and 'numactl' package are current. Contact
valid.
your hardware vendor.
User response:
6027-3112 [E] numaApiName is unsuccessful.
Specify a valid nPathTotal value (0 to 255).
Return code: returnCode, errno
6027-3107 Pdisk nsdFormatVersion invalid in errnoValue.
option 'name1name2'.
Explanation:
Explanation: NUMA API is unsuccessful. Probable causes can be an
The nsdFormatVersion that is entered while parsing insufficient CPU mask buffer as indicated in the NUMA
the disk is invalid. API documentation.
User response: User response:
Specify valid nsdFormatVersion, 1 or 2. Refer to the NUMA API documentation. Contact the
IBM Support Center.
6027-3108 Declustered array name name1
appears more than once in the 6027-3113 [E] RDMA thread threadId for
declustered array stanzas. numaApiName is unsuccessful.
Return code: returnCode, errno
Explanation:
errnoValue.
when parsing declustered array lists a duplicate name
is found. Explanation:



NUMA affinity API is unsuccessful. Probable causes 6027-3205 AFM: Failed to get xattrs for inode
can be an invalid thread ID, CPU mask, memory inodeNum, ignoring.
address, or privileges as indicated in the NUMA API
Explanation:
documentation.
Getting extended attributes on an inode failed.
User response:
User response:
Refer to the NUMA API documentation. Contact the
None.
IBM Support Center.
6027-3209 Home NFS mount of host:path
6027-3114 [W] NUMA reporting is not supported
failed with error err
by this platform. GPFS NUMA
awareness is not available. Explanation:
NFS mounting of path from the home cluster failed.
Explanation:
This platform does not report NUMA information. GPFS User response:
NUMA awareness is not available. Make sure the exported path can be mounted over
NFSv3.
User response:
None. 6027-3210 Cannot find AFM control file for
fileset filesetName in the exported
6027-3115 [E] numaApiName is unsuccessful.
file system at home. ACLs and
Explanation: extended attributes will not be
NUMA API is unsuccessful. synchronized. Sparse files will
User response: have zeros written for holes.
Refer to the NUMA API documentation. Contact the Explanation:
IBM Support Center. Either home path does not belong to GPFS, or the AFM
control file is not present in the exported path.
6027-3200 AFM ERROR: command
pCacheCmd fileset filesetName User response:
fileids If the exported path belongs to a GPFS file system, run
[parentId.childId.tParentId.targetI the mmafmconfig command with the enable option
d,ReqCmd] original error oerr on the export path at home.
application error aerr remote error
6027-3211 Change in home export detected.
remoteError
Caching will be disabled.
Explanation:
Explanation:
AFM operations on a particular file failed.
A change in home export was detected or the home
User response: path is stale.
For asynchronous operations that are requeued, run
User response:
the mmafmctl command with the resumeRequeued
Ensure the exported path is accessible.
option after fixing the problem at the home cluster.
6027-3212 AFM ERROR: Cannot enable AFM
6027-3201 AFM ERROR DETAILS: type:
for fileset filesetName (error err)
remoteCmdType snapshot name
snapshotName snapshot ID Explanation:
snapshotId AFM was not enabled for the fileset because the root
file handle was modified, or the remote path is stale.
Explanation:
Peer snapshot creation or deletion failed. User response:
Ensure the remote export path is accessible for NFS
User response:
mount.
Fix snapshot creation or deletion error.
6027-3213 Cannot find snapshot link
6027-3204 AFM: Failed to set xattr on inode
directory name for exported
inodeNum error err, ignoring.
file system at home for fileset
Explanation: filesetName. Snapshot directory at
Setting extended attributes on an inode failed. home will be cached.
User response: Explanation:
None. Unable to determine the snapshot directory at the
home cluster.



User response: User response:
None. Make sure the exported path can be mounted over
NFSv3.
6027-3214 [E] AFM: Unexpiration of fileset
filesetName failed with error 6027-3221 AFM: Home NFS mount
err. Use mmafmctl to manually of host:path succeeded for
unexpire the fileset. file system fileSystem fileset
filesetName. Caching is enabled.
Explanation:
Unexpiration of fileset failed after a home reconnect. Explanation:
NFS mount of the path from the home cluster
User response:
succeeded. Caching is enabled.
Run the mmafmctl command with the unexpire
option on the fileset. User response:
None.
6027-3215 [W] AFM: Peer snapshot delayed due
to long running execution of 6027-3224 [I] AFM: Failed to set extended
operation to remote cluster for attributes on file system fileSystem
fileset filesetName. Peer snapshot inode inodeNum error err, ignoring.
continuing to wait.
Explanation:
Explanation: Setting extended attributes on an inode failed.
Peer snapshot command timed out waiting to flush
User response:
messages.
None.
User response:
6027-3225 [I] AFM: Failed to get extended
None.
attributes for file system
6027-3216 Fileset filesetName encountered fileSystem inode inodeNum,
an error synchronizing with ignoring.
the remote cluster. Cannot
Explanation:
synchronize with the remote
Getting extended attributes on an inode failed.
cluster until AFM recovery is
executed. User response:
None.
Explanation:
Cache failed to synchronize with home because 6027-3226 [I] AFM: Cannot find control file
of an out of memory or conflict error. Recovery, for file system fileSystem fileset
resynchronization, or both will be performed by GPFS filesetName in the exported file
to synchronize cache with the home. system at home. ACLs and
extended attributes will not be
User response:
synchronized. Sparse files will
None.
have zeros written for holes.
6027-3217 AFM ERROR Unable to unmount
Explanation:
NFS export for fileset filesetName
Either the home path does not belong to GPFS, or the
Explanation: AFM control file is not present in the exported path.
NFS unmount of the path failed.
User response:
User response: If the exported path belongs to a GPFS file system, run
None. the mmafmconfig command with the enable option
on the export path at home.
6027-3220 AFM: Home NFS mount of
host:path failed with error err 6027-3227 [E] AFM: Cannot enable AFM for
for file system fileSystem fileset file system fileSystem fileset
id filesetName. Caching will be filesetName (error err)
disabled and the mount will be
Explanation:
tried again after mountRetryTime
AFM was not enabled for the fileset because the root
seconds, on next request to
file handle was modified, or the remote path is stale.
gateway
User response:
Explanation:
Ensure the remote export path is accessible for NFS
NFS mount of the home cluster failed. The mount will
mount.
be tried again after mountRetryTime seconds.



6027-3228 [E] AFM: Unable to unmount NFS 6027-3234 [E] AFM: Unable to start thread to
export for file system fileSystem unexpire filesets.
fileset filesetName
Explanation:
Explanation: Failed to start thread for unexpiration of fileset.
NFS unmount of the path failed.
User response:
User response: None.
None.
6027-3235 [I] AFM: Stopping recovery for the
6027-3229 [E] AFM: File system fileSystem fileset file system fileSystem fileset
filesetName encountered an error filesetName
synchronizing with the remote
Explanation:
cluster. Cannot synchronize with
AFM recovery terminated because the current node is
the remote cluster until AFM
no longer MDS for the fileset.
recovery is executed.
User response:
Explanation:
None.
The cache failed to synchronize with home because
of an out of memory or conflict error. Recovery, 6027-3236 [E] AFM: Recovery on file system
resynchronization, or both will be performed by GPFS fileSystem fileset filesetName
to synchronize the cache with the home. failed with error err. Recovery will
be retried on next access after
User response:
recovery retry interval (timeout
None.
seconds) or manually resolve
6027-3230 [I] AFM: Cannot find snapshot link known problems and recover the
directory name for exported file fileset.
system at home for file system
Explanation:
fileSystem fileset filesetName.
Recovery failed to complete on the fileset. The next
Snapshot directory at home will be
access will restart recovery.
cached.
Explanation:
Explanation:
AFM recovery failed. Fileset will be temporarily put
Unable to determine the snapshot directory at the
into dropped state and will be recovered on accessing
home cluster.
fileset after timeout mentioned in the error message.
User response: User can recover the fileset manually by running
None. mmafmctl command with recover option after
rectifying any known errors leading to failure.
6027-3232 type AFM: pCacheCmd file system
fileSystem fileset filesetName file User response:
IDs None.
[parentId.childId.tParentId.targetI
6027-3239 [E] AFM: Remote command
d,flag] name sourceName origin
remoteCmdType on file
error err
system fileSystem snapshot
Explanation: snapshotName snapshot ID
AFM operations on a particular file failed. snapshotId failed.
User response: Explanation:
For asynchronous operations that are requeued, run A failure occurred when creating or deleting a peer
the mmafmctl command with the resumeRequeued snapshot.
option after fixing the problem at the home cluster.
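For the requeued-operation guidance in messages 6027-3200 and 6027-3232, the fileset state can be checked and the requeued operations resumed once the problem at the home cluster is corrected; the file system and fileset names are illustrative:

    mmafmctl fs1 getstate -j fileset1          # inspect the cache state and queue for the fileset
    mmafmctl fs1 resumeRequeued -j fileset1    # resume the operations that were requeued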
User response:
6027-3233 [I] AFM: Previous error repeated Examine the error details and retry the operation.
repeatNum times.
6027-3240 [E] AFM: pCacheCmd file system
Explanation: fileSystem fileset filesetName file
Multiple AFM operations have failed. IDs
[parentId.childId.tParentId.targetI
User response:
d,flag] error err
None.
Explanation:



Operation failed to execute on home in independent- A mount of the home cluster failed. The mount will be
writer mode. tried again after mountRetryTime seconds.
User response: User response:
None. Verify that the afmTarget can be mounted using the
specified protocol.
6027-3241 [I] AFM: GW queue transfer started
for file system fileSystem fileset 6027-3246 [I] AFM: Prefetch recovery started for
filesetName. Transferring to the file system fileSystem fileset
nodeAddress. filesetName.
Explanation: Explanation:
An old GW initiated the queue transfer because a Prefetch recovery started.
new GW node joined the cluster, and the fileset now
User response:
belongs to the new GW node.
None.
User response:
6027-3247 [I] AFM: Prefetch recovery completed
None.
for the file system fileSystem
6027-3242 [I] AFM: GW queue transfer started fileset filesetName. error error
for file system fileSystem fileset
Explanation:
filesetName. Receiving from
Prefetch recovery completed.
nodeAddress.
User response:
Explanation:
None.
An old MDS initiated the queue transfer because this
node joined the cluster as GW and the fileset now 6027-3248 [E] AFM: Cannot find the control
belongs to this node. file for fileset filesetName in the
exported file system at home.
User response:
This file is required to operate in
None.
primary mode. The fileset will be
6027-3243 [I] AFM: GW queue transfer disabled.
completed for file system
Explanation:
fileSystem fileset filesetName.
Either the home path does not belong to GPFS, or the
error error
AFM control file is not present in the exported path.
Explanation:
User response:
A GW queue transfer completed.
If the exported path belongs to a GPFS file system, run
User response: the mmafmconfig command with the enable option
None. on the export path at home.
6027-3244 [I] AFM: Home mount of afmTarget 6027-3249 [E] AFM: Target for fileset filesetName
succeeded for file system is not a secondary-mode fileset
fileSystem fileset filesetName. or file system. This is required
Caching is enabled. to operate in primary mode. The
fileset will be disabled.
Explanation:
A mount of the path from the home cluster succeeded. Explanation:
Caching is enabled. The AFM target is not a secondary fileset or file
system.
User response:
None. User response:
The AFM target fileset or file system should be
6027-3245 [E] AFM: Home mount of afmTarget
converted to secondary mode.
failed with error error for file
system fileSystem fileset ID 6027-3250 [E] AFM: Refresh intervals cannot be
filesetName. Caching will be set for fileset.
disabled and the mount will be
Explanation:
tried again after mountRetryTime
Refresh intervals are not supported on primary and
seconds, on the next request to
secondary-mode filesets.
the gateway.
User response:
Explanation:
None.



6027-3252 [I] AFM: Home has been User response:
restored for cache filesetName. None.
Synchronization with home will be 6027-3257 [E] AFM: Unable to start thread to
resumed. verify primary filesets for RPO.
Explanation: Explanation:
A change in home export was detected that caused the
Failed to start thread for verification of primary filesets
home to be restored. Synchronization with home will for RPO.
be resumed.
User response:
User response: None.
None.
6027-3258 [I] AFM: AFM is not enabled for
6027-3253 [E] AFM: Change in home is fileset filesetName in filesystem
detected for cache filesetName. fileSystem. Some features will not
Synchronization with home is be supported, see documentation
suspended until the problem is for enabling AFM and unsupported
resolved. features.
Explanation: Explanation:
A change in home export was detected or the home Either the home path does not belong to GPFS, or the
path is stale. AFM control file is not present in the exported path.
User response: User response:
Ensure the exported path is accessible. If the exported path belongs to a GPFS file system, run
6027-3254 [W] AFM: Home is taking longer than the mmafmconfig command with the enable option
expected to respond for cache on the export path at home.
filesetName. Synchronization with 6027-3259 [I] AFM: Remote file system
home is temporarily suspended. fileSystem is panicked due
Explanation: to unresponsive messages on
A pending message from gateway node to home is fileset filesetName, re-mount the
taking longer than expected to respond. This could be file system after it becomes
the result of a network issue or a problem at the home responsive.
site. Explanation:
User response: SGPanic is triggered if remote file system is
Ensure the exported path is accessible. unresponsive to bail out inflight messages that are
stuck in the queue.
6027-3255 [E] AFM: Target for fileset filesetName
is a secondary-mode fileset User response:
or file system. Only a primary- Re-mount the remote file system when it is
mode, read-only or local-update responsive.
mode fileset can operate on 6027-3260 [W] AFM home is taking longer
a secondary-mode fileset. The than expected to respond for
fileset will be disabled. cache filesetName, mount index
Explanation: mountIndex.
The AFM target is a secondary fileset or file system. Explanation:
Only a primary-mode, read-only, or local-update fileset A pending message from the gateway node to home is
can operate on a secondary-mode fileset. taking longer than expected to respond. This could be
User response: because of a network issue or a problem at the home
Use a secondary-mode fileset as the target for the site.
primary-mode, read-only or local-update mode fileset. User response:
6027-3256 [I] AFM: The RPO peer snapshot was Ensure that the exported path is accessible.
missed for file system fileSystem 6027-3261 [E] AFM: Bucket access issue is
fileset filesetName. detected for cache filesetname.
Explanation: Synchronization with home (COS)
The periodic RPO peer snapshot was not taken in time is suspended until the problem is
for the primary fileset. resolved.



Explanation: Explanation:
A change in COS bucket access is detected. The afmRecoveryVer2 parameter is enabled at the
Cache/Primary site, but the Home/Secondary site
User response:
doesn't support for the afmRecoveryVer2 parameter.
Ensure that the COS bucket is accessible.
User response:
6027-3262 [I] AFM: COS access of afmTarget
Ensure that Home/Secondary is updated to the same
succeeded for file system
or higher version of IBM Storage Scale that is running
fileSystem fileset filesetName.
at the Cache/Primary.
Caching is enabled.
6027-3267 [E] AFM: Invalid control file detected
Explanation:
for fileset filesetName, re-enable
COS access succeeded. Caching is enabled.
AFM on the target path.
User response:
Explanation:
None.
The AFM control file has become invalid at the
6027-3263 [E] AFM: COS access of afmTarget exported path.
failed with error error for file
User response:
system fileSystem, fileset ID
Run the mmafmconfig command with the disable
filesetName. Caching will be
option and immediately run the mmafmconfig
disabled and the access will be
command with the enable option on the export path
tried again after mountRetryTime
at home.
seconds, on the next request to
the gateway. 6027-3268 [E] AFM: Filesets cannot be disabled
online for file system filesetName
Explanation:
because IBM Storage Scale
COS access failed. The access will be tried again after
version is less than 5.1.7.0.
mountRetryTime seconds.
Explanation:
User response:
The AFM fileset cannot be disabled online.
Verify that the afmTarget can be accessible by using
the specified protocol. User response:
Run the mmchfs command to upgrade the file
6027-3264 [E] AFM: IAM modes does not match
system to the latest version and re-run the disable
between Cache (cacheImode) and
command.
Home (homeImode) for the fileset
filesetName (fileSystem). 6027-3269 [E] AFM: Recovery on file system
SGName fileset filesetName
Explanation:
failed with error err. Please
The IAM compliance mode set on home fileset does
check file system manager
not match with the IAM compliance mode that is set at
logs for more information on
the cache fileset.
management command failures
User response: like crsnapshot, delsnapshot
Ensure that both the cache and home filesets carry the etc. Recovery will be retried on
same IAM compliance modes. next access after recovery retry
interval (pCacheDroppedTimeout
6027-3265 [I] AFM: AFM operation has
seconds) or manually resolve
encountered an error.
known problems and recover the
Explanation: fileset.
AFM operation has encountered an error. Refer to
Explanation:
the documentation to know more about AFM error
AFM Recovery failed. Fileset will be in dropped state
messages.
due to error 78. Please check the file system manager
User response: node logs for more information about the reason of
Refer to Troubleshooting AFM issues in the IBM this error, this is happened because of management
Storage Scale: Problem Determination Guide and take command failures like crsnapshot, delsnapshot
appropriate action. etc. Fileset will be recovered on accessing fileset after
timeout mentioned in the error message. User can
6027-3266 [W] AFM: Home/Secondary does not
recover the fileset manually by running the mmafmctl
support the afmRecoveryVer2
command with recover option after rectifying any
parameter of a fileset filesetName
known errors leading to failure.
(filesystemName).

Chapter 42. References 885


User response: Explanation:
Check the file system manager logs. Free disk space reclaims on some regions failed
during tsreclaim run. Typically this is due to the lack
6027-3270 [I] AFM: RPO or peer snapshots
of space reclaim support by the disk controller or
are not supported when
operating system. It may also be due to utilities such
afmFastCreates is enabled,
as mmdefragfs or mmfsck running concurrently.
rc=error.
User response:
Explanation:
Verify that the disk controllers and the operating
The mmpsnap status command is not valid for
systems in the cluster support thin-provisioning space
afmFastCreate enabled fileset.
reclaim. Or, rerun the mmfsctl reclaimSpace
User response: command after mmdefragfs or mmfsck completes.
N/A
6027-3305 AFM Fileset filesetName cannot be
6027-3300 Attribute afmShowHomeSnapshot changed as it is in beingDeleted
cannot be changed for a single- state
writer fileset.
Explanation:
Explanation: The user specified a fileset to tschfileset that
Changing afmShowHomeSnapshot is not supported cannot be changed.
for single-writer filesets.
User response:
User response: None. You cannot change the attributes of the root
None. fileset.
6027-3301 Unable to quiesce all nodes; some 6027-3306 Fileset cannot be changed
processes are busy or holding because it is unlinked.
required resources.
Explanation:
Explanation: The fileset cannot be changed when it is unlinked.
A timeout occurred on one or more nodes while
User response:
trying to quiesce the file system during a snapshot
Link the fileset and then try the operation again.
command.
6027-3307 Fileset cannot be changed.
User response:
Check the GPFS log on the file system manager node. Explanation:
Fileset cannot be changed.
6027-3302 Attribute afmShowHomeSnapshot
cannot be changed for a afmMode User response:
fileset. None.
Explanation: 6027-3308 This AFM option cannot be set for
Changing afmShowHomeSnapshot is not supported a secondary fileset.
for single-writer or independent-writer filesets.
Explanation:
User response: This AFM option cannot be set for a secondary fileset.
None. The fileset cannot be changed.
6027-3303 Cannot restore snapshot; quota User response:
management is active for None.
fileSystem.
6027-3309 The AFM attribute specified
Explanation: cannot be set for a primary fileset.
File system quota management is still active. The file
Explanation:
system must be unmounted when restoring global
This AFM option cannot be set for a primary fileset.
snapshots.
The fileset cannot be changed.
User response:
User response:
Unmount the file system and reissue the restore
None.
command.
6027-3310 A secondary fileset cannot be
6027-3304 Attention: Disk space reclaim on
changed.
number of number regions in
fileSystem returned errors. Explanation:
A secondary fileset cannot be changed.



User response: Refer to the details given and correct the file system
None. parameters.
6027-3311 A primary fileset cannot be 6027-3317 Warning: file system is not 4K
changed. aligned due to small reasonString.
Native 4K sector disks cannot be
Explanation:
added to this file system unless
A primary fileset cannot be changed.
the disk that is used is dataOnly
User response: and the data block size is at least
None. 128K.
6027-3312 No inode was found matching the Explanation:
criteria. The file system is created with a small inode or block
size. Native 4K sector disk cannot be added to the file
Explanation:
system, unless the disk that is used is dataOnly and
No inode was found matching the criteria.
the data block size is at least 128K.
User response:
User response:
None.
None.
6027-3313 File system scan RESTARTED due
6027-3318 Fileset filesetName cannot be
to resume of all disks being
deleted as it is in compliant mode
emptied.
and it contains user files.
Explanation:
Explanation:
The parallel inode traversal (PIT) phase is restarted
An attempt was made to delete a non-empty fileset
with a file system restripe.
that is in compliant mode.
User response:
User response:
None.
None.
6027-3314 File system scan RESTARTED due
6027-3319 The AFM attribute optionName
to new disks to be emptied.
cannot be set for a primary fileset.
Explanation:
Explanation:
The file system restripe was restarted after a new disk
This AFM option cannot be set for a primary fileset.
was suspended.
Hence, the fileset cannot be changed.
User response:
User response:
None.
None.
6027-3315 File system scan CANCELLED due
6027-3320 commandName:
to new disks to be emptied or
indefiniteRetentionProtection is
resume of all disks being emptied.
enabled. File system cannot be
Explanation: deleted.
The parallel inode traversal (PIT) phase is cancelled
Explanation:
during the file system restripe.
Indefinite retention is enabled for the file system so it
User response: cannot be deleted.
None.
User response:
6027-3316 Unable to create file system None.
because there is not enough space
6027-3321 Snapshot snapshotName is an
for the log files. Number of log
internal pcache recovery snapshot
files: numberOfLogFiles. Log file
and cannot be deleted by user.
size: logFileSize. Change one or
more of the following as suggested Explanation:
and try again: The snapshot cannot be deleted by user as it is an
internal pcache recovery snapshot.
Explanation:
There is not enough space available to create all the User response:
required log files. This can happen when the storage None.
pool is not large enough.
6027-3327 Disk diskName cannot be
User response: added to the storage pool



6027-3327 Disk diskName cannot be added to the storage pool poolName. Allocation map cannot accommodate disks larger than size.
Explanation:
The specified disk is too large compared to the disks that were initially used to create the storage pool.
User response:
Specify a smaller disk or add the disk to a new storage pool.

6027-3328 The parent of dependent fileset filesetName is not an independent fileset or not in the Stopped state.
Explanation:
The parent of the specified dependent fileset is not an independent fileset or not in AFM Stopped state.
User response:
Ensure that the parent fileset is an independent fileset or stop the parent fileset's replication so that it moves to Stopped state.

6027-3329 The AFM fileset of dependent fileset filesetName is not in Stopped state.
Explanation:
The parent of the specified dependent fileset is not in AFM Stopped state.
User response:
Stop the fileset replication so that it moves to the Stopped state.

6027-3330 [E] AFM dependent filesets cannot be created for the file system fileSystemName because the file system version is less than 22.00 (5.0.4.0).
Explanation:
Dependent fileset is not supported in AFM fileset in the file system with the version less than 22.00 (5.0.4.0).
User response:
Minimum file system version 22.00 (5.0.4.0) is required for the AFM dependent fileset support. Upgrade the file system version by using the mmchfs command.

6027-3331 The AFM file system is not in Stopped state.
Explanation:
The AFM file system is not in Stopped state.
User response:
Stop the root fileset replication by using mmafmctl command with stop option.

6027-3332 [E] AFM object filesets cannot be created for file system fileSystem because file system version is less than 24.00 (5.1.0.0).
Explanation:
AFM object filesets cannot be created for file system version less than 24.00.
User response:
Minimum file system version 24.00 (5.1.0.0) is required to create AFM object fileset.

6027-3333 [E] IAM-mode cannot be enabled for fileset on the file system name1 because the file system version is less than 5.1.1.0.
Explanation:
IAM-mode cannot be enabled for fileset on the file system because the file system version is less than 5.1.1.0.
User response:
Minimum file system version 5.1.1.0 is required for the AFM IAM-mode to be enabled on fileset. Upgrade the file system version with the mmchfs command.

6027-3334 Fileset filesetName cannot be deleted as it is in compliant or compliant-plus mode and it contains user files.
Explanation:
An attempt was made to delete a non-empty fileset that is in compliant or compliant-plus mode.
User response:
None.

6027-3335 Dependent fileset cannot be linked into AFM Object fileset.
Explanation:
Dependent fileset support is not enabled into AFM object fileset.
User response:
None.

6027-3336 Fileset filesetName is the audit log fileset. Disable file audit logging by using the mmaudit disable command before deleting this fileset.
Explanation:
An attempt was made to delete the audit fileset that stores audit records while file audit logging is enabled.
User response:
Disable file audit logging by using the mmaudit disable command before deleting this fileset.
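As a hedged illustration of the response to 6027-3336, file audit logging is disabled for the file system before its audit fileset is removed. The device name fs1 and the fileset name auditfset are placeholders; confirm which fileset actually holds the audit log before deleting anything:

   mmaudit fs1 disable
   mmunlinkfileset fs1 auditfset
   mmdelfileset fs1 auditfset

The unlink and delete steps are only needed if the audit fileset is to be removed after audit logging has been disabled.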
6027-3337 Cannot retrieve the file audit logging fileset name. You can attempt to disable file audit logging to delete this fileset.
Explanation:
While deleting the fileset, an attempt to retrieve the file audit logging fileset failed. The audit log fileset should not be deleted while file audit logging is enabled.
User response:
Disable file audit logging by using the mmaudit disable command.

6027-3338 Scanning file system metadata, phase number phaseDescription.
Explanation:
Progress information.
User response:
N/A

6027-3400 Attention: The file system is at risk. The specified replication factor does not tolerate unavailable metadata disks.
Explanation:
The default metadata replication was reduced to one while there were unavailable, or stopped, metadata disks. This condition prevents future file system manager takeover.
User response:
Change the default metadata replication, or delete unavailable disks if possible.

6027-3401 Failure group value for disk diskName is not valid.
Explanation:
An explicit failure group must be specified for each disk that belongs to a write affinity enabled storage pool.
User response:
Specify a valid failure group.

6027-3402 [X] An unexpected device mapper path dmDevice (nsdId) was detected. The new path does not have Persistent Reserve enabled. The local access to disk diskName will be marked as down.
Explanation:
A new device mapper path was detected, or a previously failed path was activated after the local device discovery was finished. This path lacks a Persistent Reserve and cannot be used. All device paths must be active at mount time.
User response:
Check the paths to all disks in the file system. Repair any failed paths to disks then rediscover the local disk access.

6027-3404 [E] The current file system version does not support write caching.
Explanation:
The current file system version does not allow the write caching option.
User response:
Use mmchfs -V to convert the file system to version 14.04 (4.1.0.0) or higher and reissue the command.

6027-3405 [E] Cannot change the rapid repair, \"fileSystemName\" is mounted on number node(s).
Explanation:
Rapid repair can only be changed on unmounted file systems.
User response:
Unmount the file system before running this command.

6027-3406 Error: Cannot add 4K native dataOnly disk diskName to non-4K aligned file system unless the file system version is at least 4.1.1.4.
Explanation:
An attempt was made through the mmadddisk command to add a 4K native disk to a non-4K aligned file system while the file system version is not at 4.1.1.4 or later.
User response:
Upgrade the file system to 4.1.1.4 or later, and then retry the command.

6027-3407 [E] Disk failure. Volume name. rc = value, and physical volume name.
Explanation:
An I/O request to a disk or a request to fence a disk is failed in such a manner that GPFS can no longer use the disk.
User response:
Check the disk hardware and the software subsystems in the path to the disk.
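For the disk-path responses above (for example 6027-3402 and 6027-3407), a typical repair sequence after the hardware paths have been fixed looks like the following sketch; fs1 is a placeholder device name:

   mmlsdisk fs1 -e          # list disks that are not in a normal up/ready state
   mmnsddiscover -a         # rediscover local access to the NSDs
   mmchdisk fs1 start -a    # attempt to bring stopped or down disks back up

Run the commands from a node that should have local access to the disks, and recheck the mmlsdisk output afterwards.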
6027-3408 [X] File System fileSystemName unmounted by the system with return code value, reason code value, at line value in name.
Explanation:
Console log entry caused by a forced unmount due to a problem such as disk or communication failure.
User response:
Correct the underlying problem and remount the file system.

6027-3409 Failed to enable maintenance mode for this file system. Maintenance mode can only be enabled once the file system has been unmounted everywhere. You can run the mmlsmount <File System> -L command to see which nodes have this file system mounted. You can also run this command with the --wait option, which will prevent new mounts and automatically enables maintenance mode once the unmounts are finished.
Explanation:
An attempt was made through mmchfs command to enable maintenance mode while the file system has been mounted by some nodes.
User response:
Rerun the command after the file system has been unmounted everywhere, or run the command with --wait option to enable maintenance mode for the file system.

6027-3411 [E] The option --flush-on-close cannot be enabled. Use option -V to enable most recent features.
Explanation:
mmchfs --flush-on-close cannot be enabled under the current file system format version.
User response:
Run mmchfs -V, this will change the file system format to the latest format supported.

6027-3412 [E] The options --auto-inode-limit | --noauto-inode-limit cannot be enabled. Use option -V to enable most recent features.
Explanation:
mmchfs [--auto-inode-limit | --noauto-inode-limit] cannot be enabled under the current file system format version.
User response:
Run mmchfs -V, this will change the file system format to the latest format supported.

6027-3413 [E] Support for '--nfs4-owner-write-acl' has not been enabled. Use option '-V' to enable most recent features.
Explanation:
mmchfs --nfs4-owner-write-acl cannot be enabled under the current file system format version.
User response:
Run mmchfs -V, this will change the file system format to the latest format supported.

6027-3415 [E] Support for --inode-segment-mgr has not been enabled. Use option -V to enable most recent features.
Explanation:
mmchfs --inode-segment-mgr cannot be enabled under the current file system format version.
User response:
Run mmchfs -V, this will change the file system format to the latest format supported.

6027-3416 [I] The new value of --inode-segment-mgr requires all the mounted nodes to remount the file system.
Explanation:
The new value of the --inode-segment-mgr option would be effective on a node when the node remounts the file system.
User response:
Remount the file system on all nodes that have the file system mounted.

6027-3417 [E] The options --mitigate-recall-storms | --nomitigate-recall-storms cannot be enabled. Use option -V to enable most recent features.
Explanation:
mmchfs [--mitigate-recall-storms | --nomitigate-recall-storms] cannot be enabled under the current file system format version.
User response:
Run mmchfs -V, this will change the file system format to the latest format supported. Remount the file system on all nodes that have the file system mounted.

6027-3450 Error errorNumber when purging key (file system fileSystem). Key name format possibly incorrect.
Explanation:
An error was encountered when purging a key from the key cache. The specified key name might have been incorrect, or an internal error was encountered.
User response:
Ensure that the key name specified in the command is correct.
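Several responses above ask you to raise the file system format version with mmchfs -V. A minimal sketch, with fs1 as a placeholder device name:

   mmlsfs fs1 -V        # show the current and original file system format versions
   mmchfs fs1 -V full   # move the format to the latest version supported by the installed code

Because -V full enables features that nodes running older code cannot use, confirm the cluster's minimum release level first (for example with mmlsconfig minReleaseLevel), and remount the file system where a message states that a remount is required.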
6027-3451 Error errorNumber when emptying Restart GPFS. Contact the IBM Support Center.
cache (file system fileSystem). 6027-3460 [E] Incorrect format for the Keyname
Explanation: string.
An error was encountered when purging all the keys Explanation:
from the key cache. An incorrect format was used when specifying the
User response: Keyname string.
Contact the IBM Support Center. User response:
6027-3452 [E] Unable to create encrypted file Verify the format of the Keyname string.
fileName (inode inodeNumber, 6027-3461 [E] Error code: errorNumber.
fileset filesetNumber, file system
fileSystem). Explanation:
An error occurred when processing a key ID.
Explanation:
Unable to create a new encrypted file. The key User response:
required to encrypt the file might not be available. Contact the IBM Support Center.
User response: 6027-3462 [E] Unable to rewrap key: original
Examine the error message following this message for key name: originalKeyname,
information on the specific failure. new key name: newKeyname
(inode inodeNumber, fileset
6027-3453 [E] Unable to open encrypted filesetNumber, file system
file: inode inodeNumber, fileset fileSystem).
filesetNumber, file system
fileSystem. Explanation:
Unable to rewrap the key for a specified file, possibly
Explanation: because the existing key or the new key cannot be
Unable to open an existing encrypted file. The key retrieved from the key server.
used to encrypt the file might not be available.
User response:
User response: Examine the error message following this message for
Examine the error message following this message for information on the specific failure.
information on the specific failure.
6027-3463 [E] Rewrap error.
6027-3457 [E] Unable to rewrap key with name
Keyname (inode inodeNumber, Explanation:
fileset filesetNumber, file system An internal error occurred during key rewrap.
fileSystem). User response:
Explanation: Examine the error messages surrounding this
Unable to rewrap the key for a specified file because of message. Contact the IBM Support Center.
an error with the key name. 6027-3464 [E] New key is already in use.
User response: Explanation:
Examine the error message following this message for The new key specified in a key rewrap is already being
information on the specific failure. used.
6027-3458 [E] Invalid length for the Keyname User response:
string. Ensure that the new key specified in the key rewrap is
Explanation: not being used by the file.
The Keyname string has an incorrect length. The length 6027-3465 [E] Cannot retrieve original key.
of the specified string was either zero or it was larger
than the maximum allowed length. Explanation:
Original key being used by the file cannot be retrieved
User response: from the key server.
Verify the Keyname string.
User response:
6027-3459 [E] Not enough memory. Verify that the key server is available, the credentials
Explanation: to access the key server are correct, and that the key is
Unable to allocate memory for the Keyname string. defined on the key server.
User response: 6027-3466 [E] Cannot retrieve new key.

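The Keyname strings referred to in these messages combine a key ID and the RKM ID defined in RKM.conf, separated by a colon, as in the following invented example:

   KEY-326a1906-be46-4983-a63e-29f005fb3a15:rkmSrv1

Both parts are placeholders; the key ID must exist on the key server and the RKM ID must match a backend stanza in RKM.conf.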


Explanation: 6027-3473 [E] Could not locate the RKM.conf file.
Unable to retrieve the new key specified in the rewrap
Explanation:
from the key server.
Unable to locate the RKM.conf configuration file.
User response:
User response:
Verify that the key server is available, the credentials
Contact the IBM Support Center.
to access the key server are correct, and that the key is
defined on the key server. 6027-3474 [E] Could not open fileType file
('fileName' was specified).
6027-3468 [E] Rewrap error code errorNumber.
Explanation:
Explanation:
Unable to open the specified configuration file.
Key rewrap failed.
Encryption files will not be accessible.
User response:
User response:
Record the error code and contact the IBM Support
Ensure that the specified configuration file is present
Center.
on all nodes.
6027-3469 [E] Encryption is enabled but the
6027-3475 [E] Could not read file 'fileName'.
crypto module could not be
initialized. Error code: number. Explanation:
Ensure that the GPFS crypto Unable to read the specified file.
package was installed.
User response:
Explanation: Ensure that the specified file is accessible from the
Encryption is enabled, but the cryptographic module node.
required for encryption could not be loaded.
6027-3476 [E] Could not seek through file
User response: 'fileName'.
Ensure that the packages required for encryption are
Explanation:
installed on each node in the cluster.
Unable to seek through the specified file. Possible
6027-3470 [E] Cannot create file fileName: inconsistency in the local file system where the file is
extended attribute is too stored.
large: numBytesRequired bytes
User response:
(numBytesAvailable available)
Ensure that the specified file can be read from the
(fileset filesetNumber, file system
local node.
fileSystem).
6027-3477 [E] Could not wrap the FEK.
Explanation:
Unable to create an encryption file because the Explanation:
extended attribute required for encryption is too large. Unable to wrap the file encryption key.
User response: User response:
Change the encryption policy so that the file key is Examine other error messages. Verify that the
wrapped fewer times, reduce the number of keys used encryption policies being used are correct.
to wrap a file key, or create a file system with a larger
6027-3478 [E] Insufficient memory.
inode size.
6027-3471 [E] Explanation:
At least one key must be specified.
Internal error: unable to allocate memory.
Explanation:
User response:
No key name was specified.
Restart GPFS. Contact the IBM Support Center.
User response:
6027-3479 [E] Missing combine parameter string.
Specify at least one key name.
6027-3472 [E] Explanation:
Could not combine the keys.
The combine parameter string was not specified in the
Explanation: encryption policy.
Unable to combine the keys used to wrap a file key.
User response:
User response: Verify the syntax of the encryption policy.
Examine the keys being used. Contact the IBM Support
6027-3480 [E] Missing encryption parameter
Center.
string.

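Many of the responses above ask you to verify the syntax of the encryption policy. As a rough sketch only (rule names, the key ID, and the RKM ID are invented placeholders; the authoritative syntax is in the Encryption chapter of the Administration Guide), a simple pair of encryption rules might look like:

   RULE 'E1Def' ENCRYPTION 'E1' IS
       ALGO 'DEFAULTNISTSP800131A'
       KEYS('KEY-1a2b3c4d:rkmSrv1')
   RULE 'encryptAll' SET ENCRYPTION 'E1' WHERE NAME LIKE '%'

Such a policy is installed with mmchpolicy; a malformed combine, encryption, or wrapping parameter string produces the parse errors listed in this section.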


Explanation: Explanation:
The encryption parameter string was not specified in The remote key manager ID cannot be longer than the
the encryption policy. specified length.
User response: User response:
Verify the syntax of the encryption policy. Use a shorter remote key manager ID.
6027-3481 [E] Missing wrapping parameter 6027-3488 [E] The length of the key ID cannot be
string. zero.
Explanation: Explanation:
The wrapping parameter string was not specified in the The length of the specified key ID string cannot be
encryption policy. zero.
User response: User response:
Verify the syntax of the encryption policy. Specify a key ID string with a valid length.
6027-3482 [E] 'combineParameter' could not be 6027-3489 [E] The length of the RKM ID cannot
parsed as a valid combine be zero.
parameter string.
Explanation:
Explanation: The length of the specified RKM ID string cannot be
Unable to parse the combine parameter string. zero.
User response: User response:
Verify the syntax of the encryption policy. Specify an RKM ID string with a valid length.
6027-3483 [E] 'encryptionParameter' could not 6027-3490 [E] The maximum size of the
be parsed as a valid encryption RKM.conf file currently supported
parameter string. is number bytes.
Explanation: Explanation:
Unable to parse the encryption parameter string. The RKM.conf file is larger than the size that is
currently supported.
User response:
Verify the syntax of the encryption policy. User response:
User a smaller RKM.conf configuration file.
6027-3484 [E] 'wrappingParameter' could not
be parsed as a valid wrapping 6027-3491 [E] The string 'Keyname' could not be
parameter string. parsed as a valid key name.
Explanation: Explanation:
Unable to parse the wrapping parameter string. The specified string could not be parsed as a valid key
name.
User response:
Verify the syntax of the encryption policy. User response:
Specify a valid Keyname string.
6027-3485 [E] The Keyname string cannot be
longer than number characters. 6027-3493 [E] numKeys keys were specified but
a maximum of numKeysMax is
Explanation:
supported.
The specified Keyname string has too many characters.
Explanation:
User response:
The maximum number of specified key IDs was
Verify that the specified Keyname string is correct.
exceeded.
6027-3486 [E] The KMIP library could not be
User response:
initialized.
Change the encryption policy to use fewer keys.
Explanation:
6027-3494 [E] Unrecognized cipher mode.
The KMIP library used to communicate with the key
server could not be initialized. Explanation:
Unable to recognize the specified cipher mode.
User response:
Restart GPFS. Contact the IBM Support Center. User response:
Specify one of the valid cipher modes.
6027-3487 [E] The RKM ID cannot be longer than
number characters. 6027-3495 [E] Unrecognized cipher.



Explanation: User response:
Unable to recognize the specified cipher. Specify a valid cipher.
User response: 6027-3504 [E] Unrecognized encryption mode
Specify one of the valid ciphers. ('mode').
6027-3496 [E] Unrecognized combine mode. Explanation:
The specified encryption mode was not recognized.
Explanation:
Unable to recognize the specified combine mode. User response:
Specify a valid encryption mode.
User response:
Specify one of the valid combine modes. 6027-3505 [E] Invalid key length ('keyLength').
6027-3497 [E] Unrecognized encryption mode. Explanation:
The specified key length was incorrect.
Explanation:
Unable to recognize the specified encryption mode. User response:
Specify a valid key length.
User response:
Specify one of the valid encryption modes. 6027-3506 [E] Mode 'mode1' is not compatible
with mode 'mode2', aborting.
6027-3498 [E] Invalid key length.
Explanation:
Explanation:
The two specified encryption parameters are not
An invalid key length was specified.
compatible.
User response:
User response:
Specify a valid key length for the chosen cipher mode.
Change the encryption policy and specify compatible
6027-3499 [E] Unrecognized wrapping mode. encryption parameters.
Explanation: 6027-3509 [E] Key 'keyID:RKMID' could not
Unable to recognize the specified wrapping mode. be fetched (RKM reported error
errorNumber).
User response:
Specify one of the valid wrapping modes. Explanation:
The key with the specified name cannot be fetched
6027-3500 [E] Duplicate Keyname string
from the key server.
'keyIdentifier'.
User response:
Explanation:
Examine the error messages to obtain information
A given Keyname string has been specified twice.
about the failure. Verify connectivity to the key server
User response: and that the specified key is present at the server.
Change the encryption policy to eliminate the
6027-3510 [E] Could not bind symbol
duplicate.
symbolName (errorDescription).
6027-3501 [E] Unrecognized combine mode
Explanation:
('combineMode').
Unable to find the location of a symbol in the library.
Explanation:
User response:
The specified combine mode was not recognized.
Contact the IBM Support Center.
User response:
6027-3512 [E] The specified type 'type' for
Specify a valid combine mode.
backend 'backend' is invalid.
6027-3502 [E] Unrecognized cipher mode
Explanation:
('cipherMode').
An incorrect type was specified for a key server
Explanation: backend.
The specified cipher mode was not recognized.
User response:
User response: Specify a correct backend type in RKM.conf.
Specify a valid cipher mode.
6027-3513 [E] Duplicate backend 'backend'.
6027-3503 [E] Unrecognized cipher ('cipher').
Explanation:
Explanation: A duplicate backend name was specified in RKM.conf.
The specified cipher was not recognized.

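Where a response in this section asks you to verify that the encryption libraries are installed (for example after a KMIP or crypto initialization failure), a quick check on each affected node is a sketch like the following; the crypto package name can differ by edition and platform:

   rpm -qa | grep -i gpfs.gskit
   rpm -qa | grep -i gpfs.crypto   # name of the crypto package is an assumption; verify for your edition

On Debian-based systems use dpkg -l instead; reinstall any missing package and restart GPFS on the node.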


User response: 6027-3521 [E] 'timeout' is not a valid connection
Specify unique RKM backends in RKM.conf. timeout.
6027-3515 cmdName error: The password Explanation:
specified exceeds the maximum The value specified for the connection timeout is
length of length characters. incorrect.
Explanation: User response:
The password provided with the --pwd argument is Specify a valid connection timeout (in seconds).
too long.
6027-3522 [E] 'url' is not a valid URL.
User response:
Explanation:
Pick a shorter password and retry the operation.
The specified string is not a valid URL for the key
6027-3516 [E] Error on server.
gpfs_restripe_file(pathName):
User response:
errorString (offset=offset)
Specify a valid URL for the key server.
Explanation:
6027-3524 [E] 'tenantName' is not a valid
An error occurred while migrating the named file. File
tenantName.
may be ill-placed due to this failure.
Explanation:
User response:
An incorrect value was specified for the tenant name.
Investigate the file and possibly reissue the command.
Use the mmlsattr and mmchattr commands to User response:
examine and change the pool attributes of the named Specify a valid tenant name.
file.
6027-3527 [E] Backend 'backend' could not be
6027-3517 [E] Could not open library (libName). initialized (error errorNumber).
Explanation: Explanation:
Unable to open the specified library. Key server backend could not be initialized.
User response: User response:
Verify that all required packages are installed for Examine the error messages. Verify connectivity to the
encryption. Contact the IBM Support Center. server. Contact the IBM Support Center.
6027-3518 [E] The length of the RKM ID string 6027-3528 [E] Unrecognized wrapping mode
is invalid (must be between 0 and ('wrapMode').
length characters).
Explanation:
Explanation: The specified key wrapping mode was not recognized.
The length of the RKM backend ID is invalid.
User response:
User response: Specify a valid key wrapping mode.
Specify an RKM backend ID with a valid length.
6027-3529 [E] An error was encountered while
6027-3519 [E] 'numAttempts' is not a valid processing file 'fileName':
number of connection attempts.
Explanation:
Explanation: An error was encountered while processing the
The value specified for the number of connection specified configuration file.
attempts is incorrect.
User response:
User response: Examine the error messages that follow and correct
Specify a valid number of connection attempts. the corresponding conditions.
6027-3520 [E] 'sleepInterval' is not a valid sleep 6027-3530 [E] Unable to open encrypted file:
interval. key retrieval not initialized
Explanation: (inode inodeNumber, fileset
The value specified for the sleep interval is incorrect. filesetNumber, file system
fileSystem).
User response:
Explanation:
Specify a valid sleep interval value (in microseconds).

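The connection-related values mentioned above (URL, tenant name, number of connection attempts, sleep interval, timeout) are set in the backend stanza in RKM.conf. The following is a sketch only, with invented names and values; check the Encryption chapter for the exact keywords supported by your release:

   rkmSrv1 {
       type = ISKLM
       kmipServerUri = tls://keyserver.example.com:5696
       keyStore = /var/mmfs/etc/RKMcerts/keystore.p12
       passphrase = someSecret
       clientCertLabel = scaleClient
       tenantName = GPFS_TENANT
       connectionTimeout = 60
       connectionAttempts = 3
       retrySleep = 100000
   }

The RKM ID used in key names ('keyID:rkmID') must match the stanza name (rkmSrv1 here).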


File is encrypted but the infrastructure required to 6027-3541 [E] Encryption is not supported on
retrieve encryption keys was not initialized, likely Windows.
because processing of RKM.conf failed.
Explanation:
User response: Encryption cannot be activated if there are Windows
Examine error messages at the time the file system nodes in the cluster.
was mounted.
User response:
6027-3533 [E] Invalid encryption key derivation Ensure that encryption is not activated if there are
function. Windows nodes in the cluster.
Explanation: 6027-3543 [E] The integrity of the file encrypting
An incorrect key derivation function was specified. key could not be verified after
unwrapping; the operation was
User response:
cancelled.
Specify a valid key derivation function.
Explanation:
6027-3534 [E] Unrecognized encryption
When opening an existing encrypted file, the integrity
key derivation function
of the file encrypting key could not be verified. Either
('keyDerivation').
the cryptographic extended attributes were damaged,
Explanation: or the master key(s) used to unwrap the FEK have
The specified key derivation function was not changed.
recognized.
User response:
User response: Check for other symptoms of data corruption, and
Specify a valid key derivation function. verify that the configuration of the key server has not
changed.
6027-3535 [E] Incorrect client certificate label
'clientCertLabel' for backend 6027-3545 [E] Encryption is enabled but there
'backend'. is no valid license. Ensure that
Explanation: the GPFS crypto package was
The specified client keypair certificate label is installed properly.
incorrect for the backend. Explanation:
The required license is missing for the GPFS
User response:
encryption package.
Ensure that the correct client certificate label is used
in RKM.conf. User response:
Ensure that the GPFS encryption package was
6027-3537 [E] Setting default encryption
installed properly.
parameters requires empty
combine and wrapping parameter 6027-3546 [E] Key 'keyID:rkmID' could not be
strings. fetched. The specified RKM ID
Explanation: does not exist; check the RKM.conf
A non-empty combine or wrapping parameter string settings.
was used in an encryption policy rule that also uses Explanation:
the default parameter string. The specified RKM ID part of the key name does not
exist, and therefore the key cannot be retrieved. The
User response:
corresponding RKM might have been removed from
Ensure that neither the combine nor the wrapping
RKM.conf.
parameter is set when the default parameter string is
used in the encryption rule. User response:
Check the set of RKMs specified in RKM.conf.
6027-3540 [E] The specified RKM backend type
(rkmType) is invalid. 6027-3547 [E] Key 'keyID:rkmID' could not be
Explanation: fetched. The connection was reset
The specified RKM type in RKM.conf is incorrect. by the peer while performing the
TLS handshake.
User response:
Explanation:
Ensure that only supported RKM types are specified in
The specified key could not be retrieved from the
RKM.conf.
server, because the connection with the server was
reset while performing the TLS handshake.

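When key fetches fail with the name-resolution, TCP, or TLS handshake errors above, it can help to test the path to the key server from the affected node outside of GPFS. A sketch, with the host name and KMIP port as placeholders:

   host keyserver.example.com
   ping -c 3 keyserver.example.com
   openssl s_client -connect keyserver.example.com:5696

The openssl output shows whether the TLS handshake completes and which certificates the server presents.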


User response: 6027-3554 The restore command encountered
Check connectivity to the server. Check credentials to an out-of-memory error.
access the server. Contact the IBM Support Center.
Explanation:
6027-3548 [E] Key 'keyID:rkmID' could not be The fileset snapshot restore command encountered an
fetched. The IP address of the out-of-memory error.
RKM could not be resolved.
User response:
Explanation: None.
The specified key could not be retrieved from the
6027-3555 name must be combined with
server because the IP address of the server could not
FileInherit, DirInherit or both.
be resolved.
Explanation:
User response:
NoPropagateInherit must be accompanied by
Ensure that the hostname of the key server is
other inherit flags. Valid values are FileInherit and
correct. Verify whether there are problems with name
DirInherit.
resolutions.
User response:
6027-3549 [E] Key 'keyID:rkmID' could not be
Specify a valid NFSv4 option and reissue the
fetched. The TCP connection with
command.
the RKM could not be established.
6027-3556 cmdName error: insufficient
Explanation:
memory.
Unable to establish a TCP connection with the key
server. Explanation:
User response: The command exhausted virtual memory.
Check the connectivity to the key server. User response:
Consider some of the command parameters that might
6027-3550 Error when retrieving encryption
affect memory usage. Contact the IBM Support Center.
attribute: errorDescription.
6027-3557 cmdName error: could not create a
Explanation:
Unable to retrieve or decode the encryption attribute temporary file.
for a given file. Explanation:
A temporary file could not be created in the current
User response:
directory.
File could be damaged and may need to be removed if
it cannot be read. User response:
Ensure that the file system is not full and that files can
6027-3551 Error flushing work file fileName:
be created. Contact the IBM Support Center.
errorString
6027-3558 cmdName error: could not
Explanation:
An error occurred while attempting to flush the named initialize the key management
work file or socket. subsystem (error returnCode).
Explanation:
User response:
An internal component of the cryptographic library
None.
could not be properly initialized.
6027-3552 Failed to fork a new process to
User response:
operationString file system.
Ensure that the gpfs.gskit package was installed
Explanation: properly. Contact the IBM Support Center.
Failed to fork a new process to suspend/resume file
6027-3559 cmdName error: could not
system.
create the key database (error
User response: returnCode).
None.
Explanation:
6027-3553 Failed to sync fileset filesetName. The key database file could not be created.
Explanation: User response:
Failed to sync fileset. Ensure that the file system is not full and that files can
be created. Contact the IBM Support Center.
User response:
None.



6027-3560 cmdName error: could not create User response:
the new self-signed certificate Ensure that the specified path and file name are
(error returnCode). correct and that you have sufficient permissions to
access the file.
Explanation:
A new certificate could not be successfully created. 6027-3567 cmdName error: could not convert
the private key.
User response:
Ensure that the supplied canonical name is valid. Explanation:
Contact the IBM Support Center. The private key material could not be converted
successfully.
6027-3561 cmdName error: could not extract
the key item (error returnCode). User response:
Contact the IBM Support Center.
Explanation:
The public key item could not be extracted 6027-3568 cmdName error: could not extract
successfully. the private key information
structure.
User response:
Contact the IBM Support Center. Explanation:
The private key could not be extracted successfully.
6027-3562 cmdName error: base64
conversion failed (error User response:
returnCode). Contact the IBM Support Center.
Explanation: 6027-3569 cmdName error: could not convert
The conversion from or to the BASE64 encoding could the private key information to DER
not be performed successfully. format.
User response: Explanation:
Contact the IBM Support Center. The private key material could not be converted
successfully.
6027-3563 cmdName error: could not extract
the private key (error returnCode). User response:
Contact the IBM Support Center.
Explanation:
The private key could not be extracted successfully. 6027-3570 cmdName error: could not encrypt
the private key information
User response:
structure (error returnCode).
Contact the IBM Support Center.
Explanation:
6027-3564 cmdName error: could not The private key material could not be encrypted
initialize the ICC subsystem (error successfully.
returnCode returnCode).
User response:
Explanation: Contact the IBM Support Center.
An internal component of the cryptographic library
could not be properly initialized. 6027-3571 cmdName error: could not insert
the key in the keystore, check your
User response:
system's clock (error returnCode).
Ensure that the gpfs.gskit package was installed
properly. Contact the IBM Support Center. Explanation:
Insertion of the new keypair into the keystore failed
6027-3565 cmdName error: I/O error. because the local date and time are not properly set
Explanation: on your system.
A terminal failure occurred while performing I/O. User response:
User response: Synchronize the local date and time on your system
Contact the IBM Support Center. and try this command again.
6027-3566 cmdName error: could not open 6027-3572 cmdName error: could not insert
file 'fileName'. the key in the keystore (error
returnCode).
Explanation:
The specified file could not be opened. Explanation:
Insertion of the new keypair into the keystore failed.

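For 6027-3571, the keystore insert fails because the newly created certificate is not yet valid according to the local clock. A quick check and correction, assuming chrony is the time service in use (adjust for ntpd or systemd-timesyncd as appropriate):

   date -u
   chronyc tracking
   chronyc makestep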


User response: 6027-3579 cmdName error: the cryptographic
Contact the IBM Support Center. library could not be initialized in
FIPS mode.
6027-3573 cmdName error: could not insert
the certificate in the keystore Explanation:
(error returnCode). The cluster is configured to operate in FIPS mode but
the cryptographic library could not be initialized in that
Explanation:
mode.
Insertion of the new certificate into the keystore failed.
User response:
User response:
Verify that the gpfs.gskit package has been
Contact the IBM Support Center.
installed properly and that GPFS supports FIPS mode
6027-3574 cmdName error: could not on your platform. Contact the IBM Support Center.
initialize the digest algorithm.
6027-3580 Failed to sync file system:
Explanation: fileSystem Error: errString.
Initialization of a cryptographic algorithm failed.
Explanation:
User response: Failed to sync file system.
Contact the IBM Support Center.
User response:
6027-3575 cmdName error: error while Check the error message and try again. If the problem
computing the digest. persists, contact the IBM Support Center.
Explanation: 6027-3581 Failed to create the operation list
Computation of the certificate digest failed. file.
User response: Explanation:
Contact the IBM Support Center. Failed to create the operation list file.
6027-3576 cmdName error: could not User response:
initialize the SSL environment Verify that the file path is correct and check the
(error returnCode). additional error messages.
Explanation: 6027-3582 [E] Compression is not supported for
An internal component of the cryptographic library clone or clone-parent files.
could not be properly initialized.
Explanation:
User response: File compression is not supported as the file being
Ensure that the gpfs.gskit package was installed compressed is a clone or a clone parent file.
properly. Contact the IBM Support Center.
User response:
6027-3577 Failed to sync fileset filesetName. None.
errString.
6027-3583 [E] Compression is not supported for
Explanation: snapshot files.
Failed to sync fileset.
Explanation:
User response: The file being compressed is within a snapshot and
Check the error message and try again. If the problem snapshot file compression is not supported.
persists, contact the IBM Support Center.
User response:
6027-3578 [E] pathName is not a valid argument None.
for this command. You must
6027-3584 [E] Current file system version does
specify a path name within a
not support compression.
single GPFS snapshot.
Explanation:
Explanation:
The current file system version is not recent enough
This message is similar to message number 6027-872,
for file compression support.
but the pathName does not specify a path that can be
scanned. The value specified for pathName might be User response:
a .snapdir or similar object. Upgrade the file system to the latest version and retry
the command.
User response:
Correct the command invocation and reissue the
command.



6027-3585 [E] Compression is not supported for 6027-3591 cmdName error: The password
AFM cached files. specified in file fileName exceeds
the maximum length of length
Explanation:
characters.
The file being compressed is cached in an AFM cache
fileset and compression is not supported for such files. Explanation:
The password stored in the specified file is too long.
User response:
None. User response:
Pick a shorter password and retry the operation.
6027-3586 [E] Compression/uncompression
failed. 6027-3592 cmdName error: Could not read
the password from file fileName.
Explanation:
Compression or uncompression failed. Explanation:
The password could not be read from the specified file.
User response:
Refer to the error message below this line for the User response:
cause of the compression failure. Ensure that the file can be read.
6027-3587 [E] Aborting compression as the file is 6027-3593 [E] Compression is supported only for
opened in hyper allocation mode. regular files.
Explanation: Explanation:
Compression operation is not performed because the The file is not compressed because compression is
file is opened in hyper allocation mode. supported only for regular files.
User response: User response:
Compress this file after the file is closed. None.
6027-3588 [E] Aborting compression as the file 6027-3594 [E] Failed to synchronize the being
is currently memory mapped, restored fileset:filesetName. [I]
opened in direct I/O mode, or Please stop the activities in the
stored in a horizontal storage pool. fileset and rerun the command.
Explanation: Explanation:
Compression operation is not performed because it is Failed to synchronize the being restored fileset due to
inefficient or unsafe to compress the file at this time. some conflicted activities in the fileset.
User response: User response:
Compress this file after the file is no longer memory Stop the activities in the fileset and try the command
mapped, opened in direct I/O mode, or stored in a again. If the problem persists, contact the IBM
horizontal storage pool. Support Center.
6027-3589 cmdName error: Cannot set the 6027-3595 [E] Failed to synchronize the
password twice. file system that is being
restored:fileSystem. [I] Please stop
Explanation:
the activities in the file system and
An attempt was made to set the password by using
rerun the command.
different available options.
Explanation:
User response:
Failed to synchronize the file system that is being
Set the password either through the CLI or by
restored due to some conflicted activities in the file
specifying a file that contains it.
system.
6027-3590 cmdName error: Could not access
User response:
file fileName (error errorCode).
Stop the activities in the file system and try the
Explanation: command again. If the problem persists, contact the
The specified file could not be accessed. IBM Support Center.
User response: 6027-3596 cmdName error: could not read/
Check whether the file name is correct and verify write file from/to directory
whether you have required access privileges to access 'pathName'. This path does not
the file. exist.

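Compression in the messages of this area is normally applied either per file with mmchattr or through a policy rule. A minimal sketch on a placeholder path (accepted option values depend on the release and the file system format version):

   mmchattr --compression yes /gpfs/fs1/data/bigfile
   mmlsattr -L /gpfs/fs1/data/bigfile   # shows the compression state of the file

Files that are clones, in snapshots, memory mapped, opened for direct I/O, or AFM cached are rejected, as the surrounding messages describe.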


Explanation: Upgrade the file system to the latest version and retry
A file could not be read from/written to the specified the command.
directory.
6027-3602 [E] Current file system version
User response: does not support the specified
Ensure that the path exists. compression library. Supported
libraries include \"z\" and \"lz4\".
6027-3597 cmdName error: Could not
open directory 'pathName' (error Explanation:
errorCode). The current file system version does not support the
compression library that is specified by the user.
Explanation:
The specified directory could not be opened. User response:
Select "z" or "lz4" as the compression library.
User response:
Ensure that the path exists and it is readable. 6027-3602 [E] Snapshot data migration to
external pool, externalPoolName,
6027-3598 cmdName error: Could not insert
is not supported.
the key in the keystore. Another
key with the specified label Explanation:
already exists (error errorCode). Snapshot data can only be migrated among internal
pools.
Explanation:
The key could not be inserted into the keystore with User response:
the specified label, since another key or certificate None.
already exists with that label.
6027-3604 [E] Fail to load compression library.
User response:
Explanation:
Use another label for the key.
Compression library failed to load.
6027-3599 cmdName error: A certificate with
User response:
the label 'certLabel' already exists
Make sure that the gpfs.compression package is
(error errorCode).
installed.
Explanation:
6027-3605 [E] Current file system version
The certificate could not be inserted into the keystore
does not support the specified
with the specified label, since another key or
compression method. Supported
certificate already exists with that label.
methods in your file version
User response: include: name1.
Use another label for the certificate.
Explanation:
6027-3600 cmdName error: The certificate The current file system version does not support the
'certFilename' already exists in compression library that is specified by the user.
the keystore under another label
User response:
(error errorCode).
Select one of the compression methods indicated.
Explanation:
6027-3606 [E] Unsupported compression
The certificate could not be inserted into the keystore,
method. Supported compression
since it is already stored in the keystore.
methods include: name1.
User response:
Explanation:
None, as the certificate already exists in the keystore
The method specified by user is not supported in any
under another label. The command does not need to
version of IBM Storage Scale.
be rerun.
User response:
6027-3601 [E] Current file system version does
Select one of the compression methods indicated.
not support compression library
selection. 6027-3607 name1: [E:name2] File cloning is
not supported for the compressed
Explanation:
file.
The current file system version does not support file
compression library selection. Explanation:
Compressed file is not supported to be clone parent.
User response:
User response:

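Before rerunning the certificate-related commands above, the client certificate and its CA chain can be inspected with standard OpenSSL tooling; the file names below are placeholders:

   openssl x509 -in client_cert.pem -noout -subject -issuer -enddate
   openssl verify -CAfile ca_chain.pem client_cert.pem

The chain file should contain the CA's root certificate followed by any intermediate CA certificates, in order, as message 6027-3610 requires.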


This file is not a valid target for mmclone operations. 6027-3613 cmdName error: Could not create
the keystore. A keystore file with
6027-3608 Failed to find the clone parent
same name already exists (error
from snapshot ID: name1 for file:
errorCode).
name2 Check whether the clone
file tree is in consistent state. Explanation:
A file with same name as the keystore to be created
Explanation:
already exists.
Either the clone parent or the snapshot which contains
the parent does not exist. User response:
Use a file name for the keystore that is not already
User response:
used.
Run the mmclone command with show option to check
the clone parent file information and ensure that it is 6027-3614 cmdName error: The password
accessible. provided contains non valid
characters.
6027-3609 cmdName error: the certificate's
CA chain file is missing. Explanation:
The password provided contains non-valid characters.
Explanation:
The required --chain option was not provided to the User response:
command. Provide a password that contains only alphanumerical
and punctuation characters.
User response:
Provide a certificate chain file with the --chain 6027-3617 [E] Only one replica is allowed for a
option. file in the performance pool.
6027-3610 cmdName error: No certificates Explanation:
found in the CA chain file. For a file in the performance pool, only one replica is
allowed. mmlsattr -L can be used to check which
Explanation:
storage pool a file belongs. mmlspool -L can be
The CA cert chain file does not contain any certificate.
used to check whether a storage pool is a performance
User response: pool or not.
Provide a certificate chain file that contain all the CA
User response:
certificate chain, starting with the CA's root certificate
Do not use mmchattr -r to increase the number
and continuing with the necessary intermediate CA
of data replicas for a file in the performance pool.
certificates, in order.
Use mmchattr -r 1 -P <performance pool>
6027-3611 cmdName error: Failed to process to migrate a file to the <performance pool>.
the client certificate (error
6027-3700 [E] Key 'keyID' was not found on RKM
name1).
ID 'rkmID'.
Explanation:
Explanation:
The client certificate cannot be processed correctly.
The specified key could not be retrieved from the key
User response: server.
Provide a client certificate received from a certificate
User response:
signing authority or generate a certificate using one of
Verify that the key is present at the server. Verify that
the software packages that provide such functionality,
the name of the keys used in the encryption policy is
like OpenSSL.
correct.
6027-3612 cmdName error: Cannot insert key
6027-3701 [E] Key 'keyID:rkmID' could not be
in keystore. Cannot find a valid
fetched. The authentication with
certificate chain in the keystore
the RKM was not successful.
(error errorCode).
Explanation:
Explanation:
Unable to authenticate with the key server.
The key could not be inserted into the keystore
because a valid and complete certificate chain does User response:
not exist in the keystore. Verify that the credentials used to authenticate with
the key server are correct.
User response:
Ensure a valid certificate chain for the key is added to 6027-3702 [E] Key 'keyID:rkmID' could not be
the keystore with the --chain or --prefix options. fetched. Permission denied.



Explanation: GPFS is operating in FIPS mode, but the initialization
Unable to authenticate with the key server. of the cryptographic library failed.
User response: User response:
Verify that the credentials used to authenticate with Ensure that the packages required for encryption are
the key server are correct. properly installed on each node in the cluster.
6027-3703 [E] I/O error while accessing the 6027-3708 [E] Incorrect passphrase for backend
keystore file 'keystoreFileName'. 'backend'.
Explanation: Explanation:
An error occurred while accessing the keystore file. The specified passphrase is incorrect for the backend.
User response: User response:
Verify that the name of the keystore file in RKM.conf Ensure that the correct passphrase is used for the
is correct. Verify that the keystore file can be read on backend in RKM.conf.
each node.
6027-3709 [E] Error encountered when parsing
6027-3704 [E] The keystore file line lineNumber: expected a new
'keystoreFileName' has an invalid RKM backend stanza.
format.
Explanation:
Explanation: An error was encountered when parsing a line
The specified keystore file has an invalid format. in RKM.conf. Parsing of the previous backend is
complete, and the stanza for the next backend is
User response:
expected.
Verify that the format of the keystore file is correct.
User response:
6027-3705 [E] Incorrect FEK length after
Correct the syntax in RKM.conf.
unwrapping; the operation was
cancelled. 6027-3710 [E] Error encountered when parsing
line lineNumber: invalid key
Explanation:
'keyIdentifier'.
When opening an existing encrypted file, the size of
the FEK that was unwrapped did not correspond to the Explanation:
one recorded in the file's extended attributes. Either An error was encountered when parsing a line in
the cryptographic extended attributes were damaged, RKM.conf.
or the master key(s) used to unwrap the FEK have
User response:
changed.
Specify a well-formed stanza in RKM.conf.
User response:
6027-3711 [E] Error encountered when parsing
Check for other symptoms of data corruption, and
line lineNumber: invalid key-value
verify that the configuration of the key server has not
pair.
changed.
Explanation:
6027-3706 [E] The crypto library with FIPS
An error was encountered when parsing a line in
support is not available for this
RKM.conf: an invalid key-value pair was found.
architecture. Disable FIPS mode
and reattempt the operation. User response:
Correct the specification of the RKM backend in
Explanation:
RKM.conf.
GPFS is operating in FIPS mode, but the initialization
of the cryptographic library failed because FIPS mode 6027-3712 [E] Error encountered when parsing
is not yet supported on this architecture. line lineNumber: incomplete RKM
backend stanza 'backend'.
User response:
Disable FIPS mode and attempt the operation again. Explanation:
An error was encountered when parsing a line in
6027-3707 [E] The crypto library could not be
RKM.conf. The specification of the backend stanza
initialized in FIPS mode. Ensure
was incomplete.
that the crypto library package
was correctly installed. User response:
Correct the specification of the RKM backend in
Explanation:
RKM.conf.



6027-3713 [E] An error was encountered when 6027-3717 [E] Key 'keyID:rkmID' could not be
parsing line lineNumber: duplicate fetched. The RKM is in quarantine
key 'key'. after experiencing a fatal error.
Explanation: Explanation:
A duplicate keyword was found in RKM.conf. GPFS has quarantined the remote key management
(RKM) server and will refrain from initiating further
User response:
connections to it for a limited amount of time.
Eliminate duplicate entries in the backend
specification. User response:
Examine the error messages that precede this
6027-3714 [E] Incorrect permissions for the /var/
message to determine the cause of the quarantine.
mmfs/etc/RKM.conf configuration
file on node nodeName: the file 6027-3718 [E] Key 'keyID:rkmID' could not be
must be owned by the root user fetched. Invalid request.
and be in the root group, must be
Explanation:
a regular file and be readable and
The key could not be fetched because the remote key
writable by the owner only.
management (RKM) server reported that the request
Explanation: was invalid.
The permissions for the /var/mmfs/etc/RKM.conf
User response:
configuration file are incorrect. The file must be owned
Ensure that the RKM server trusts the client certificate
by the root user, must be in the root group, must be a
that was used for this request. If this does not resolve
regular file, and must be readable and writeable by the
the issue, contact the IBM Support Center.
owner only.
6027-3719 [W] Wrapping parameter string
User response:
'oldWrappingParameter' is not
Fix the permissions on the file and retry the operation.
safe and will be replaced with
6027-3715 [E] Error encountered when parsing 'newWrappingParameter'.
line lineNumber: RKM ID 'RKMID'
Explanation:
is too long, it cannot exceed length
The wrapping parameter specified by the policy should
characters.
no longer be used since it may cause data corruption
Explanation: or weaken the security of the system. For this reason,
The RKMID chosen at the specified line of /var/ the wrapping parameter specified in the message will
mmfs/etc/RKM.conf contains too many characters. be used instead.
User response: User response:
Choose a shorter string for the RKMID. Change the policy file and replace the specified
wrapping parameter with a more secure one. Consult
6027-3716 [E] Key 'keyID:rkmID' could not be
the IBM Storage Scale: Administration Guide for a list
fetched. The TLS handshake could
of supported wrapping parameters.
not be completed successfully.
6027-3720 [E] binaryName error: Invalid
Explanation:
command type 'command'.
The specified key could not be retrieved from the
server because the TLS handshake did not complete Explanation:
successfully. The command supplied to the specified binary is
invalid.
User response:
Ensure that the configurations of GPFS and the remote User response:
key management (RKM) server are compatible when Specify a valid command. Refer to the documentation
it comes to the version of the TLS protocol used for a list of supported commands.
upon key retrieval (GPFS uses the nistCompliance
6027-3721 [E] binaryName error: Invalid
configuration variable to control that). In particular,
arguments.
if nistCompliance=SP800-131A is set in GPFS,
ensure that the TLS v1.2 protocol is enabled in the Explanation:
RKM server. If this does not resolve the issue, contact The arguments supplied to the specified binary are
the IBM Support Center. invalid.
User response:

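Message 6027-3714 spells out the ownership and mode that /var/mmfs/etc/RKM.conf must have. On the affected node this typically translates to:

   chown root:root /var/mmfs/etc/RKM.conf
   chmod 600 /var/mmfs/etc/RKM.conf
   ls -l /var/mmfs/etc/RKM.conf    # should show -rw------- owned by root, group root

The same owner-only, non-executable requirement applies to the keystore and password files flagged by the related messages.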


Supply valid arguments. Refer to the documentation User response:
for a list of valid arguments. Ensure the specified file is not granted access to non
root. Explanation: The specified file allows access from
6027-3722 [E] An error was encountered
a non-root user, or has execute permission, which is
while processing file 'fileName':
not allowed.
errorString
6027-3729 [E] Key 'keyID:rkmID' could not be
Explanation:
fetched. The SSL connection
An error was encountered while processing the
cannot be initialized.
specified configuration file.
Explanation:
User response:
The specified key could not be retrieved from the
Examine the error message and correct the
server, because the SSL connection with the server
corresponding conditions.
cannot be initialized. Key server daemon may be
6027-3723 [E] Incorrect permissions for the unresponsive.
configuration file fileName on node
User response:
nodeName.
Check connectivity to the server. Check credentials to
Explanation: access the server. Perform problem determination on
The permissions for the specified configuration file are key server daemon. Contact the IBM Support Center.
incorrect. The file must be owned by the root user,
6027-3730 [E] Certificate with label
must be in the root group, must be a regular file, and
'clientCertLabel' for backend
must be readable and writeable by the owner only.
'backend' has expired.
User response:
Explanation:
Fix the permissions on the file and retry the operation.
The certificate identified by the specified label for the
6027-3726 [E] Key 'keyID:rkmID' could not be backend has expired.
fetched. Bad certificate.
User response:
Explanation: Create new client credentials.
The key could not be fetched from the remote key
6027-3731 [W] The client certificate for key server
management (RKM) server because of a problem with
host (port port) will expire at
the validation of the certificate.
expirationDate.
User response:
Explanation:
Verify the steps used to generate the server and
The client certificate used to communicate with the
client certificates. Check whether the NIST settings
key server identified by host and the port expires soon.
are correct on the server. If this does not resolve the
issue, contact the IBM Support Center. User response:
Refer to the Encryption chapter in the IBM Storage
6027-3727 [E] Key 'keyID:rkmID' could not be
Scale documentation for guidance on how to create
fetched. Invalid tenantName.
new client credentials and install them at the key
Explanation: server.
The key could not be fetched from the remote key
6027-3732 [W] The server certificate for key
management (RKM) server because the tenantName
server host (port port) will expire
specified in the RKM.conf file stanza was invalid.
at expirationDate.
User response:
Explanation:
Verify that the tenantName specified in the RKM.conf
The server certificate used to communicate between
file stanza is valid, and corresponds to an existing
IBM Storage Scale and the key server identified by
Device Group in the RKM server.
host and the port expires soon.
6027-3728 [E] The keyStore permissions are
User response:
incorrect for fileName. Access
Refer to the Encryption chapter in the IBM Storage
should be only granted to root, and
Scale documentation for guidance on how to create
no execute permission is allowed
new server credentials and install them at the key
for the file.
server and in IBM Storage Scale.
Explanation:
6027-3733 [W] The client certificate for key server
The specified file allows access from a non-root user,
host (port port) is invalid and
or has execute permission, which is not allowed.

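The certificate validity warnings above (6027-3730 through 6027-3734) can be confirmed by reading the certificate dates directly; the path below is only an example location for the client credentials:

   openssl x509 -in /var/mmfs/etc/RKMcerts/client.pem -noout -dates

Renew the client or server credentials as described in the Encryption chapter before the expiration date is reached.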


6027-3734 [W] The server certificate for key server host (port port) is invalid and may have expired (validity period: validityPeriod).
Explanation:
The server certificate used to communicate between IBM Storage Scale and the key server identified by host and port has expired.
User response:
Refer to the Encryption chapter in the IBM Storage Scale documentation or key server documentation for guidance on how to create new server credentials and install them at the key server and in IBM Storage Scale.

6027-3735 [E] Key 'keyID:rkmID' could not be fetched. The response could not be read from the RKM.
Explanation:
The key could not be fetched from the remote key management (RKM) server because after the connection was established, no response was received.
User response:
Verify that the RKM server is configured correctly and is running properly. Verify that the RKM.conf stanza is pointing to the correct RKM server. Verify that the network is operating normally.

6027-3736 [E] The keystore password file permissions are incorrect for name1. Access should be only granted to root, and no execute permission is allowed for the file.
Explanation:
The specified file allows access from a non-root user, or has execute permission, which is not allowed.
User response:
Ensure the specified file is not granted access to non-root users.

6027-3737 [E] Detected inconsistent key material for key keyID:rkmID: cached (source: 'cachedKeySource', len: cachedKeyLength, mat: 'cachedKeyMaterial') and remote (source: 'remoteKeySource', len: remoteKeyLength, mat: 'remoteKeyMaterial') keys differ.
Explanation:
The remote key server reported different key material from the one that GPFS was caching.
User response:
Verify that the IBM Storage Scale cluster is pointed at the correct RKM server(s). If the RKM is deployed as an HA cluster, verify that the replication is set up correctly, and that the same key material is served from each RKM node in the cluster. Do not restart GPFS on this node until the source of the key material inconsistency is located and fixed. Otherwise a potentially incorrect Master Encryption Key (MEK) may be used for the existing and new files.

6027-3738 [E] Detected inconsistent key material for key keyID:rkmID: cached (source: 'cachedKeySource', len: cachedKeyLength) and remote (source: 'remoteKeySource', len: remoteKeyLength) keys differ.
Explanation:
The remote key server reported different key material from the one that GPFS was caching.
User response:
Verify that the IBM Storage Scale cluster is pointed at the correct RKM server(s). If the RKM is deployed as an HA cluster, verify that the replication is set up correctly, and that the same key material is served from each RKM node in the cluster. Do not restart GPFS on this node until the source of the key material inconsistency is located and fixed. Otherwise a potentially incorrect Master Encryption Key (MEK) may be used for the existing and new files.

6027-3739 [E] 'kpEndpoint' is not a valid Key Protect endpoint URL.
Explanation:
No valid Key Protect endpoint has been specified in the RKM.conf stanza.
User response:
Specify a Key Protect endpoint in the RKM.conf stanza and reload the RKM.conf file by restarting mmfsd, or re-applying a policy.

6027-3740 [E] Error adding entry to Key Protect MEK repository: kpMekRepoEntry
Explanation:
The MEK entry could not be added to the Key Protect MEK repository.
User response:
Check the MEK repository file format and supported versions.
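Several of the preceding messages (for example, 6027-3727 and 6027-3735) refer to the RKM.conf stanza that describes a remote key management server. The following is only a rough sketch of the general shape of such a stanza; every value is a placeholder, and the exact set of keywords for your key server type is documented in the Encryption chapter:

   RKM_1 {
     type = ISKLM
     kmipServerUri = tls://keyserver.example.com:5696
     keyStore = /var/mmfs/etc/RKMcerts/keystore.p12
     passphrase = keystorePassword
     clientCertLabel = client1
     tenantName = GPFS_Tenant1
   }

The tenantName must correspond to an existing Device Group on the key server, as described for 6027-3727.
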


6027-3741 [E] The Key Protect MEK repository file permissions are incorrect for mekRepoFilename. Access should be only granted to root, and no execute permission is allowed for the file.
Explanation:
The specified file allows access from a non-root user, or has execute permission, which is not allowed.
User response:
Ensure the specified file is not granted access to non-root users.

6027-3742 [E] Error in Key Protect request: 'errorMsg'.
Explanation:
The request to Key Protect failed.
User response:
Refer to the error message for further details.

6027-3743 [E] Key 'keyID' was not found in Key Protect MEK repository for RKM ID 'rkmID'.
Explanation:
The MEK key could not be found in the Key Protect MEK repository.
User response:
Verify that the UUID specified in the encryption policy matches the UUID of an MEK stored in the Key Protect MEK repository.

6027-3744 [E] The API key could not be found in IAM for RKM ID 'rkmID'.
Explanation:
The API key specified for the RKM was not found in the IBM Cloud IAM.
User response:
Verify that the API key specified for the RKM is valid.

6027-3745 [E] The Key Protect instance ID is invalid for RKM ID 'rkmID'.
Explanation:
No Key Protect instance exists with the ID specified for the RKM.
User response:
Verify that a valid Key Protect instance ID is specified.

6027-3746 [E] The Key Protect token has expired for key 'keyID:rkmID'.
Explanation:
The Key Protect token has expired for the specified key and RKM.
User response:
The token should be automatically renewed once it expires. In this case it was not. Verify that the IAM endpoint is correctly configured in the RKM.conf stanza, and that the IAM endpoint is reachable.

6027-3747 [E] The Key Protect resource for key 'keyID:rkmID' could not be found (404).
Explanation:
The Key Protect resource (either IAM, Key Protect, or the key itself) of the specified key and RKM could not be found.
User response:
Verify that the IAM endpoint and the Key Protect endpoints are correctly configured in the RKM.conf stanza. Also verify that a key with the specified ID exists in the Key Protect instance, and that it is accessible for the user.

6027-3748 [E] The authentication data or key is invalid, or the entity-body is missing a required field, when trying to unwrap key 'keyID:rkmID'.
Explanation:
The Key Protect authentication data or key is invalid.
User response:
Verify that the API key configured for the Key Protect RKM instance is correct, and that the user has access to the Key Protect instance.

6027-3749 [E] The access token is invalid or does not have the necessary permissions to unwrap the key 'keyID:rkmID'.
Explanation:
The access token is invalid or does not have the necessary permissions to unwrap the key.
User response:
Verify that the API key configured for the Key Protect RKM instance is correct, and that the user has access to the Key Protect instance.

6027-3900 Invalid flag 'flagName' in the criteria file.
Explanation:
An invalid flag was found in the criteria file.
User response:
None.

6027-3901 Failed to receive inode list: listName.
Explanation:
A failure occurred while receiving an inode list.
User response:
None.
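Message 6027-3743 ties the failure to the key UUID named in the encryption policy. The following is only a rough sketch of how a policy names a key; the rule names, the key UUID, and the RKM ID are all placeholders, and the authoritative rule syntax is described in the Encryption chapter:

   RULE 'EncRule1' ENCRYPTION 'E1' IS ALGO 'DEFAULTNISTSP800131A' KEYS('KeyUUID:RKM_1')
   RULE 'ApplyEnc1' SET ENCRYPTION 'E1' WHERE NAME LIKE '%'

The value inside KEYS() must match the UUID of an MEK that actually exists in the Key Protect MEK repository for the named RKM ID.
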


6027-3902 Check file 'fileName' on fileSystem for inodes that were found matching the criteria.
Explanation:
The named file contains the inodes generated by parallel inode traversal (PIT) with interesting flags; for example, dataUpdateMiss or BROKEN.
User response:
None.

6027-3903 [W] quotaType quota is disabled or quota file is invalid.
Explanation:
The corresponding quota type is disabled or invalid, and cannot be copied.
User response:
Verify that the corresponding quota type is enabled.

6027-3904 [W] quotaType quota file is not a metadata file. File was not copied.
Explanation:
The quota file is not a metadata file, and it cannot be copied in this way.
User response:
Copy quota files directly.

6027-3905 [E] Specified directory does not exist or is invalid.
Explanation:
The specified directory does not exist or is invalid.
User response:
Check the spelling or validity of the directory.

6027-3906 [W] backupQuotaFile already exists.
Explanation:
The destination file for a metadata quota file backup already exists.
User response:
Move or delete the specified file and retry.

6027-3907 [E] No other quorum node found during cluster probe.
Explanation:
The node could not renew its disk lease and there was no other quorum node available to contact.
User response:
Determine whether there was a network outage, and also ensure the cluster is configured with enough quorum nodes. The node will attempt to rejoin the cluster.

6027-3908 Check file 'fileName' on fileSystem for inodes with broken disk addresses or failures.
Explanation:
The named file contains the inodes generated by parallel inode traversal (PIT) with interesting flags; for example, dataUpdateMiss or BROKEN.
User response:
None.

6027-3909 The file (backupQuotaFile) is a quota file in fileSystem already.
Explanation:
The file is a quota file already. An incorrect file name might have been specified.
User response:
None.

6027-3910 [I] Delay number seconds for safe recovery.
Explanation:
When disk lease is in use, wait for the existing lease to expire before performing log and token manager recovery.
User response:
None.

6027-3911 Error reading message from the file system daemon: errorString : The system ran out of memory buffers or memory to expand the memory buffer pool.
Explanation:
The system ran out of memory buffers or memory to expand the memory buffer pool. This prevented the client from receiving a message from the file system daemon.
User response:
Try again later.

6027-3912 [E] File fileName cannot run with error errorCode: errorString.
Explanation:
The named shell script cannot run.
User response:
Verify that the file exists and that the access permissions are correct.

6027-3913 Attention: disk diskName is a 4K native dataOnly disk and it is used in a non-4K aligned file system. Its usage is not allowed to change from dataOnly.
Explanation:
An attempt was made through the mmchdisk command to change the usage of a 4K native disk in a non-4K aligned file system from dataOnly to something else.
User response:
None.


6027-3914 [E] Current file system version does not support compression.
Explanation:
File system version is not recent enough for file compression support.
User response:
Upgrade the file system to the latest version, then retry the command.

6027-3915 Invalid file system name provided: 'FileSystemName'.
Explanation:
The specified file system name contains invalid characters.
User response:
Specify an existing file system name or one which only contains valid characters.

6027-3916 [E] fileSystemName is a clone of fileSystemName, which is mounted already.
Explanation:
A cloned file system is already mounted internally or externally with the same stripe group ID. The mount will be rejected.
User response:
Unmount the cloned file system and remount.

6027-3917 [E] The file fileName does not exist in the root directory of fileSystemName.
Explanation:
The backup file for quota does not exist in the root directory.
User response:
Check the file name and root directory and rerun the command after correcting the error.

6027-3918 [N] Disk lease period expired number seconds ago in cluster clusterName. Attempting to reacquire the lease.
Explanation:
The disk lease period expired, which will prevent the local node from being able to perform disk I/O. May be caused by a temporary communication outage.
User response:
If message is repeated then investigate the communication outage.

6027-3919 [E] No attribute found.
Explanation:
The attribute does not exist.
User response:
None.

6027-3920 [E] Cannot find an available quorum node that would be able to successfully run Expel command.
Explanation:
Expel command needs to be run on quorum node but cannot find any available quorum node that would be able to successfully run the Expel command. All quorum nodes are either down or being expelled.
User response:
None.

6027-3921 Partition `partitionName' is created and policy broadcasted to all nodes.
Explanation:
A partition is created and policy broadcasted to all nodes.
User response:
None.

6027-3922 Partition `partitionName' is deleted and policy broadcasted to all nodes.
Explanation:
A partition is deleted and policy broadcasted to all nodes.
User response:
None.

6027-3923 Partition `partitionName' does not exist for the file system fileSystemName.
Explanation:
Given policy partition does not exist for the file system.
User response:
Verify the partition name and try again.

6027-3924 Null partition: partitionName
Explanation:
Null partition.
User response:
None.

6027-3925 No partitions defined.
Explanation:
No partitions defined.
User response:
None.
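For version-related failures such as 6027-3914, you can check the current file system format and, if necessary, upgrade it. A minimal sketch, with gpfs1 as a placeholder device name:

   mmlsfs gpfs1 -V
   mmchfs gpfs1 -V full

Note that upgrading the format with mmchfs -V full is not reversible and can prevent nodes running older releases from mounting the file system, so review the upgrade considerations before running it.
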


6027-3926 Empty policy file.
Explanation:
Empty policy file.
User response:
None.

6027-3927 Failed to read policy partition 'partitionName' for file system 'fileSystemName'.
Explanation:
Could not read the given policy partition for the file system.
User response:
Reissue the command. If the problem persists, re-install the subject policy partition.

6027-3928 Failed to list policy partitions for file system 'fileSystemName'.
Explanation:
Could not list the policy partitions for the file system.
User response:
Reissue the command. If the problem persists, contact IBM Support.

6027-3929 Policy file for file system fileSystemName contains numOfPartitions partitions.
Explanation:
Number of partitions in the policy file of the file system.
User response:
None.

6027-3930 No policy partitions defined for file system fileSystemName.
Explanation:
Policy partitions are not defined for the file system.
User response:
None.

6027-3931 Policy partitions are not enabled for file system 'fileSystemName'.
Explanation:
The cited file system must be upgraded to use policy partitions.
User response:
Upgrade the file system by using the mmchfs -V command.

6027-3932 [E] GPFS daemon has an incompatible version.
Explanation:
The GPFS daemon has an incompatible version.
User response:
None.

6027-3933 [E] File system is in maintenance mode.
Explanation:
The file system is in maintenance mode and not allowed to be opened.
User response:
Try the command again after turning off the maintenance mode for the file system.

6027-3934 [E] Node nodeName has been deleted from the cluster, but it has been detected as still up. Deletion of that node may not have completed successfully. Manually bring GPFS down on that node.
Explanation:
Deletion of that node may not have completed successfully.
User response:
Manually bring GPFS down on that node.

6027-3935 [E] Node nodeName has been moved into another cluster, but it has been detected as still up. Deletion of that node from its original cluster may not have completed successfully. Manually bring GPFS down on that node and remove the node from its original cluster.
Explanation:
A node has been moved into another cluster but still detected up on its original cluster.
User response:
Manually bring GPFS down on that node and remove the node from its original cluster.

6027-3936 [E] Error encountered during Kafka authentication: errorCode error message
Explanation:
This filesystem ran into an error while retrieving its Kafka authentication info from its home cluster CCR.
User response:
Try the command again after checking the file system home cluster CCR status and Kafka message queue status. If the problem persists, contact IBM Service.

6027-3937 [I] This node got elected. Sequence: SequenceNumber Took: TimeTook seconds.
Explanation:
Local node got elected in the CCR election. This node will become the cluster manager.
User response:
None. Informational message only.
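Messages 6027-3934 and 6027-3935 ask you to bring GPFS down on the affected node and, if the node was moved between clusters, to remove it from its original cluster. A minimal sketch, with node3 as a placeholder node name (run the second command from the original cluster):

   mmshutdown -N node3
   mmdelnode -N node3
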


6027-3938 fileSystem: During mmcheckquota processing, number node(s) failed and the results may not be accurate. To obtain a more accurate result, repeat mmcheckquota at a later time.
Explanation:
Nodes failed while an online quota check was running.
User response:
Reissue the mmcheckquota command.

6027-3939 Failed to get a response while probing quorum node name1: error errorCode
Explanation:
The node failed to connect to a quorum node while trying to join a cluster. That is, no response indicating an established connection was received from the GPFS daemon on the quorum node.
User response:
Make sure GPFS is up and running on the quorum nodes of the cluster. Diagnose and correct any network issues.

6027-3940 Failed to join cluster name1. Make sure enough quorum nodes are up and there are no network issues.
Explanation:
The node failed to join the cluster after trying for a while. Enough quorum nodes need to be up and the network needs to be reliable to allow the node to join.
User response:
Make sure the quorum nodes are up and GPFS is running on the quorum nodes in the cluster. Diagnose and correct any network issues. If all else fails, consider restarting GPFS on the node which failed to join the cluster.

6027-3941 [W] Excessive timer drift issue keeps getting detected (number nodes; above threshold of number). Resigning the cluster manager role (amIGroupLeader number, resignCCMgrRunning number).
Explanation:
GPFS has detected an unusually large difference in the rate of clock ticks (as returned by the times() system call) between the cluster node and the cc-mgr node, and the rate difference between the cc-mgr node and different cluster nodes still remains.
User response:
Check the error log for hardware or device driver problems that might keep causing timer interrupts to be lost or the tick rate to change dramatically.

6027-3942 [W] Attention: Excessive timer drift between node and node (number over number sec) lastLeaseRequested number lastLeaseProcessed number leaseAge number now number leaseDuration number ticksPerSecond number.
Explanation:
GPFS has detected an unusually large difference in the rate of clock ticks (as returned by the times() system call) between the cluster node and the cc-mgr node.
User response:
Check the error log for hardware or device driver problems that might keep causing timer interrupts to be lost or the tick rate to change dramatically.

6027-3943 [W] Excessive timer drift with node now number (number number) lastLeaseReplyReceived number (number) lastLeaseRequestSent number (number) lastLeaseObtained number (number) endOfLease number (number) leaseAge number ticksPerSecond number.
Explanation:
GPFS has detected an unusually large difference in the rate of clock ticks (as returned by the times() system call) between the cluster node and the cc-mgr node.
User response:
Check the error log for hardware or device driver problems that might keep causing timer interrupts to be lost or the tick rate to change dramatically.

6027-3945 [D] Quota report message size is too big: messageSize (maximum maximumMessageSize, new maximum newMaximumMessageSize), communication version receiverCommunicationVersion, error error
Explanation:
The quota report message size is bigger than the supported message RPC size.
User response:
If the message size is smaller than the new maximum message size, run the mmrepquota command from a client that supports a large quota reply message. Reduce the number of quota elements in the file system, if possible. Use mmcheckquota to clean up the unused quota elements.
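For the quota messages above (6027-3938 and 6027-3945), rerunning the online check and producing a fresh report is usually sufficient. A minimal sketch, with gpfs1 as a placeholder device name:

   mmcheckquota gpfs1
   mmrepquota -u -g gpfs1
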


6027-3946 Failed to read tscCmdPortRange from /var/mmfs/gen/mmfs.cfg. Using the default.
Explanation:
Failed to read the GPFS configuration file, /var/mmfs/gen/mmfs.cfg. The reason for such failure might be that a config update operation was either in progress or started while reading the file.
User response:
None.

6027-3947 [E] Operation failed because one or more pdisks had volatile write caching enabled.
Explanation:
The IBM Storage Scale Native RAID does not support drives with volatile write caching.
User response:
Make sure that volatile write caching is disabled on all disks.

6027-3948 Unable to start command on the fileSystem because conflicting program name is running. Waiting until it completes or moves to the next phase, which may allow the current command to start.
Explanation:
A program detected that it cannot start because a conflicting program is running. The program will automatically start once the conflicting program has completed its work as long as there are no other conflicting programs running at that time.
User response:
None.

6027-3949 [E] Operation failed because it is taking too long to send audit/watch events to sink(s).
Explanation:
The event log was not able to consume or acknowledge all events produced by WF/FAL.
User response:
Make sure WF/FAL producers are healthy.

6027-3950 [E] fileSystem: The current file system version does not support perfileset quota check.
Explanation:
The current file system version does not support perfileset quota check.
User response:
Upgrade the file system by using the mmchfs -V command.

6027-3951 [E] Operation failed because one or more pdisks was in an unavailable state.
Explanation:
IBM Storage Scale Native RAID failed an I/O request internally due to a temporary state conflict, and the operation will be retried automatically.
User response:
None.

6027-3952 [E] An error occurred while executing command for fileSystem: Cannot set file limits over 2^31 on this file system. Run mmchfs -V full, first.
Explanation:
The current file system version does not support setting quota file limits over 2^31.
User response:
Use mmchfs -V to change the file system to the latest format.

6027-3953 [E] Target name1 does not support client command RPCs
Explanation:
The target node specified by the IP address does not support client command RPCs.
User response:
Set the tscCmdAllowRemoteConnections parameter to "yes" or update all the clusters to at least version 5.1.3.0.

6027-3954 [E] This node has been expelled from the cluster.
Explanation:
The node has been expelled from a cluster and is not allowed to join a cluster until this state has been changed.
User response:
Remove the node from the list of expelled nodes. See the mmexpelnode command with the -r or --reset option.

6027-3955 [E] The current system version does not support concurrent perfileset quota check.
Explanation:
The current file system version does not support concurrent perfileset quota check.
User response:
Use mmchfs -V to change the file system to the latest format.
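For 6027-3953 and 6027-3954, the corrective actions map to single commands. A minimal sketch, assuming node5 is a placeholder for the node that was expelled:

   mmchconfig tscCmdAllowRemoteConnections=yes
   mmexpelnode -r -N node5
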


6027-3956 [E] File system fileSystem: Concurrent perfileset quota check instances exceed the maximum-allowed, maximum is number.
Explanation:
Maximum support of concurrent perfileset quota check instances reached.
User response:
Rerun the command after some perfileset quota check instance completes.

6027-4000 [I] descriptorType descriptor on this NSD can be updated by running the following command from the node physically connected to NSD nsdName:
Explanation:
This message is displayed when a descriptor validation thread finds a valid NSD, or disk, or stripe group descriptor but with a different ID. This can happen if a device is reused for another NSD.
User response:
None. After this message, another message is displayed with a command to fix the problem.

6027-4001 [I] 'mmfsadm writeDesc <device> descriptorType descriptorId:descriptorId nsdFormatVersion pdiskStatus', where device is the device name of that NSD.
Explanation:
This message displays the command that must run to fix the NSD or disk descriptor on that device. The deviceName must be supplied by the system administrator or obtained from the mmlsnsd -m command. The descriptorId is a hexadecimal value.
User response:
Run the command that is displayed on that NSD server node and replace deviceName with the device name of that NSD.

6027-4002 [I] Before running this command, check both NSDs. You might have to delete one of the NSDs.
Explanation:
Informational message.
User response:
The system administrator should decide which NSD to keep before running the command to fix it. If you want to keep the NSD found on disk, then you do not run the command. Instead, delete the other NSD found in cache (the NSD ID shown in the command).

6027-4003 [E] The on-disk descriptorType descriptor of nsdName descriptorIdName descriptorId:descriptorId is not valid because of bad corruptionType:
Explanation:
The descriptor validation thread found an on-disk descriptor that is corrupted. GPFS will automatically fix it.
User response:
None.

6027-4004 [D] On-disk NSD descriptor: nsdId nsdId nsdMagic nsdMagic nsdFormatVersion nsdFormatVersion on disk nsdChecksum nsdChecksum calculated checksum calculatedChecksum nsdDescSize nsdDescSize firstPaxosSector firstPaxosSector nPaxosSectors nPaxosSectors nsdIsPdisk nsdIsPdisk
Explanation:
Description of an on-disk NSD descriptor.
User response:
None.

6027-4005 [D] Local copy of NSD descriptor: nsdId nsdId nsdMagic nsdMagic formatVersion formatVersion nsdDescSize nsdDescSize firstPaxosSector firstPaxosSector nPaxosSectors nPaxosSectors
Explanation:
Description of the cached NSD descriptor.
User response:
None.

6027-4006 [I] Writing NSD descriptor of nsdName with local copy: nsdId nsdId nsdFormatVersion formatVersion firstPaxosSector firstPaxosSector nPaxosSectors nPaxosSectors nsdDescSize nsdDescSize nsdIsPdisk nsdIsPdisk nsdChecksum nsdChecksum
Explanation:
Description of the NSD descriptor that was written.
User response:
None.
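Messages 6027-4000 through 6027-4002 describe the repair flow for a stale descriptor. The following is only a sketch of that flow; nsd12, /dev/sdc, and all descriptor values are placeholders that must be replaced with the exact values printed in the 6027-4001 message on your system:

   mmlsnsd -m | grep nsd12
   mmfsadm writeDesc /dev/sdc nsd 0A1B2C3D:4E5F6071 2 0
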


6027-4007 errorType descriptor on descriptorType nsdId nsdId:nsdId error error
Explanation:
This message is displayed after reading and writing NSD, disk and stripe group descriptors.
User response:
None.

6027-4008 [E] On-disk descriptorType descriptor of nsdName is valid but has a different UID: uid descriptorId:descriptorId on-disk uid descriptorId:descriptorId nsdId nsdId:nsdId
Explanation:
While verifying an on-disk descriptor, a valid descriptor was found but with a different ID. This can happen if a device is reused for another NSD with the mmcrnsd -v no command.
User response:
After this message there are more messages displayed that describe the actions to follow.

6027-4009 [E] On-disk NSD descriptor of nsdName is valid but has a different ID. ID in cache is cachedId and ID on-disk is ondiskId
Explanation:
While verifying an on-disk NSD descriptor, a valid descriptor was found but with a different ID. This can happen if a device is reused for another NSD with the mmcrnsd -v no command.
User response:
After this message, there are more messages displayed that describe the actions to follow.

6027-4010 [I] This corruption can happen if the device is reused by another NSD with the -v option and a file system is created with that reused NSD.
Explanation:
Description of a corruption that can happen when an NSD is reused.
User response:
Verify that the NSD was not reused to create another NSD with the -v option and that the NSD was not used for another file system.

6027-4011 [D] On-disk disk descriptor: uid descriptorID:descriptorID magic descMagic formatVersion formatVersion descSize descSize checksum on disk diskChecksum calculated checksum calculatedChecksum firstSGDescSector firstSGDescSector nSGDescSectors nSGDescSectors lastUpdateTime lastUpdateTime
Explanation:
Description of the on-disk disk descriptor.
User response:
None.

6027-4012 [D] Local copy of disk descriptor: uid descriptorID:descriptorID firstSGDescSector firstSGDescSector nSGDescSectors nSGDescSectors
Explanation:
Description of the cached disk descriptor.
User response:
None.

6027-4013 [I] Writing disk descriptor of nsdName with local copy: uid descriptorID:descriptorID, magic magic, formatVersion formatVersion firstSGDescSector firstSGDescSector nSGDescSectors nSGDescSectors descSize descSize
Explanation:
Writing disk descriptor to disk with local information.
User response:
None.

6027-4014 [D] Local copy of StripeGroup descriptor: uid descriptorID:descriptorID curFmtVersion curFmtVersion configVersion configVersion
Explanation:
Description of the cached stripe group descriptor.
User response:
None.

6027-4015 [D] On-disk StripeGroup descriptor: uid sgUid:sgUid magic magic curFmtVersion curFmtVersion descSize descSize on-disk checksum diskChecksum calculated checksum calculatedChecksum configVersion configVersion lastUpdateTime lastUpdateTime
Explanation:
Description of the on-disk stripe group descriptor.
User response:
None.


6027-4016 [E] Data buffer checksum mismatch during write. File system fileSystem tag tag1 tag2 nBytes nBytes diskAddresses
Explanation:
GPFS detected a mismatch in the checksum of the data buffer content, which means the content of the data buffer was changing while a direct I/O write operation was in progress.
User response:
None.

6027-4017 [E] Current file system version does not support the initial disk status BeingAddedByGNR.
Explanation:
File system version must be upgraded to specify BeingAddedByGNR as the initial disk status.
User response:
Upgrade the file system version.

6027-4018 [E] Disk diskName is not an existing vdisk, but initial status BeingAddedByGNR is specified
Explanation:
When you specify the initial disk status BeingAddedByGNR, all disks that are being added must be existing NSDs of type vdisk.
User response:
Ensure that NSDs are of type vdisk and try again.

6027-4019 [D] On-disk StripeGroup descriptor: uid 0xsgUid:sgUid magic 0xmagic curFmtVersion curFmtVersion on-disk descSize descSize cached descSize descSize on-disk checksum 0xdiskChecksum calculated checksum 0xcalculatedChecksum configVersion configVersion lastUpdateTime lastUpdateTime
Explanation:
Description of the on-disk stripe group descriptor.
User response:
None.

6027-4020 [I] Before running this command, determine the correct disk to update the descriptor with: at the NSD server node, run mmfsadm test nsd wwn wwnName to find the local devname, then run the above command mmfsadm writeDesc on that NSD server node to update the NSD descriptor.
Explanation:
Informational message.
User response:
If wwn is available, the system administrator should first run the command mmfsadm test nsd wwn wwnName to retrieve the device with the matching wwn, and then update the descriptor on that device.

6027-4021 [I] Before running this command, determine the correct disk to update the descriptor with: It is possible that the device mapping could change after a reboot. At the NSD server, if the node has not been rebooted, run mmlsnsd -X | grep nsdName to determine the local devname, or if the NSD has not been recreated multiple times, tspreparedisk -b pvid_str -t dev_type to find the local devname, and then run mmfsadm writeDesc on that NSD server.
Explanation:
Informational message.
User response:
If wwn is not available, the system administrator should determine which NSD to update the descriptor with by using either the mmlsnsd -X or the tspreparedisk -b command.

6027-4022 [E] The on-disk descriptorType descriptor of nsdName wwn wwnName descriptorIdName 0xdescriptorId:descriptorId is not valid because of bad corruptionType:
Explanation:
The descriptor validation thread found an on-disk descriptor that is corrupted. GPFS will give instructions how to fix it.
User response:
None.

6027-4023 [E] On-disk descriptorType descriptor of nsdName wwn wwnName is valid but has a different UID: uid descriptorId:descriptorId on-disk uid descriptorId:descriptorId nsdId nsdId:nsdId
Explanation:
While verifying an on-disk descriptor, a valid descriptor was found but with a different ID. This can happen if a device is reused for another NSD with the mmcrnsd -v no command.
User response:
After this message there are more messages displayed that describe the actions to follow.
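Messages 6027-4020 and 6027-4021 spell out how to locate the correct device before rewriting the descriptor. A condensed sketch of that lookup only; the WWN and the NSD name are placeholders:

   mmfsadm test nsd wwn 0x5000c500a1b2c3d4
   mmlsnsd -X | grep nsd12
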


6027-4024 [E] On-disk NSD descriptor of nsdName wwn wwnName is valid but has a different ID. ID in cache is cachedId and ID on-disk is ondiskId
Explanation:
While verifying an on-disk NSD descriptor, a valid descriptor was found but with a different ID. This can happen if a device is reused for another NSD with the mmcrnsd -v no command.
User response:
After this message, there are more messages displayed that describe the actions to follow.

6027-4025 [I] The inode0 record on this NSD can be updated by running the following command from the node physically connected to NSD.
Explanation:
This message is displayed when a descriptor validation thread finds a valid stripe group descriptor but with a different ID; in such a case, the inode0 record is probably overwritten. This can happen if a device is reused for another NSD.
User response:
None. After this message, another message is displayed with a command to fix the problem.

6027-4100 [E] The Key Protect instance ID for key 'keyID:rkmID' is malformed or invalid.
Explanation:
The instance ID for the given key is invalid.
User response:
Verify that the instance ID specified in the RKM.conf stanza is correct, and that it specifies the instance ID, and not the file name containing the instance ID.

6027-4101 [E] The key used to wrap the key 'keyID:rkmID' is not in an active state.
Explanation:
The Key Protect root key used to wrap the specified key is not in an active state.
User response:
Verify that the Key Protect root key specified in the RKM.conf stanza is in the correct state.

6027-4102 [E] The key used to wrap the key 'keyID:rkmID' was previously deleted and is no longer available.
Explanation:
The Key Protect root key used to wrap the specified key has been deleted and is no longer available.
User response:
Verify that the correct Key Protect root key is specified in the RKM.conf stanza. If the root key is specified correctly, the files that were encrypted using that key are no longer accessible.

6027-4103 [E] The ciphertext provided for the unwrap operation for key 'keyID:rkmID' was wrapped by another key.
Explanation:
The Key Protect root key specified in the RKM.conf stanza was not used to wrap the specified key.
User response:
Verify that the correct Key Protect root key is specified in the RKM.conf stanza.

6027-4104 [E] Too many requests while trying to unwrap key 'keyID:rkmID'. Try again later.
Explanation:
Rate limits at Key Protect were hit when trying to unwrap the specified key.
User response:
Retry the file access operation, which will lead to the unwrap operation also being retried. If necessary, wait for a minute before retrying, so that the RKM is taken out of quarantine.

6027-4105 [E] The Key Protect service returned a currently unavailable error when trying to unwrap key 'keyID:rkmID'.
Explanation:
The Key Protect service is currently unavailable.
User response:
Retry the file access operation, which will lead to the unwrap operation also being retried. If the problem persists, enable GPFS traces and contact IBM Cloud support, specifying the Correlation ID of the failing request(s), found in the traces.

6027-4106 [E] The API key file permissions are incorrect for fileName. Access should be only granted to root, and no execute permission is allowed for the file.
Explanation:
The specified file allows access from a non-root user, or has execute permission, which is not allowed.
User response:
Ensure the specified file is not granted access to non-root users.


6027-4107 [E] No Key Protect instance ID specified.
Explanation:
No Key Protect instance ID is specified in the RKM.conf stanza.
User response:
Ensure the instance ID is specified in the RKM.conf stanza and reload the RKM.conf file by restarting mmfsd, or re-applying a policy.

6027-4108 [E] 'idEndpoint' is not a valid IAM endpoint URL.
Explanation:
No valid IAM endpoint has been specified in the RKM.conf stanza.
User response:
Specify an IAM endpoint ('idEndpoint') in the RKM.conf stanza and reload the RKM.conf file by restarting mmfsd, or re-applying a policy.

6027-4109 [W] The client certificate with label clientCertLabel for key server with RKM ID rkmID (host:port) will expire on expirationDate.
Explanation:
The client certificate that is identified by the specified label and RKM ID, and that is used to communicate with the key server identified by host and port, expires soon.
User response:
Refer to the Encryption chapter in the IBM Storage Scale documentation for information on how to create new client credentials and install them at the key server.

6027-4110 [I] Encryption key cache expiration time = encryptionKeyCacheExpiration (cache does not expire).
Explanation:
The encryption key cache was set to not expire (the encryptionKeyCacheExpiration parameter was set to 0).
User response:
Ensure that the key cache expiration time is set appropriately.

6027-4200 [E] Maximum number of retries reached

Explanation
The maximum number of retries to get a response from the quorum nodes has been reached due to a failure. Most of the IBM Storage Scale commands will not work because the IBM Storage Scale commands use the CCR. The following two sections describe the possible failure scenarios in detail, depending on the cluster configuration.

This section applies to all clusters:

Explanation: The failure occurs due to missing or corrupted files in the CCR committed directory /var/mmfs/ccr/committed/. The missing or corrupted files might be caused by a hard or cold power off or a crash of the affected quorum nodes. Files in the committed directory might be truncated to zero-length.

User response:
1. Verify any corrupted files in the committed directory by issuing mmccr check -Y -e on every available quorum node. If the command responds with a message like the following one, the directory contains corrupted files:
mmccr::0:1:::1:FC_COMMITTED_DIR:5:Files in committed directory missing or corrupted:1:7:WARNING:
2. Follow the instructions in the topic Recovery procedures for a broken cluster when no CCR backup is available in the IBM Storage Scale: Problem Determination Guide.

This section applies only to clusters that are configured with tiebreaker disks:

Explanation: Files committed to the CCR, like the mmsdrfs file, reside only on quorum nodes and not on tiebreaker disks. Using tiebreaker disks allows a cluster to remain available even if only one quorum node is active. If a file commit to the CCR happens when only one quorum node is active and the quorum node then fails, this error occurs on other nodes. Possible reasons for the failure:
1. The quorum nodes that hold the most recent version of the file are not up.
2. The IBM Storage Scale daemon (mmsdrserv or mmfsd) is not running on the quorum nodes that hold the most recent version of the file.
3. The IBM Storage Scale daemon (mmsdrserv or mmfsd) is not reachable on the quorum nodes that hold the most recent version of the file.

User response
The most recent file version can be on any quorum node that is not available.
1. Try to start up as many quorum nodes as possible.
2. Verify that either the mmsdrserv daemon (if IBM Storage Scale is down) or the mmfsd daemon (if IBM Storage Scale is active) is running on every quorum node (for example, on Linux by issuing the ps command).
3. Make sure the IBM Storage Scale daemons are reachable on all quorum nodes. To identify the problem, issue the mmhealth node show GPFS -v command or the mmnetverify command as described in the topic Analyze network problems with the mmnetverify command in the IBM Storage Scale: Problem Determination Guide.
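A minimal sketch of the first check described for 6027-4200, run on each quorum node that is still reachable:

   mmccr check -Y -e

An output line containing FC_COMMITTED_DIR together with WARNING, as in the example shown above, indicates missing or corrupted files in the committed directory.
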


6027-4201 [B] Version mismatch on conditional PUT.
Explanation:
This is an informational message only for the caller of the CCR.
User response:
If this error message is displayed, the cause might be a problem in the CCR caller. Please record the above information and contact the IBM Support Center.

6027-4202 [B] Version match on conditional GET.
Explanation:
This is an informational message only for the caller of the CCR.
User response:
If this error message is displayed, the cause might be a problem in the CCR caller. Please record the above information and contact the IBM Support Center.

6027-4203 [B] Invalid version on PUT.
Explanation:
This is an informational message only for the caller of the CCR.
User response:
If this error message is displayed, the cause might be a problem in the CCR caller. Please record the above information and contact the IBM Support Center.

6027-4204 [E] Not enough CCR quorum nodes available.

Explanation
A CCR request failed due to a loss of a majority of quorum nodes. To fulfill a CCR request the IBM Storage Scale daemons (mmsdrserv or mmfsd) must communicate with each other on a majority of quorum nodes. Most of the IBM Storage Scale commands will not work because the IBM Storage Scale commands use the CCR. Possible reasons for the failure are:
1. A majority of quorum nodes are down.
2. The IBM Storage Scale daemon (mmsdrserv or mmfsd) is not running on a majority of quorum nodes.
3. A communication problem might exist among a majority of quorum nodes.

User response
1. Make sure that a majority of quorum nodes are up.
2. Verify that either the mmsdrserv daemon (if IBM Storage Scale is down) or the mmfsd daemon (if IBM Storage Scale is active) is running on a majority of quorum nodes (for example, on Linux by issuing the ps command).
3. Make sure the IBM Storage Scale daemons can communicate with each other on a majority of quorum nodes. To identify the problem, issue the mmhealth node show GPFS -v command or the mmnetverify command as described in the topic Analyze network problems with the mmnetverify command in the IBM Storage Scale: Problem Determination Guide.

6027-4205 [E] The ccr.nodes file missing or empty
Explanation:
This file is needed by the CCR and the file is either missing or corrupted.
User response:
If this error is displayed but the cluster still has a quorum, the cluster can be recovered by issuing the mmsdrrestore -p command on the node on which the error occurred. If this error occurs on one or multiple nodes and the cluster does not have a quorum, follow the instructions in the topic Recovery procedures for a broken cluster when no CCR backup is available in the IBM Storage Scale: Problem Determination Guide.

6027-4206 [E] CCR is already initialized

Explanation
The CCR received an initialization request, but it has been initialized already. Possible reasons for the failure are:
1. The node on which the error occurred is a part of another cluster.
2. The node was part of another cluster and cleanup of it was not done properly when this node was removed from the old cluster.

User response
1. If the node is still a member of another cluster, remove the node from its current cluster before adding it to the new cluster.
2. Otherwise, record the above information and contact the IBM Support Center to clean up the node properly.
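For 6027-4205, when the cluster still has quorum, the affected node can be repaired in place. A minimal sketch, with goodNode as a placeholder for a healthy quorum node from which to pull the configuration, run on the node that reported the error:

   mmsdrrestore -p goodNode
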


6027-4207 [E] Unable to reach any quorum node. Check your firewall or network settings.

Explanation
The node on which this error occurred could not reach any of the quorum nodes. Most of IBM Storage Scale commands will fail on the node, because the commands depend on the CCR. Possible reasons for the failure are:
1. All quorum nodes are down.
2. The IBM Storage Scale daemon (mmsdrserv or mmfsd) is not running on any quorum node.
3. The IBM Storage Scale daemon (mmsdrserv or mmfsd) is not reachable on any quorum node.
4. The node on which the error occurred cannot communicate with any of the quorum nodes, due to network or firewall configuration issues.

User response
1. Make sure that at least a majority of quorum nodes are up.
2. Verify that either the mmsdrserv daemon (if IBM Storage Scale is down) or the mmfsd daemon (if IBM Storage Scale is active) is running on a majority of quorum nodes (for example on Linux by issuing the ps command).
3. Make sure the IBM Storage Scale daemons are reachable on a majority of quorum nodes. To identify the problem, issue the mmhealth node show GPFS -v command or the mmnetverify command as described in the topic Analyze network problems with the mmnetverify command in the IBM Storage Scale: Problem Determination Guide.
4. Make sure the failed node can communicate with at least one quorum node. To identify the problem, issue the mmhealth node show GPFS -v command or the mmnetverify command as described in the topic Analyze network problems with the mmnetverify command in the IBM Storage Scale: Problem Determination Guide.

6027-4208 [I] Starting a new clmgr election, as the current clmgr is expelled
Explanation:
This node is taking over as clmgr without challenge, as the current clmgr is being expelled.
User response:
None.

6027-4209 [E] Unable to contact current cluster manager

Explanation
The CCR is not able to contact the current cluster manager over the daemon network during a file commit, when the cluster is configured with tiebreaker disks. Files committed to the CCR, like the mmsdrfs file, reside only on quorum nodes and not on tiebreaker disks. In order to commit a new version of a file to the CCR, the cluster manager must be reachable over the daemon network, so that the file can be sent to the cluster manager.
The following are the possible reasons for the file commit failure:
1. A new cluster manager election process has started but it is not finished when the file is committed to the CCR.
2. The cluster manager has access to the tiebreaker disks but it is not reachable by other quorum nodes over the daemon network. In such cases, the cluster manager still responds to challenges written to the tiebreaker disks. However, the file commit fails until the cluster manager is reachable over the daemon network.

User response
1. Issue the mmlsmgr -c command to verify whether a cluster manager is elected.
2. Verify that either the mmsdrserv daemon or mmfsd daemon is running on the current cluster manager. For example, you can verify this on the Linux environment by issuing the ps command. If the mmsdrserv daemon is running, it indicates that IBM Storage Scale is down, and if the mmfsd daemon is running, it indicates that IBM Storage Scale is active on the cluster manager.
3. Ensure that the IBM Storage Scale daemon is reachable on the current cluster manager node. To identify the problem, issue the mmhealth node show GPFS -v command or the mmnetverify command as described in the topic Analyze network problems with the mmnetverify command in the IBM Storage Scale: Problem Determination Guide.

6027-4210 [E] Paxos state on tiebreaker disk diskName contains wrong cluster ID. Expected: clusterId is: clusterId

Explanation:
When the cluster is configured with tiebreaker disks, the CCR stores its Paxos state on a reserved area on these tiebreaker disks. The tiebreaker disks are also used by IBM Storage Scale to declare quorum if enough quorum nodes are not reachable. This message is returned by the CCR, if it found a Paxos state on a tiebreaker disk with an unexpected cluster identification number. This means that the Paxos state comes from another cluster. Possible reason for this failure: The affected tiebreaker disks are replicated or mirrored from another cluster, typically in an active-passive disaster recovery setup.

User response
By reconfiguring the affected tiebreaker disks, the wrong Paxos state will be overwritten with a correct state. Perform the following procedures:
1. Ensure that all quorum nodes are up and IBM Storage Scale reached quorum.
2. Reconfigure the cluster not to use the affected disks as tiebreaker disks, by using the "mmchconfig tiebreakerDisks" command.
3. Reconfigure the cluster again, now to use the affected disks as tiebreaker disks by using the "mmchconfig tiebreakerDisks" command.
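The reconfiguration steps for 6027-4210 can be expressed as two mmchconfig invocations; the NSD names below are placeholders for your actual tiebreaker disks:

   mmchconfig tiebreakerDisks=no
   mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"
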


6027-4211 [I] CCR leader election: lease overdue; must declare quorum loss before taking over.
Explanation:
The current cluster leader has failed and the local node has successfully challenged the old leader and been elected as the new leader by using tiebreaker disks. However, too much time has passed since the challenger's last successful lease renewal. Because of that, it is possible that the old leader has expelled the challenger and started log recovery (or will start log recovery) while the old leader still has a valid lease. As a result, data or metadata that the challenger has cached may become stale. To prevent file system corruption, the challenger must declare quorum loss to reset its state before it takes over.
User response:
None.

6027-4212 [I] CCR server suspended.
Explanation:
mmsdrserv running on the current node is in suspend mode so that most CCR services are not available from this quorum server. This is probably because of a recovery operation in progress.
User response:
None.

6027-4300 [W] RDMA hang break: Verbs timeout (verbsHungRdmaTimeout sec) is expired on connection index connIndex, cookie connCookie (nPosted nPosted).
Explanation:
Verbs request exceeds verbsHungRdmaTimeout threshold.
User response:
None. The RDMA hang break thread breaks the verbs connection and wakes up threads that are waiting for these hung RDMA requests. Threads retry these requests. The verbs connection will be reconnected.

6027-4301 [I] RDMA hang break: Wake up thread threadId (waiting waitSecond.waitHundredthSecond sec for verbsOp on index connIndex cookie connCookie).
Explanation:
The RDMA hang break thread wakes up a thread, which is waiting for a hung RDMA request.
User response:
None.

6027-4302 [E] RDMA fatal connection error: Breaking connection index connIndex cookie connCookie.
Explanation:
The RDMA subsystem is breaking and disabling a verbs connection for which a fatal error has been reported.
User response:
Check RDMA adapters for hardware or firmware errors. This verbs connection will not be reconnected.

6027-4303 [E] RDMA cluster release level must be 5.1.2 required to enable GPUDirect Storage (GDS).
Explanation:
The RDMA feature GPUDirect Storage (GDS) requires release 5.1.2 or later.
User response:
Make sure all nodes in the cluster run on version 5.1.2 or later, and the cluster release level (minReleaseLevel) is 5.1.2 or later.

6027-4304 [E] No valid GID provided by the nvidia_fs kernel module.
Explanation:
The nvidia_fs kernel module did not provide a valid GID to identify an RDMA port to perform GPU Direct Storage (GDS) I/O.
User response:
Contact support.
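For 6027-4303, after every node in the cluster has been upgraded to 5.1.2 or later, the cluster release level can be checked and raised. A minimal sketch:

   mmlsconfig minReleaseLevel
   mmchconfig release=LATEST
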


6027-4305 [E] Unable to find an RDMA port with GID GID for GPU Direct Storage I/O.
Explanation:
The RDMA port with the given GID cannot be found. This is required to perform GPU Direct Storage (GDS) I/O.
User response:
The IP over IB addresses specified in the GPU Direct Storage configuration file for key rdma_dev_addr_list in object properties must be consistent with the IBM Storage Scale verbsPorts configuration variable on the GPU Direct Storage clients. Refer to the NVIDIA GPU Direct Storage documentation for more information on the NVIDIA configuration file.

6027-4306 [E] Unable to find an RDMA port on virtual fabric number fabricNumber for GPU Direct Storage I/O.
Explanation:
An RDMA port cannot be found on the given fabric. This is required to perform GPU Direct Storage (GDS) I/O.
User response:
Verify the IBM Storage Scale verbsPorts configuration variable and make sure an RDMA port on the given virtual fabric number is specified.

6027-4307 [E] GPU Direct Storage I/O to node nodeName via rdmaAdapter port portNumber failed.
Explanation:
A GPU Direct Storage RDMA operation targeting the specified remote node failed.
User response:
Verify RDMA network connectivity between the local and the specified remote node. Check all RDMA adapter ports on the local and the remote node.

6027-4308 [E] Internal error when scanning IP addresses (errorCode).
Explanation:
While processing an IP address change event, an internal error occurred.
User response:
Contact support.

6027-4309 [E] Internal error when rebuilding GID cache (errorCode).
Explanation:
While processing a GID change event, an internal error occurred.
User response:
Contact support.

6027-4310 [E] GPU Direct Storage I/O via RDMA adapter rdmaAdapter port portNumber is not possible because port is down.
Explanation:
The specified RDMA port is selected for a GPU Direct Storage I/O operation but the port is in down state. The I/O operation will be retried in compatibility mode resulting in degraded performance.
User response:
Check the RDMA adapter port and the RDMA fabric for errors.

6027-4311 [E] Unable to find an operational RDMA port on virtual fabric number fabricNumber for GPU Direct Storage I/O.
Explanation:
All RDMA ports on the given virtual fabric are in down state. At least one operational port is required to perform GPU Direct Storage (GDS) I/O. The I/O operation will be retried in compatibility mode resulting in degraded performance.
User response:
Check all RDMA adapter ports and the RDMA fabric for errors.
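For the GPU Direct Storage port errors above, the RDMA ports that IBM Storage Scale uses come from the verbsPorts setting, and these must line up with the addresses listed in the NVIDIA configuration file. A minimal sketch only; the adapter names, port and fabric numbers, and the gdsClients node class are placeholders:

   mmlsconfig verbsPorts
   mmchconfig verbsPorts="mlx5_0/1/1 mlx5_1/1/2" -N gdsClients
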
Accessibility features for IBM Storage Scale
Accessibility features help users who have a disability, such as restricted mobility or limited vision, to use
information technology products successfully.

Accessibility features
The following list includes the major accessibility features in IBM Storage Scale:
• Keyboard-only operation
• Interfaces that are commonly used by screen readers
• Keys that are discernible by touch but do not activate just by touching them
• Industry-standard devices for ports and connectors
• The attachment of alternative input and output devices
IBM Documentation, and its related publications, are accessibility-enabled.

Keyboard navigation
This product uses standard Microsoft Windows navigation keys.

IBM and accessibility


See the IBM Human Ability and Accessibility Center (www.ibm.com/able) for more information about the
commitment that IBM has to accessibility.

Notices
This information was developed for products and services offered in the US. This material might be
available from IBM in other languages. However, you may be required to own a copy of the product or
product version in that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program, or
service that does not infringe any IBM intellectual property right may be used instead. However, it is the
user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can
send license inquiries, in writing, to:

IBM Director of Licensing IBM Corporation North Castle Drive, MD-NC119 Armonk, NY 10504-1785 US

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual
Property Department in your country or send inquiries, in writing, to:
Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 19-21, Nihonbashi-
Hakozakicho, Chuo-ku Tokyo 103-8510, Japan
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Director of Licensing IBM Corporation North Castle Drive, MD-NC119 Armonk, NY 10504-1785 US

Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided by
IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any
equivalent agreement between us.
The performance data discussed herein is presented as derived under specific operating conditions.
Actual results may vary.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.
Statements regarding IBM's future direction or intent are subject to change or withdrawal without notice,
and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject to change without
notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to change before the
products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to actual people or business enterprises is
entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform
for which the sample programs are written. These examples have not been thoroughly tested under
all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work must include a copyright
notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs.
© Copyright IBM Corp. _enter the year or years_.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at
"Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or
its affiliates.
The registered trademark Linux is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or
both.
Red Hat, OpenShift®, and Ansible® are trademarks or registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.
UNIX is a registered trademark of the Open Group in the United States and other countries.

Terms and conditions for product documentation


Permissions for the use of these publications are granted subject to the following terms and conditions.

Applicability
These terms and conditions are in addition to any terms of use for the IBM website.

Personal use
You can reproduce these publications for your personal, noncommercial use provided that all proprietary
notices are preserved. You cannot distribute, display, or make derivative work of these publications, or
any portion thereof, without the express consent of IBM.

Commercial use
You can reproduce, distribute, and display these publications solely within your enterprise provided
that all proprietary notices are preserved. You cannot make derivative works of these publications, or
reproduce, distribute, or display these publications or any portion thereof outside your enterprise, without
the express consent of IBM.

Rights
Except as expressly granted in this permission, no other permissions, licenses, or rights are granted,
either express or implied, to the Publications or any information, data, software or other intellectual
property contained therein.
IBM reserves the right to withdraw the permissions that are granted herein whenever, in its discretion, the
use of the publications is detrimental to its interest or as determined by IBM, the above instructions are
not being properly followed.
You cannot download, export, or reexport this information except in full compliance with all applicable
laws and regulations, including all United States export laws and regulations.
IBM MAKES NO GUARANTEE ABOUT THE CONTENT OF THESE PUBLICATIONS. THE PUBLICATIONS
ARE PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT,
AND FITNESS FOR A PARTICULAR PURPOSE.

IBM Privacy Policy
At IBM we recognize the importance of protecting your personal information and are committed to
processing it responsibly and in compliance with applicable data protection laws in all countries in which
IBM operates.
Visit the IBM Privacy Policy for additional information on this topic at https://www.ibm.com/privacy/
details/us/en/.

Glossary
This glossary provides terms and definitions for IBM Storage Scale.
The following cross-references are used in this glossary:
• See refers you from a nonpreferred term to the preferred term or from an abbreviation to the spelled-
out form.
• See also refers you to a related or contrasting term.
For other terms and definitions, see the IBM Terminology website (www.ibm.com/software/globalization/
terminology).

B
block utilization
The measurement of the percentage of used subblocks per allocated blocks.

C
cluster
A loosely coupled collection of independent systems (nodes) organized into a network for the purpose
of sharing resources and communicating with each other. See also GPFS cluster.
cluster configuration data
The configuration data that is stored on the cluster configuration servers.
Cluster Export Services (CES) nodes
A subset of nodes configured within a cluster to provide a solution for exporting GPFS file systems by
using the Network File System (NFS), Server Message Block (SMB), and Object protocols.
cluster manager
The node that monitors node status using disk leases, detects failures, drives recovery, and selects
file system managers. The cluster manager must be a quorum node. The selection of the cluster
manager node favors the quorum-manager node with the lowest node number among the nodes that
are operating at that particular time.
Note: The cluster manager role is not moved to another node when a node with a lower node number
becomes active.
clustered watch folder
Provides a scalable and fault-tolerant method for watching file system activity within an IBM Storage
Scale file system. A clustered watch folder can watch file system activity on a fileset, inode space, or an
entire file system. Events are streamed to an external Kafka sink cluster in an easy-to-parse JSON
format. For more information, see the mmwatch command in the IBM Storage Scale: Command and
Programming Reference Guide.
control data structures
Data structures needed to manage file data and metadata cached in memory. Control data structures
include hash tables and link pointers for finding cached data; lock states and tokens to implement
distributed locking; and various flags and sequence numbers to keep track of updates to the cached
data.

D
Data Management Application Program Interface (DMAPI)
The interface defined by the Open Group's XDSM standard as described in the publication
System Management: Data Storage Management (XDSM) API Common Application Environment (CAE)
Specification C429, The Open Group ISBN 1-85912-190-X.

deadman switch timer
A kernel timer that works on a node that has lost its disk lease and has outstanding I/O requests. This
timer ensures that the node cannot complete the outstanding I/O requests (which would risk causing
file system corruption), by causing a panic in the kernel.
dependent fileset
A fileset that shares the inode space of an existing independent fileset.
disk descriptor
A definition of the type of data that the disk contains and the failure group to which this disk belongs.
See also failure group.
disk leasing
A method for controlling access to storage devices from multiple host systems. Any host that wants to
access a storage device configured to use disk leasing registers for a lease; in the event of a perceived
failure, a host system can deny access, preventing I/O operations with the storage device until the
preempted system has reregistered.
disposition
The session to which a data management event is delivered. An individual disposition is set for each
type of event from each file system.
domain
A logical grouping of resources in a network for the purpose of common management and
administration.

E
ECKD
See extended count key data (ECKD).
ECKD device
See extended count key data device (ECKD device).
encryption key
A mathematical value that allows components to verify that they are in communication with the
expected server. Encryption keys are based on a public or private key pair that is created during the
installation process. See also file encryption key, master encryption key.
extended count key data (ECKD)
An extension of the count-key-data (CKD) architecture. It includes additional commands that can be
used to improve performance.
extended count key data device (ECKD device)
A disk storage device that has a data transfer rate faster than some processors can utilize and that is
connected to the processor through use of a speed matching buffer. A specialized channel program is
needed to communicate with such a device. See also fixed-block architecture disk device.

F
failback
Cluster recovery from failover following repair. See also failover.
failover
(1) The assumption of file system duties by another node when a node fails. (2) The process of
transferring all control of the ESS to a single cluster in the ESS when the other clusters in the ESS fails.
See also cluster. (3) The routing of all transactions to a second controller when the first controller fails.
See also cluster.
failure group
A collection of disks that share common access paths or adapter connections, and could all become
unavailable through a single hardware failure.
FEK
See file encryption key.

fileset
A hierarchical grouping of files managed as a unit for balancing workload across a cluster. See also
dependent fileset, independent fileset.
fileset snapshot
A snapshot of an independent fileset plus all dependent filesets.
file audit logging
Provides the ability to monitor user activity of IBM Storage Scale file systems and store events
related to the user activity in a security-enhanced fileset. Events are stored in an easy-to-parse JSON
format. For more information, see the mmaudit command in the IBM Storage Scale: Command and
Programming Reference Guide.
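For illustration only, and not part of the formal definition: file audit logging is typically enabled and
checked with the mmaudit command. The device name fs1 below is a hypothetical example; see the
mmaudit description in the IBM Storage Scale: Command and Programming Reference Guide for the
authoritative syntax.

   mmaudit fs1 enable    # start recording file audit events for file system fs1 (hypothetical name)
   mmaudit all list      # list the file systems that currently have file audit logging enabled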
file clone
A writable snapshot of an individual file.
file encryption key (FEK)
A key used to encrypt sectors of an individual file. See also encryption key.
file-management policy
A set of rules defined in a policy file that GPFS uses to manage file migration and file deletion. See
also policy.
file-placement policy
A set of rules defined in a policy file that GPFS uses to manage the initial placement of a newly created
file. See also policy.
file system descriptor
A data structure containing key information about a file system. This information includes the disks
assigned to the file system (stripe group), the current state of the file system, and pointers to key files
such as quota files and log files.
file system descriptor quorum
The number of disks needed in order to write the file system descriptor correctly.
file system manager
The provider of services for all the nodes using a single file system. A file system manager processes
changes to the state or description of the file system, controls the regions of disks that are allocated
to each node, and controls token management and quota management.
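As a brief illustration (the device name fs1 is hypothetical), the node that is currently acting as file
system manager can be displayed with the mmlsmgr command:

   mmlsmgr fs1    # show the file system manager node for file system fs1
   mmlsmgr        # show the managers for all file systems, plus the cluster manager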
fixed-block architecture disk device (FBA disk device)
A disk device that stores data in blocks of fixed size. These blocks are addressed by block number
relative to the beginning of the file. See also extended count key data device.
fragment
The space allocated for an amount of data too small to require a full block. A fragment consists of one
or more subblocks.

G
GPUDirect Storage
IBM Storage Scale's support for NVIDIA's GPUDirect Storage (GDS) enables a direct path between
GPU memory and storage. File system storage is connected directly to the GPU buffers to reduce
latency and CPU load. Data is read directly from an NSD server's pagepool and sent to the GPU
buffers of the IBM Storage Scale clients by using RDMA.
global snapshot
A snapshot of an entire GPFS file system.
GPFS cluster
A cluster of nodes defined as being available for use by GPFS file systems.
GPFS portability layer
The interface module that each installation must build for its specific hardware platform and Linux
distribution.
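A minimal sketch of building the portability layer on a Linux node, assuming the kernel headers and
development packages for the running kernel are already installed:

   /usr/lpp/mmfs/bin/mmbuildgpl    # build and install the portability layer for the running kernel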

GPFS recovery log
A file that contains a record of metadata activity and exists for each node of a cluster. In the event of
a node failure, the recovery log for the failed node is replayed, restoring the file system to a consistent
state and allowing other nodes to continue working.

I
ill-placed file
A file assigned to one storage pool but having some or all of its data in a different storage pool.
ill-replicated file
A file with contents that are not correctly replicated according to the desired setting for that file. This
situation occurs in the interval between a change in the file's replication settings or suspending one of
its disks, and the restripe of the file.
independent fileset
A fileset that has its own inode space.
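For illustration only (fs1, fset1, and fset2 are hypothetical names), the --inode-space new option of
mmcrfileset creates an independent fileset with its own inode space; omitting it creates a dependent
fileset:

   mmcrfileset fs1 fset1 --inode-space new    # independent fileset with its own inode space
   mmcrfileset fs1 fset2                      # dependent fileset that shares an existing inode space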
indirect block
A block containing pointers to other blocks.
inode
The internal structure that describes the individual files in the file system. There is one inode for each
file.
inode space
A collection of inode number ranges reserved for an independent fileset, which enables more efficient
per-fileset functions.
ISKLM
IBM Security Key Lifecycle Manager. For GPFS encryption, the ISKLM is used as an RKM server to
store MEKs.

J
journaled file system (JFS)
A technology designed for high-throughput server environments, which are important for running
intranet and other high-performance e-business file servers.
junction
A special directory entry that connects a name in a directory of one fileset to the root directory of
another fileset.

K
kernel
The part of an operating system that contains programs for such tasks as input/output, management
and control of hardware, and the scheduling of user tasks.

M
master encryption key (MEK)
A key used to encrypt other keys. See also encryption key.
MEK
See master encryption key.
metadata
Data structures that contain information that is needed to access file data. Metadata includes inodes,
indirect blocks, and directories. Metadata is not accessible to user applications.
metanode
The one node per open file that is responsible for maintaining file metadata integrity. In most cases,
the node that has had the file open for the longest period of continuous time is the metanode.

mirroring
The process of writing the same data to multiple disks at the same time. The mirroring of data
protects it against data loss within the database or within the recovery log.
Microsoft Management Console (MMC)
A Windows tool that can be used to do basic configuration tasks on an SMB server. These tasks
include administrative tasks such as listing or closing the connected users and open files, and creating
and manipulating SMB shares.
multi-tailed
A disk connected to multiple nodes.

N
namespace
Space reserved by a file system to contain the names of its objects.
Network File System (NFS)
A protocol, developed by Sun Microsystems, Incorporated, that allows any host in a network to gain
access to another host or netgroup and their file directories.
Network Shared Disk (NSD)
A component for cluster-wide disk naming and access.
NSD volume ID
A unique 16-digit hex number that is used to identify and access all NSDs.
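As a hedged example (no particular cluster is assumed), the mapping between NSD names, NSD volume
IDs, and local device names can be listed with:

   mmlsnsd -m    # map each NSD name and NSD volume ID to its local device on the server nodes
   mmlsnsd -X    # show extended NSD information, including device type and remarks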
node
An individual operating-system image within a cluster. Depending on the way in which the computer
system is partitioned, it may contain one or more nodes.
node descriptor
A definition that indicates how GPFS uses a node. Possible functions include: manager node, client
node, quorum node, and nonquorum node.
node number
A number that is generated and maintained by GPFS as the cluster is created, and as nodes are added
to or deleted from the cluster.
node quorum
The minimum number of nodes that must be running in order for the daemon to start.
node quorum with tiebreaker disks
A form of quorum that allows GPFS to run with as little as one quorum node available, as long as there
is access to a majority of the quorum disks.
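A minimal sketch, assuming three existing NSDs with the hypothetical names tb1, tb2, and tb3, of
configuring node quorum with tiebreaker disks:

   mmchconfig tiebreakerDisks="tb1;tb2;tb3"    # designate up to three NSDs as tiebreaker disks
   mmlsconfig tiebreakerDisks                  # verify the current tiebreaker disk setting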
non-quorum node
A node in a cluster that is not counted for the purposes of quorum determination.
Non-Volatile Memory Express (NVMe)
An interface specification that allows host software to communicate with non-volatile memory
storage media.

P
policy
A list of file-placement, service-class, and encryption rules that define characteristics and placement
of files. Several policies can be defined within the configuration, but only one policy set is active at one
time.
policy rule
A programming statement within a policy that defines a specific action to be performed.
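As an illustrative sketch only (the pool names system and data are hypothetical), file-placement rules are
written in the policy language and installed with the mmchpolicy command:

   /* Place new log files in the system pool; place everything else in the data pool. */
   RULE 'logs'    SET POOL 'system' WHERE UPPER(NAME) LIKE '%.LOG'
   RULE 'default' SET POOL 'data'

A policy file such as this would typically be activated with a command like mmchpolicy fs1 policyfile,
where fs1 and policyfile are again hypothetical names.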
pool
A group of resources with similar characteristics and attributes.

portability
The ability of a programming language to compile successfully on different operating systems without
requiring changes to the source code.
primary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data.
private IP address
An IP address used to communicate on a private network.
public IP address
An IP address used to communicate on a public network.

Q
quorum node
A node in the cluster that is counted to determine whether a quorum exists.
quota
The amount of disk space and number of inodes assigned as upper limits for a specified user, group of
users, or fileset.
quota management
The allocation of disk blocks to the other nodes writing to the file system, and comparison of the
allocated space to quota limits at regular intervals.
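For illustration only (fs1 and fset1 are hypothetical names), current quota usage and limits can be
inspected with:

   mmlsquota -j fset1 fs1    # show the fileset quota for fileset fset1 in file system fs1
   mmrepquota -u fs1         # produce a per-user quota report for file system fs1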

R
Redundant Array of Independent Disks (RAID)
A collection of two or more physical disk drives that present to the host an image of one or more
logical disk drives. In the event of a single physical device failure, the data can be read or regenerated
from the other disk drives in the array due to data redundancy.
recovery
The process of restoring access to file system data when a failure has occurred. Recovery can involve
reconstructing data or providing alternative routing through a different server.
remote key management server (RKM server)
A server that is used to store master encryption keys.
replication
The process of maintaining a defined set of data in more than one location. Replication consists of
copying designated changes for one location (a source) to another (a target) and synchronizing the
data in both locations.
RKM server
See remote key management server.
rule
A list of conditions and actions that are triggered when certain conditions are met. Conditions include
attributes about an object (file name, type or extension, dates, owner, and groups), the requesting
client, and the container name associated with the object.

S
SAN-attached
Disks that are physically attached to all nodes in the cluster using Serial Storage Architecture (SSA)
connections or using Fibre Channel switches.
Scale Out Backup and Restore (SOBAR)
A specialized mechanism for data protection against disaster only for GPFS file systems that are
managed by IBM Storage Protect for Space Management.
secondary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data in the event that
the primary GPFS cluster configuration server fails or becomes unavailable.

Secure Hash Algorithm digest (SHA digest)
A character string used to identify a GPFS security key.
session failure
The loss of all resources of a data management session due to the failure of the daemon on the
session node.
session node
The node on which a data management session was created.
Small Computer System Interface (SCSI)
An ANSI-standard electronic interface that allows personal computers to communicate with
peripheral hardware, such as disk drives, tape drives, CD-ROM drives, printers, and scanners faster
and more flexibly than previous interfaces.
snapshot
An exact copy of changed data in the active files and directories of a file system or fileset at a single
point in time. See also fileset snapshot, global snapshot.
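A hedged example (fs1 and snap1 are hypothetical names) of creating and then listing a global snapshot:

   mmcrsnapshot fs1 snap1    # create a global snapshot named snap1 of file system fs1
   mmlssnapshot fs1          # list the snapshots of file system fs1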
source node
The node on which a data management event is generated.
stand-alone client
The node in a one-node cluster.
storage area network (SAN)
A dedicated storage network tailored to a specific environment, combining servers, storage products,
networking products, software, and services.
storage pool
A grouping of storage space consisting of volumes, logical unit numbers (LUNs), or addresses that
share a common set of administrative characteristics.
stripe group
The set of disks comprising the storage assigned to a file system.
striping
A storage process in which information is split into blocks (a fixed amount of data) and the blocks are
written to (or read from) a series of disks in parallel.
subblock
The smallest unit of data accessible in an I/O operation, equal to one thirty-second of a data block.
system storage pool
A storage pool containing file system control structures, reserved files, directories, symbolic links,
special devices, as well as the metadata associated with regular files, including indirect blocks and
extended attributes. The system storage pool can also contain user data.

T
token management
A system for controlling file access in which each application performing a read or write operation
is granted some form of access to a specific block of file data. Token management provides data
consistency and controls conflicts. Token management has two components: the token management
server, and the token management function.
token management function
A component of token management that requests tokens from the token management server. The
token management function is located on each cluster node.
token management server
A component of token management that controls tokens relating to the operation of the file system.
The token management server is located at the file system manager node.
transparent cloud tiering (TCT)
A separately installable add-on feature of IBM Storage Scale that provides a native cloud storage tier.
It allows data center administrators to free up on-premise storage capacity, by moving out cooler data
to the cloud storage, thereby reducing capital and operational expenditures.

twin-tailed
A disk connected to two nodes.

U
user storage pool
A storage pool containing the blocks of data that make up user files.

V
VFS
See virtual file system.
virtual file system (VFS)
A remote file system that has been mounted so that it is accessible to the local user.
virtual node (vnode)
The structure that contains information about a file system object in a virtual file system (VFS).

W
watch folder API
Provides a programming interface with which a custom C program can be written to monitor inode
spaces, filesets, or directories for specific user activity-related events within IBM Storage Scale file
systems. For more information, see the sample program tswf, which is provided in the
/usr/lpp/mmfs/samples/util directory on IBM Storage Scale nodes and can be modified according to
the user's needs.

Index

Special Characters AFM to cloud object storage


download and upload 209
.ptrash directory 251 issues 517
.rhosts 353 mmperfmon 208, 209
.snapshots 400, 402 monitoring commands 207–209
/etc/filesystems 378 troubleshooting 517
/etc/fstab 378 AFM, extended attribute size supported by 251
/etc/hosts 352 AFM, messages requeuing 404
/etc/resolv.conf 375 AIX
/tmp/mmfs 249, 555 kernel debugger 331
/usr/lpp/mmfs/bin 357 AIX error logs
/usr/lpp/mmfs/bin/runmmfs 284 MMFS_DISKFAIL 410
/usr/lpp/mmfs/samples/gatherlogs.samples.sh file 258 MMFS_QUOTA 383
/var/adm/ras/mmfs.log.previous 363 unavailable disks 383
/var/mmfs/etc/mmlock 354 AIX logical volume
/var/mmfs/gen/mmsdrfs 355 down 413
AIX platform
gpfs.snap command 297
A application program errors 366
access application programs
to disk 411 errors 275, 277, 361, 365
ACCESS_TIME attribute 324 audit
accessibility features for IBM Storage Scale 923 fileset 524
active directory authentication audit events
declared DNS servers 438 Cloud services 726
active file management, questions related to 250 audit messages 258
AD authentication audit types 233
nameserver issues 438 authentication
Adding new sensors 110 problem determination 436
administration commands Authentication error events 437
failure 354 Authentication errors
administration, collector node 215 SSD process not running (sssd_down) 437
AFM sssd_down 437
callback events 194 Winbind process not running (wnbd_down) 437
fileset states 185, 205 wnbd_down 437
issues 511 yp_down 437
mmdiag 197 YPBIND process not running (yp_down) 437
mmhealth 194 authentication events ,,
mmperfmon 196, 208 dns_krb_tcp_dc_msdcs_down 438
mmpmon 196 dns_krb_tcp_down 438
monitor prefetch 197 dns_ldap_tcp_dc_msdcs_down 438
monitoring commands 194–197, 208 dns_ldap_tcp_down 438
monitoring policies 199 authorization error 353
monitoring using GUI 200 authorization issues
troubleshooting 511 IBM Storage Scale 438
AFM DR autofs 386
callback events 194 autofs mount 385
fileset states 189 autoload option
issues 515 on mmchconfig command 358
mmdiag 197 on mmcrcluster command 358
mmhealth 194 automatic backup of cluster data 356
mmperfmon 196 automount 380, 385
mmpmon 196 automount daemon 385
monitoring commands 194–197 automount failure 385–387
monitoring policies 199
monitoring using GUI 200
AFM fileset, changing mode of 250

B cluster security configuration 388
cluster state information 312
back up data 245 cluster status information, SNMP 219
backup clustered watch folder 237, 238
automatic backup of cluster data 356 collecting details of issues by using dumps 283
best practices for troubleshooting 245 Collecting details of issues by using logs, dumps, and traces
block allocation 485 283
Broken cluster recovery collector
mmsdrcommand 553 performance monitoring tool
multiple nodes 552 configuring 106, 114, 116
no CCR backup 552 migrating 116
single node 552 collector node
installing MIB files 214
commands
C cluster state information 312
call home conflicting invocation 377
mmcallhome 227 errpt 555
monitoring IBM Storage Scale system remotely 227 file system and disk information 317
callback events gpfs.snap 295–298, 301, 302, 308, 555
AFM 194 grep 274
candidate file lslpp 556
attributes 324 lslv 248
CCR 552, 553 lsof 318, 381, 382
CES lspv 414
IP removal 273 lsvg 413
monitoring 272, 273 lxtrace 283, 312
troubleshooting 272, 273 mmadddisk 392, 398, 413, 415, 422
upgrade error 273 mmaddnode 248, 351, 382
CES administration 272, 273 mmafmctl 404
CES collection 287–293 mmafmctl Device getstate 313
ces configuration issues 365 mmapplypolicy 319, 395, 396, 399, 436
CES exports mmauth 329, 388
NFS mmbackup 403
warning 49 mmchcluster 353
SMB mmchconfig 315, 358, 382, 391
warning 49 mmchdisk 378, 392, 398, 399, 407, 410, 415, 425
CES logs 263 mmcheckquota 276, 325, 366, 383
CES monitoring 272, 273 mmchfs 277, 357, 363, 378, 380, 383, 385, 454, 497
CES tracing 287–293 mmchnode 215, 216, 248
changing mode of AFM fileset 250 mmchnsd 407
checking, Persistent Reserve 426 mmcommon recoverfs 393
chosen file 320, 321 mmcommon showLocks 355
CIFS serving, Windows SMB2 protocol 367 mmcrcluster 248, 315, 351, 353, 358
cipherList 390 mmcrfs 363, 364, 407, 415, 454
Clearing a leftover Persistent Reserve reservation 426 mmcrnsd 407, 410
client node 380 mmcrsnapshot 400, 402
clock synchronization 258, 395 mmdeldisk 392, 398, 413, 415
Cloud Data sharing mmdelfileset 397
audit events 726 mmdelfs 424
Cloud services mmdelnode 351, 363
health status 713 mmdelnsd 410, 424
Cloud services audit events 726 mmdelsnapshot 401
cluster mmdf 393, 413, 496
deleting a node 363 mmdiag 313
cluster configuration information mmdsh 354
displaying 314 mmdumpperfdata 310
cluster configuration information, SNMP 219 mmexpelnode 316
Cluster Export Services mmfileid 327, 404, 415
administration 272, 273 mmfsadm 287, 312, 360, 404, 415, 497
issue collection 287, 289–293 mmfsck 250, 318, 377, 398, 404, 413, 423
monitoring 272, 273 mmgetstate 313, 359, 363
tracing 287, 289–293 mmlsattr 396, 397
cluster file systems mmlscluster 215, 248, 314, 351, 389
displaying 315 mmlsconfig 283, 315, 385

commands (continued) Configuring webhook
mmlsdisk 364, 377, 378, 383, 392, 407, 410, 415, 425, WEBHOOK 44
556 connectivity problems 354
mmlsfileset 397 contact node address 388
mmlsfs 379, 422, 424, 556 contact node failure 389
mmlsmgr 312, 378 crash files for Ubuntu 283
mmlsmount 318, 358, 365, 377, 381, 382, 407 create custom events 19
mmlsnsd 326, 408, 413 creating a file, failure 435
mmlspolicy 396 cron 249
mmlsquota 365, 366 Custom defined events 19
mmlssnapshot 400–402
mmmount 318, 377, 383, 415
mmperfmon 110, 150, 158, 170
D
mmpmon 59, 60, 62, 64–83, 85–87, 90, 92–95, 97, 98, data
104, 332, 498, 499 replicated 422
mmquotaoff 366 data always gathered by gpfs.snap
mmquotaon 366 for a master snapshot 302
mmrefresh 315, 378, 385 on AIX 297
mmremotecluster 329, 388–390 on all platforms 296
mmremotefs 385, 389 on Linux 298
mmrepquota 366 on Windows 301
mmrestorefs 401, 402 Data always gathered for an Object on Linux 304
mmrestripefile 396, 399 Data always gathered for authentication on Linux 307
mmrestripefs 399, 413, 423 Data always gathered for CES on Linux 306
mmrpldisk 392, 398, 415 Data always gathered for NFS on Linux 304
mmsdrrestore 316 Data always gathered for performance on Linux 309
mmshutdown 216, 314, 315, 358–361, 386, 391 Data always gathered for SMB on Linux 303
mmsnapdir 400, 402 data block
mmstartup 358, 386 replica mismatches 417
mmumount 381, 383, 413 data block replica mismatches 416, 417
mmunlinkfileset 397 data collection 303
mmwindisk 326 data file issues
mount 377, 378, 380, 415, 424 cluster configuration 354
ping 354 data gathered by
rcp 353 gpfs.snap on Linux 303
rpm 556 data gathered by gpfs.snap for call home entries 308
rsh 353, 363 data integrity 277, 404
scp 353 Data Management API (DMAPI)
ssh 353 file system does not mount 379
umount 382, 383 data replication 412
varyonvg 415 data structure 275
commands, administration data update
failure 354 file system with snapshot 495
common issues and workarounds fileset with snapshot 495
transparent cloud tiering 519 dataOnly attribute 398
communication paths dataStructureDump 283
unavailable 378 deadlock
compiling mmfslinux module 357 automated breakup 336
config populate breakup on demand 337
object endpoint issue 350 deadlocks
configuration automated detection 334, 335
hard loop ID 352 debug data 333
performance tuning 352 information about 333
configuration data 392 log 333
configuration parameters deadman switch timer 429
kernel 357 debug data
configuration problems 350 deadlocks 333
configuration variable settings debug data collection
displaying 315 CES tracing 287–293
configuring delays 496, 497
API keys 153 DELETE rule 320, 322
configuring Net-SNMP 212 deleting a node
configuring SNMP for use with IBM Storage Scale 213 from a cluster 363
configuring SNMP-based management applications 213 descOnly 384

diagnostic data encryption issues (continued)
deadlock diagnostics 294 issues with adding encryption policy 435
standard diagnostics 294 permission denied message 435
directed maintenance procedure ERRNO I/O error code 363
activate AFM 536 error
activate NFS 536 events 371
activate SMB 536 error codes
configure NFS sensors 537 EINVAL 395
configure SMB sensors 537 EIO 275, 407, 423, 424
increase fileset space 534, 539 ENODEV 361
mount file system 538 ENOENT 382
start gpfs daemon 534 ENOSPC 393, 423
start NSD 533 ERRNO I/O 363
start performance monitoring collector service 535 ESTALE 277, 361, 382
start performance monitoring sensor service 535 NO SUCH DIRECTORY 361
start the GUI service 538 NO SUCH FILE 361
synchronize node clocks 534 error log
directories MMFS_LONGDISKIO 276
.snapshots 400, 402 MMFS_QUOTA 276
/tmp/mmfs 249, 555 error logs
directory that has not been cached, traversing 251 AFM 279
disabling IPv6 example 277
for SSH connection delays 375 MMFS_ABNORMAL_SHUTDOWN 275
disabling Persistent Reserve manually 428 MMFS_DISKFAIL 275
disaster recovery MMFS_ENVIRON 275
other problems 478 MMFS_FSSTRUCT 275
setup problems 477 MMFS_GENERIC 275
disk access 411 MMFS_LONGDISKIO 276
disk commands MMFS_QUOTA 276, 325
hang 415 MMFS_SYSTEM_UNMOUNT 277
disk configuration information, SNMP 224 MMFS_SYSTEM_WARNING 277
disk connectivity failure 412 operating system 274
disk descriptor replica 383 error messages
disk failover 412 6027-1209 362
disk failure 424 6027-1242 354
disk leasing 429 6027-1290 393
disk performance information, SNMP 224 6027-1598 351
disk recovery 412 6027-1615 354
disk status information, SNMP 223 6027-1617 354
disk subsystem 6027-1627 364
failure 407 6027-1628 355
diskReadExclusionList 417 6027-1630 355
disks 6027-1631 355
damaged files 327 6027-1632 355
declared down 410 6027-1633 355
displaying information of 326 6027-1636 408
failure 275, 277, 407 6027-1661 408
media failure 415 6027-1662 410
partial failure 413 6027-1995 403
replacing 393 6027-1996 391
usage 384 6027-2108 408
disks down 413 6027-2109 408
disks, viewing 326 6027-300 358
displaying disk information 326 6027-306 360
displaying NSD information 408 6027-319 359, 360
DMP 533, 539 6027-320 360
DNS server failure 388 6027-321 360
6027-322 360
6027-341 357, 361
E 6027-342 357, 361
enable 6027-343 357, 361
performance monitoring sensors 112 6027-344 357, 361
enabling Persistent Reserve manually 428 6027-361 412
encryption issues 6027-418 384, 424

error messages (continued) event notifications
6027-419 379, 384 emails 4
6027-435 477 events
6027-473 384 AFM events 559
6027-474 384 authentication events 565
6027-482 379, 424 Call Home events 570
6027-485 424 CES Network events 573
6027-490 477 CESIP events 578
6027-506 366 cluster state events 578
6027-533 497 disk events 582
6027-538 364 Enclosure events 584
6027-549 379 Encryption events 593
6027-580 379 File audit logging events 597
6027-631 392 file system events 602
6027-632 392 Filesysmgr events 613
6027-635 392 GDS events 615
6027-636 392, 424 GPFS events 615
6027-638 392 GUI events 634
6027-645 379 hadoop connector events 645
6027-650 361 HDFS data node events 645
6027-663 364 HDFS name node events 646
6027-665 358, 364 keystone events 648
6027-695 366 Local cache events 650
6027-953 403 network events 652
ANS1312E 403 NFS events 660
cluster configuration data file issues 355 notifications
descriptor replica 477 snmp 5
disk media failures 424 NVMe events 671
failed to connect 358, 412 NVMeoF events 674
file system forced unmount problems 384 object events 675
file system mount problems 379 performance events 685
GPFS cluster data recovery 355 Server RAID events 687
IBM Storage Protect 403 SMB events 687
incompatible version number 360 stretch cluster events 691
mmbackup 403 TCT events 693
mmfsd ready 358 Threshold events 708
multiple file system manager failures 392 Watchfolder events 710
network problems 360 example
quorum 477 error logs 277
rsh problems 354 EXCLUDE rule 323, 324
shared segment problems 359, 360 excluded file
snapshot 400–402 attributes 324
error numbers extended attribute size supported by AFM 251
application calls 380
configuration problems 357
data corruption 404
F
EALL_UNAVAIL = 218 384 facility
ECONFIG = 208 357 Linux kernel crash dump (LKCD) 331
ECONFIG = 215 357, 361 failure
ECONFIG = 218 357 disk 410
ECONFIG = 237 357 mmccr command 250
ENO_MGR = 212 392, 425 mmfsck command 250
ENO_QUOTA_INST = 237 380 of disk media 415
EOFFLINE = 208 424 snapshot 400
EPANIC = 666 384 failure creating a file 435
EVALIDATE = 214 404 failure group 383
file system forced unmount 384 failure groups
GPFS application calls 424 loss of 384
GPFS daemon will not come up 361 use of 383
installation problems 357 failure to append messages 524
multiple file system manager failures 392 failure, key rewrap 436
errors, application program 366 failure, mount 435
errors, Persistent Reserve 425 failures
errpt command 555 mmbackup 403

file audit logging files (continued)
issues 523, 524 /etc/resolv.conf 375
JSON 523 /usr/lpp/mmfs/bin/runmmfs 284
logs 278 /usr/lpp/mmfs/samples/gatherlogs.samples.sh 258
monitor 233 /var/adm/ras/mmfs.log.previous 363
monitoring 233 /var/mmfs/etc/mmlock 354
states 233 /var/mmfs/gen/mmsdrfs 355
troubleshooting 523 detecting damage 327
file audit logs 235 mmfs.log 358, 359, 361, 377, 380, 382, 386–390, 555
File Authentication protocol authentication log 270
setup problems 436 fileset
file creation failure 435 issues 524
file level FILESET_NAME attribute 324
replica 418 filesets
file level replica 418 child 397
file migration deleting 397
problems 396 emptying 397
file placement policy 396 errors 398
file system lost+found 398
mount status 391 moving contents 397
space 393 performance 397
File system problems 393
high utilization 488 snapshots 397
file system descriptor unlinking 397
failure groups 383 usage errors 397
inaccessible 384 find custom events 19
file system manager FSDesc structure 383
cannot appoint 382 full file system or fileset 251
contact problems
communication paths unavailable 378
multiple failures 391, 392
G
file system mount failure 435 Ganesha 2.7
file system or fileset getting full 251 Unknown parameters
file system performance information, SNMP 222 Dispatch_Max_Reqs 451
file system status information, SNMP 221 GDS 431
file systems generate
cannot be unmounted 318 trace reports 283
creation failure 363 generating GPFS trace reports
determining if mounted 391 mmtracectl command 283
discrepancy between configuration data and on-disk getting started with troubleshooting 245
data 392 GPFS
displaying statistics 62, 65 /tmp/mmfs directory 249
do not mount 377 abnormal termination in mmpmon 498
does not mount 377 active file management 250
does not unmount 381 AFM 404
forced unmount 277, 382, 391, 392 AIX 386
free space shortage 402 application program errors 366
issues 524 authentication issues 436
listing mounted 318 automatic failure 543
loss of access 365 automatic recovery 543
remote 387 automount 385
remotely mounted 524 automount failure 386
reset statistics 75 automount failure in Linux 385
state after restore 402 checking Persistent Reserve 426
unable to determine if mounted 391 cipherList option has not been set properly 390
will not mount 318 clearing a leftover Persistent Reserve reservation 426
FILE_SIZE attribute 324 client nodes 380
files cluster configuration
.rhosts 353 issues 355
/etc/filesystems 378 cluster name 389
/etc/fstab 378 cluster security configurations 388
/etc/group 276 cluster state information commands 312–316
/etc/hosts 352 command 295–298, 301, 302, 308, 313
/etc/passwd 276 configuration data 392

942 IBM Storage Scale 5.1.9: Problem Determination Guide


GPFS (continued) GPFS (continued)
contact node address 388 file systems manager failure 391, 392
contact nodes down 389 filesets usage 397
core dumps 281 forced unmount 382
corrupted data integrity 404 gpfs.snap 295–298, 301, 302, 308
create script 17 guarding against disk failures 412
data gathered for protocol on Linux 303, 304, 306–309 GUI logs 294
data integrity 404 hang in mmpmon 498
data integrity may be corrupted 404 health of integrated SMB server 452
deadlocks 333, 336, 337 IBM Storage Protect error messages 403
delays and deadlocks 496 ill-placed files 395
determine if a file system is mounted 391 incorrect output from mmpmon 498
determining the health of integrated SMB server 452 indirect issues with snapshot 400
disaster recovery issues 477 installation and configuration issues 339, 351, 352,
discrepancy between GPFS configuration data and the 354, 358, 360–362, 364, 366
on-disk data for a file system 392 installation toolkit issues 347–350, 352, 369, 370
disk accessing command failure 415 installing 282
disk connectivity failure 412 installing on Linux nodes 282
disk failure 412, 424 integrated SMB server 452
disk information commands 317–319, 326, 327 issues while working with Samba 454
disk issues 407, 429 issues with snapshot 400, 401
disk media failure 415, 424 key rewrap 436
disk recovery 412 limitations 253
disk subsystem failures 407 local node failure 390
displaying NSD information 408 locating snapshot 400
encryption rules 435 logs 256
error creating internal storage 250 manually disabling Persistent Reserve 428
error encountered while creating NSD disks 407 manually enabling Persistent Reserve 428
error encountered while using NSD disks 407 mapping 387
error message 250, 392 message 6027-648 249
error message "Function not implemented" 387 message referring to an existing NSD 410
error messages 401–403, 424 message requeuing 404
error messages for file system 379, 380 message requeuing in AFM 404
error messages for file system forced unmount message severity tags 728
problems 384 messages 728
error messages for file system mount status 391 mmafmctl Device getstate 313
error messages for indirect snapshot errors 400 mmapplypolicy -L command 320–324
error messages not directly related to snapshots 400 mmbackup command 403
error messages related to snapshots 400, 401 mmbackup errors 403
error numbers 384, 392, 424 mmdumpperfdata command 310
error numbers specific to GPFS application calls 404 mmexpelnode command 316
errors 396, 397, 425 mmfsadm command 312
errors associated with filesets 393 mmpmon 498
errors associated with policies 393 mmpmon command 499
errors associated with storage pools, 393 mmpmon output 498
errors encountered 399 mmremotecluster command 389
errors encountered with filesets 398 mount 249, 385, 387
events 17, 559, 565, 570, 573, 578, 582, 584, 593, mount failure 380, 435
597, 602, 613, 615, 634, 645, 646, 648, 650, 652, 660, mounting cluster 390
671, 674, 675, 685, 687, 691, 693, 708, 710 mounting cluster does not have direct access to the
failure group considerations 383 disks 390
failures using the mmbackup command 403 multipath device 428
file system 381, 382, 435 multiple file system manager failures 391, 392
file system commands 317–319, 326, 327 negative values in the 'predicted pool utilizations', 395
file system failure 377 network issues 375
file system has adequate free space 393 NFS client 443
file system is forced to unmount 384 NFS problems 443
file system is mounted 391 NFS V4 405
file system issues 377 NFS V4 issues 405
file system manager appointment fails 392 NFS V4 problem 405
file system manager failures 392 no replication 423
file system mount problems 379, 380, 385 NO_SPACE error 393
file system mount status 391 nodes do not start 360
file system mounting 250 NSD creation failure 410

Index 943
GPFS (continued) GPFS (continued)
NSD disk does not have an NSD server specified 390 remote file system 387, 388
NSD information 408 remote file system does not mount 387, 388
NSD is down 410 remote file system I/O failure 387
NSD server 380 remote mount failure 391
NSD subsystem failures 407 replicated data 422
NSDs built on top of AIX logical volume is down 413 replicated metadata 422, 423
offline mmfsck command failure 250 replication 412, 423
old inode data 443 Requeing message 404
on-disk data 392 requeuing of messages in AFM 404
Operating system error logs 274 restoring a snapshot 402
partial disk failure 413 Samba 454
permission denied error message 391 security issues 436
permission denied failure 436 set up 281
Persistent Reserve errors 425 setup issues 498
physical disk association 248 SMB server health 452
physical disk association with logical volume 248 snapshot directory name conflict 402
policies 395, 396 snapshot problems 400
predicted pool utilizations 395 snapshot status errors 401
problem determination 339, 342, 343, 346 snapshot usage errors 400, 401
problem determination hints 247 some files are 'ill-placed' 395
problem determination tips 247 stale inode data 443
problems not directly related to snapshots 400 storage pools 398, 399
problems while working with Samba in 454 strict replication 423
problems with locating a snapshot 400 system load increase in night 249
problems with non-IBM disks 415 timeout executing function error message 250
protocol service logs 270, 272, 273, 287–293 trace facility 283
quorum nodes in cluster 248 tracing the mmpmon command 499
RAS events troubleshooting 339, 342, 343, 346
AFM events 559 UID mapping 387
authentication events 565 unable to access disks 411
Call Home events 570 unable to determine if a file system is mounted 391
CES Network events 573 unable to start 351
CESIP events 578 underlying disk subsystem failures 407
cluster state events 578 understanding Persistent Reserve 425
disk events 582 unmount failure 381
Enclosure events 584 unused underlying multipath device 428
Encryption events 593 upgrade failure 544
File audit logging events 597 upgrade issues 370
file system events 602 upgrade recovery 544
Filesysmgr events 613 usage errors 395, 398
GDS events 615 using mmpmon 498
GPFS events 615 value to large failure 435
GUI events 634 value to large failure while creating a file 435
hadoop connector events 645 varyon problems 414
HDFS data node events 645 volume group 414
HDFS name node events 646 volume group on each node 414
keystone events 648 Windows file system 249
Local cache events 650 Windows issues 366, 367
network events 652 working with Samba 454
NFS events 660 GPFS cluster
Nvme events 671 problems adding nodes 351
NvmeoF events 674 recovery from loss of GPFS cluster configuration data
object events 675 files 355
performance events 685 GPFS cluster data
Server raid events 687 locked 354
SMB events 687 GPFS cluster data files storage 355
stretch cluster events 691 GPFS command
TCT events 693 failed 362
Threshold events 708 return code 362
Watchfolder events 710 unsuccessful 363
RDMA atomic operation issues 371 GPFS commands
remote cluster name 389 mmpmon 62
remote command issues 353, 354 unsuccessful 362

944 IBM Storage Scale 5.1.9: Problem Determination Guide


GPFS configuration data 392 GUI (continued)
GPFS configuration parameters system health
low values 481 overview 1
GPFS daemon GUI issues
crash 361 gui automatically logs off users 510
fails to start 358 GUI login page does not open 503
went down 276, 361 GUI logs 294
will not start 358 GUI refresh tasks 506
GPFS daemon went down 361
GPFS failure
network failure 375
H
GPFS GUI logs 294 hard loop ID 352
GPFS is not using the underlying multipath device 428 HDFS
GPFS kernel extension 357 transparency log 270
GPFS local node failure 390 health monitoring
GPFS log 358, 359, 361, 377, 380, 382, 386–390, 555 features 13, 22, 24
GPFS logs status 18
master GPFS log file 258 Health monitoring
GPFS messages 730 features 313
GPFS modules Health status
cannot be loaded 357 Monitoring 25, 29, 33, 36, 37, 39, 43
unable to load on Linux 357 hints and tips for GPFS problems 247
GPFS problems 339, 377, 407 Home and .ssh directory ownership and permissions 366
GPFS startup time 258
GPFS support for SNMP 211
GPFS trace facility 283 I
GPFS Windows SMB2 protocol (CIFS serving) 367
I/O failure
gpfs.readReplicaRule
remote file system 387
example 420
I/O hang 429
gpfs.snap 303
I/O operations slow 276
gpfs.snap command
IBM Spectrum Scale
data always gathered for a master snapshot 302
Active File Management 194–197, 207–209, 511
data always gathered on AIX 297
Active File Management DR 194–197, 515
data always gathered on all platforms 296
AFM
data always gathered on Linux 298
callback events 194
data always gathered on Windows 301
issues 511
data gathered by gpfs.snap for call home entries 308
monitor prefetch 197
using 295
monitoring commands 195–197, 208, 209
GPUDirect Storage
AFM DR
troubleshooting 431
callback events 194
Grafana 158, 176
issues 515
grep command 274
monitoring commands 195–197, 208
Group Services
AFM to cloud object storage
verifying quorum 359
fileset states 205
GROUP_ID attribute 324
issues 517
gui
monitoring commands 207–209
event notifications
best practices for troubleshooting 245
snmp 5
IBM Storage Protect client 403
GUI
IBM Storage Protect server
capacity information is not available 509
MAXNUMMP 403
directed maintenance procedure 533, 539
IBM Storage Scale
displaying outdated information 506
/tmp/mmfs directory 249
DMP 533, 539
abnormal termination in mmpmon 498
file audit logging 235
active directory 438
GUI fails to restart 502
active file management 250
GUI fails to start 501, 503
Active File Management 185, 194, 199
GUI issues 501
Active File Management DR 189, 194, 199
login page does not open 503
Active File Management error logs 279
logs 294
active tracing 494
monitoring AFM and AFM DR 200
AD authentication 438
performance monitoring 158
add new nodes 482
performance monitoring issues 504
AFM
performance monitoring sensors 112
fileset states 185
server was unable to process the request error 506

IBM Storage Scale (continued) IBM Storage Scale (continued)
AFM (continued) data gathered for SMB on Linux 303
monitoring commands 194 data integrity may be corrupted 404
monitoring policies 199 deadlock breakup
AFM DR on demand 337
fileset states 189 deadlocks 333, 336, 337
monitoring commands 194 default parameter value 482
monitoring policies 199 deployment problem determination 339, 342, 343, 346
AFM logs 279 deployment troubleshooting 339, 342, 343, 346
AIX 386 determining the health of integrated SMB server 452
AIX platform 297 disaster recovery issues 477
application calls 357 discrepancy between GPFS configuration data and the
application program errors 365, 366 on-disk data for a file system 392
audit messages 258 disk accessing commands fail to complete 415
Authentication disk connectivity failure
error events 437 failover to secondary server 487
errors 437 disk failure 424
authentication issues 436 disk information commands 317–319, 326, 327, 329
authentication on Linux 307 disk media failure 422, 423
authorization issues 353, 438 disk media failures 415
automatic failure 543 disk recovery 412
automatic recovery 543 displaying NSD information 408
automount fails to mount on AIX 386 dumps 255
automount fails to mount on Linux 385 encryption issues 435
automount failure 386 encryption rules 435
automount failure in Linux 385 error creating internal storage 250
Automount file system 385 error encountered while creating and using NSD disks
Automount file system does not mount 385 407
back up data 245 error log 275, 276
call home entries 308 error message "Function not implemented" 387
CES NFS error message for file system 379, 380
failure 443 error messages 339
network failure 443 error numbers 380
CES tracing error numbers for GPFS application calls 392
debug data collection 287–293 error numbers specific to GPFS application calls 380,
checking Persistent Reserve 426 404, 424
cipherList option has not been set properly 390 Error numbers specific to GPFS application calls 384
clearing a leftover Persistent Reserve reservation 426 error numbers specific to GPFS application calls when
client nodes 380 data integrity may be corrupted 404
cluster configuration error numbers when a file system mount is unsuccessful
issues 354, 355 380
recovery 355 errors associated with filesets 393
cluster crash 351 errors associated with policies 393
cluster name 389 errors associated with storage pools 393
cluster state information 312–316 errors encountered 399
clusters with SELinux enabled and enforced 471 errors encountered while restoring a snapshot 402
collecting details of issues 278 errors encountered with filesets 398
command 313 errors encountered with policies 396
commands 312–316 errors encountered with storage pools 399
connectivity problems 354 events 17, 559, 565, 570, 573, 578, 582, 584, 593,
contact node address 388 597, 602, 613, 615, 634, 645, 646, 648, 650, 652, 660,
contact nodes down 389 671, 674, 675, 685, 687, 691, 693, 708, 710
core dumps 281 failure analysis 339
corrupted data integrity 404 failure group considerations 383
create script 17 failures using the mmbackup command 403
creating a file 435 file system block allocation type 485
data always gathered 296 file system commands 317–319, 326, 327, 329
data gathered file system does not mount 387
Object on Linux 304 file system fails to mount 377
data gathered for CES on Linux 306 file system fails to unmount 381
data gathered for core dumps on Linux 309 file system forced unmount 382
data gathered for hadoop on Linux 308 file system is forced to unmount 384
data gathered for performance 309 file system is known to have adequate free space 393
data gathered for protocols on Linux 303, 304, 306–309 file system is mounted 391

IBM Storage Scale (continued) IBM Storage Scale (continued)
file system manager appointment fails 392 logs (continued)
file system manager failures 392 protocol service logs 263
file system mount problems 379, 380 syslog 258
file system mount status 391 lsof command 318
file system mounting on wrong drive 250 maintenance commands 492
file system update 495 maintenance commands execution 493
File system utilization 488 manually disabling Persistent Reserve 428
file systems manager failure 391, 392 manually enabling Persistent Reserve 428
fileset update 495 master snapshot 302
filesets usage errors 397 message 6027-648 249
Ganesha NFSD process not running (nfsd_down) 447 message referring to an existing NSD 410
GPFS cluster security configurations 388 message requeuing in AFM 404
GPFS commands unsuccessful 364 message severity tags 728
GPFS configuration parameters 481 messages 728
GPFS daemon does not start 361 mixed OS levels 342
GPFS daemon issues 358, 360, 361 mmafmctl Device getstate 313
GPFS declared NSD is down 410 mmapplypolicy -L 0 command 320
GPFS disk issues 407, 429 mmapplypolicy -L 1 command 321
GPFS down on contact nodes 389 mmapplypolicy -L 2 command 321
GPFS error message 379 mmapplypolicy -L 3 command 322
GPFS error messages 392, 401 mmapplypolicy -L 4 command 323
GPFS error messages for disk media failures 424 mmapplypolicy -L 5 command 324
GPFS error messages for file system forced unmount mmapplypolicy -L 6 command 324
problems 384 mmapplypolicy -L command 320–324
GPFS error messages for file system mount status 391 mmapplypolicy command 319
GPFS error messages for mmbackup errors 403 mmdumpperfdata command 310
GPFS failure mmfileid command 327
network issues 375 MMFS_DISKFAIL 275
GPFS file system issues 377 MMFS_ENVIRON
GPFS has declared NSDs built on top of AIX logical error log 275
volume as down 413 MMFS_FSSTRUCT error log 275
GPFS is not running on the local node 390 MMFS_GENERIC error log 275
GPFS modules MMFS_LONGDISKIO 276
unable to load on Linux 357 mmfsadm command 312
gpfs.gpfsbin mmhealth 13, 18, 22, 24
issues 356 mmlscluster command 314
gpfs.snap mmlsconfig command 315
gpfs.snap command 298 mmlsmount command 318
Linux platform 298 mmrefresh command 315
gpfs.snap command mmremotecluster command 389
usage 295 mmsdrrestore command 316
guarding against disk failures 412 mmwindisk command 326
GUI logs 294 mount 385, 387
hang in mmpmon 498 mount failure 380
HDFS transparency log 270 mount failure as the client nodes joined before NSD
health monitoring 13, 18, 22 servers 380
healthcheck component 50 mount failure for a file system 435
hints and tips for problem determination 247 mounting cluster does not have direct access to the
hosts file issue 352 disks 390
IBM Storage Protect messages 403 multiple file system manager failures 391, 392
incorrect output from mmpmon 498 negative values occur in the 'predicted pool utilizations',
installation and configuration issues 339, 350–352, 395
354, 357, 358, 360–362, 364–366, 382 net use on Windows fails 457
installation toolkit issues 347–350, 352, 369, 370 network issues 375
installing 282 Network issues
installing on Linux nodes 282 mmnetverify command 375
key rewrap 436 newly mounted windows file system is not displayed
limitations 253 249
logical volume 248 NFS
logical volumes are properly defined for GPFS use 413 client access exported data 449
logs client cannot mount NFS exports 449
GPFS log 256, 258 client I/O temporarily stalled 449
NFS logs 263 error events 447
IBM Storage Scale (continued)
NFS (continued) remote file system I/O fails with "Function not
error scenarios 449 implemented" error 387
errors 447, 449 remote file system I/O failure 387
NFS client remote mounts fail with the "permission denied"error
client access exported data 449 391
client cannot mount NFS exports 449 remote node expelled from cluster 382
client I/O temporarily stalled 449 replicated metadata 423
NFS failover 452 replicated metadata and data 422
NFS is not active (nfs_not_active) 447 replication setting 494
NFS on Linux 304 requeuing of messages in AFM 404
NFS problems 443 RPC statd process is not running (statd_down) 447
NFS V4 issues 405 security issues 436, 438
nfs_not_active 447 set up 281
nfsd_down 447 setup issues while using mmpmon 498
no replication 423 SHA digest 329
NO_SPACE error 393 SMB
NSD and underlying disk subsystem failures 407 access issues 459, 460
NSD creation fails 410 error events 458
NSD disk does not have an NSD server specified 390 errors 458
NSD server 380 SMB client on Linux fails 455
NSD server failure 486 SMB service logs 263
NSD-to-server mapping 484 snapshot 495
offline mmfsck command failure 250 snapshot directory name conflict 402
old NFS inode data 443 snapshot problems 400
operating system error logs 274–277 snapshot status errors 401
operating system logs 274–277 snapshot usage errors 400, 401
other problem determination tools 331 some files are 'ill-placed' 395
partial disk failure 413 SSSD process not running (sssd_down) 437
Password invalid 455 stale inode data 443
performance issues 496 statd_down 447
permission denied error message 391 storage pools usage errors 398
permission denied failure 436 strict replication 423
Persistent Reserve errors 425 support for troubleshooting 555
physical disk association 248 System error 59 457
policies 395 System error 86 457
Portmapper port 111 is not active (portmapper_down) system health monitoring 50
447 system load 249
portmapper_down 447 threshold monitoring 22, 24
prerequisites 339 timeout executing function error message 250
problem determination 247, 339, 342, 343, 346 trace facility 283
problems while working with Samba 454 trace reports 283
problems with locating a snapshot 400 traces 255
problems with non-IBM disks 415 tracing the mmpmon command 499
protocol service logs troubleshooting
object logs 266 best practices 246, 247
QoSIO operation classes 483 collecting issue details 255, 256
quorum loss 364 getting started 245
quorum nodes 248 UID mapping 387
quorum nodes in cluster 248 unable to access disks 411
RAS events 559, 565, 570, 573, 578, 582, 584, 593, unable to determine if a file system is mounted 391
597, 602, 613, 615, 634, 645, 646, 648, 650, 652, 660, unable to resolve contact node address 388
671, 674, 675, 685, 687, 691, 693, 708, 710 understanding Persistent Reserve 425
RDMA atomic operation issues 371 unused underlying multipath device by GPFS 428
recovery procedures 543 upgrade failure 544
remote cluster name 389 upgrade issues 370
remote cluster name does not match with the cluster upgrade recovery 544
name 389 usage errors 395
remote command issues 353, 354 user does not exists 455
remote file system 387, 388 value too large failure 435
remote file system does not mount 387, 388 VERBS RDMA
remote file system does not mount due to differing GPFS inactive 490
cluster security configurations 388 volume group on each node 414
volume group varyon problems 414
IBM Storage Scale (continued) K
warning events 49
warranty and maintenance 247 KB_ALLOCATED attribute 324
Winbind process not running (wnbd_down) 437 kdb 331
winbind service logs 269 KDB kernel debugger 331
Windows 301 kernel module
Windows issues 366, 367 mmfslinux 357
Wrong version of SMB client 455 kernel panic 429
YPBIND process not running (yp_down) 437 kernel threads
IBM Storage Scale collector node at time of system hang or panic 331
administration 215 key rewrap failure 436
IBM Storage Scale commands
mmpmon 62
IBM Storage Scale information units xxi
L
IBM Storage Scale mmdiag command 313 Linux kernel
IBM Storage Scale support for SNMP 211 configuration considerations 352
IBM Storage Scale time stamps 255 crash dump facility 331
IBM Storage Scale command Linux on Z 352
mmafmctl Device getstate 313 logical volume
IBM Z 352 location 248
ill-placed files 395, 399 Logical Volume Manager (LVM) 412
ILM logical volumes 413
problems 393 logs
improper mapping 484 GPFS log 256, 258
initial sensor data poll configuration 113 NFS logs 263
inode data object logs 266
stale 443 protocol service logs 263
inode limit 277 SMB logs 263
installation and configuration issues 339 syslog 258
installation problems 350 Winbind logs 269
installation toolkit issues logs IBM Storage Scale
Ansible issue 347 performance monitoring logs 278
config populate 350 long waiters
ESS related 370 increasing the number of inodes 496
PKey error 348 LROC 241
precheck failure 370 lslpp command 556
Python issue 347 lslv command 248
python-dns conflict while deploying object packages lsof command 318, 381, 382
352 lspv command 414
setuptools issue 369 lsvg command 413
ssh agent error 349 lxtrace command 283, 312
systemctl timeout 349
Ubuntu 22.04 347
Ubuntu apt-get lock 349 M
Ubuntu dpkg database lock 349
maintenance commands
version lock issue 370
mmadddisk 492
yum warnings 348
mmapplypolicy 492
installing GPFS on Linux nodes
mmdeldisk 492
procedure for 282
mmrestripefs 492
installing MIB files on the collector and management node,
maintenance operations
SNMP 214
execution 493
installing Net-SNMP 211
management and monitoring, SNMP subagent 216
investigation performance problems 170
management node, installing MIB files 214
io_s 66
manually enabling or disabling Persistent Reserve 428
issues
maxblocksize parameter 379
mmprotocoltrace 293
MAXNUMMP 403
memory footprint
J change 110
memory shortage 275, 352
JSON message 6027-648 249
issues 523 message severity tags 728
junctions messages
deleting 397 6027-1941 352
metadata mmdf command 393, 413, 496
replicated 422, 423 mmdiag command 313
metadata block mmdsh command 354
replica mismatches 417 mmdumpperfdata 310
metadata block replica mismatches 416, 417 mmedquota command fails 249
metrics mmexpelnode command 316
performance monitoring 117, 120, 136, 140, 143, 144, mmfileid command 327, 404, 415
146, 148 MMFS_ABNORMAL_SHUTDOWN
performance monitoring tool error logs 275
defining 154 MMFS_DISKFAIL
MIB files, installing on the collector and management node error logs 275
214 MMFS_ENVIRON
MIB objects, SNMP 218 error logs 275
MIGRATE rule 320, 322 MMFS_FSSTRUCT
migration error logs 275
file system does not mount 378 MMFS_GENERIC
new commands do not run 363 error logs 275
mmadddisk command 392, 398, 413, 415, 422 MMFS_LONGDISKIO
mmaddnode command 248, 351 error logs 276
mmafmctl command 404 MMFS_QUOTA
mmafmctl Device getstate command 313 error log 276
mmapplypolicy -L 0 320 error logs 276, 325
mmapplypolicy -L 1 321 MMFS_SYSTEM_UNMOUNT
mmapplypolicy -L 2 321 error logs 277
mmapplypolicy -L 3 322 MMFS_SYSTEM_WARNING
mmapplypolicy -L 4 323 error logs 277
mmapplypolicy -L 5 324 mmfs.log 358, 359, 361, 377, 380, 382, 386–390, 555
mmapplypolicy -L 6 324 mmfsadm command 287, 312, 360, 404, 415, 497
mmapplypolicy command 319, 395, 396, 399, 436 mmfsck command
mmaudit failure 250
failure 523 mmfsd
file system 523 will not start 358
mmauth command 329, 388 mmfslinux
mmbackup command 403 kernel module 357
mmccr command mmgetstate command 313, 359, 363
failure 250 mmhealth
mmchcluster command 353 monitoring 25, 44, 47
mmchconfig 417 states 233
mmchconfig command 315, 358, 382, 391 mmlock directory 354
mmchdisk command 378, 392, 398, 399, 407, 410, 415, mmlsattr command 396, 397
425 mmlscluster 215
mmcheckquota command 276, 325, 366, 383 mmlscluster command 248, 314, 351, 389
mmchfs command 277, 357, 363, 378, 380, 383, 385, 454, mmlsconfig command 283, 315, 385
497 mmlsdisk command 364, 377, 378, 383, 392, 407, 410,
mmchnode 215, 216 415, 425, 556
mmchnode command 248 mmlsfileset command 397
mmchnsd command 407 mmlsfs command 379, 422, 424, 556
mmchpolicy mmlsmgr command 312, 378
issues with adding encryption policy 435 mmlsmount command 318, 358, 365, 377, 381, 382, 407
mmcommon 385, 386 mmlsnsd command 326, 408, 413
mmcommon breakDeadlock 337 mmlspolicy command 396
mmcommon recoverfs command 393 mmlsquota command 365, 366
mmcommon showLocks command 355 mmlssnapshot command 400–402
mmcrcluster command 248, 315, 351, 353, 358 mmmount command 318, 377, 383, 415
mmcrfs command 363, 364, 407, 415, 454 mmperfmon 110, 150, 158, 170
mmcrnsd command 407, 410 mmperfmon command 154
mmcrsnapshot command 400, 402 mmpmon
mmdefedquota command fails 249 abend 498
mmdeldisk command 392, 398, 413, 415 adding nodes to a node list 68
mmdelfileset command 397 altering input file 498
mmdelfs command 424 concurrent processing 62
mmdelnode command 351, 363 concurrent usage 498
mmdelnsd command 410, 424 counters 104
mmdelsnapshot command 401 counters wrap 498
mmpmon (continued)
deleting a node list 69 return codes 104
deleting nodes from node list 71 rhist 76
disabling the request histogram facility 81 rhist nr 79, 80
displaying a node list 70 rhist off 81
displaying statistics 62, 65, 86, 89, 91 rhist on 82
dump 499 rhist p 82, 83
enabling the request histogram facility 82 rhist r 87
examples rhist reset 85
failure 73 rhist s 86
fs_io_s 64, 72 rpc_s 88–90
io_s 66 rpc_s size 91, 92
nlist add 68 setup problems 498
nlist del 69 size ranges 77
nlist s 71 source 98
node shutdown and quorum loss 74 specifying new ranges 79
once 98 unsupported features 498
reset 76 version 93, 94
rhist nr 80 mmpmon command
rhist off 81 trace 499
rhist on 82 mmprotocoltrace command
rhist p 83 issues 293
rhist r 87 mmquotaoff command 366
rhist reset 85 mmquotaon command 366
rpc_s 90 mmrefresh command 315, 378, 385
rpc_s size 92 mmremotecluster command 329, 388–390
source 98 mmremotefs command 385, 389
ver 94 mmrepquota command 366
failure 73 mmrestorefs command 401, 402
fs_io_s mmrestripefile 417
aggregate and analyze results 95 mmrestripefile command 396, 399
hang 498 mmrestripefs command 399, 413, 423
histogram reset 85 mmrpldisk command 392, 398, 415
I/O histograms 77 mmsdrcommand
I/O latency ranges 78 Broken cluster recovery 553
incorrect input 498 mmsdrrestore command 316
incorrect output 498 mmshutdown 216
input 60 mmshutdown command 314, 315, 359–361, 386
interpreting results 94 mmsnapdir command 400, 402
interpreting rhist results 97 mmstartup command 358, 386
io_s mmtracectl command
aggregate and analyze results 95 generating GPFS trace reports 283
latency ranges 78 mmumount command 381, 383, 413
miscellaneous information 103 mmunlinkfileset command 397
multiple node processing 62 mmwatch
new node list 70 troubleshooting 531
nlist 62, 67, 71 mmwatch status 238
nlist add 68 mmwindisk command 326
nlist del 69 mode of AFM fileset, changing 250
nlist failures 74 MODIFICATION_TIME attribute 324
nlist new 70 module is incompatible 358
nlist s 70, 71 monitor 237, 241
nlist sub 71 monitor file audit logging
node list facility 67, 71 GUI 235
node list failure values 74 monitoring
node list show 70 AFM and AFM DR using GUI 200
node shutdown and quorum loss 74 clustered watch folder 237
once 98 local read-only cache 241
output considerations 103 mmhealth 234
overview 59 performance 55
request histogram 76 mount
request histogram facility pattern 82 problems 380
reset 75, 76 mount command 377, 378, 380, 415, 424
restrictions 498 mount error (127)
mount error (127) (continued) NFS V4
Permission denied 456 problems 405
mount error (13) NFS, SMB, and Object logs 263
Permission denied 456 no replication 423
mount failure 435 NO SUCH DIRECTORY error code 361
mount on Mac fails with authentication error NO SUCH FILE error code 361
mount_smbfs: server rejected the connection: NO_SPACE
Authentication error 456 error 393
mount.cifs on Linux fails with mount error (13) node
Permission denied 456 crash 557
mounting cluster 390 hang 557
Mounting file system rejoin 381
error messages 379 node configuration information, SNMP 220
Multi-Media LAN Server 255 node crash 351
Multiple threshold rule node failure 429
Use case 29, 33, 36, 37, 39, 43 Node health state monitoring
use case 25
node reinstall 351
N node status information, SNMP 220
Net-SNMP nodes
configuring 212 cannot be added to GPFS cluster 351
installing 211 non-quorum node 248
running under SNMP master agent 216 NSD
traps 225 creating 410
network deleting 410
performance displaying information of 408
Remote Procedure Calls (RPCs) 55 extended information 409
network failure 375 failure 407
Network failure NSD build 413
mmnetverify command 375 NSD disks
network problems 275 creating 407
NFS using 407
failover NSD failure 407
warning event 452 NSD server
problems 443 failover to secondary server 486, 487
warning nsdServerWaitTimeForMount
nfs_exported_fs_down 49 changing 380
nfs_exports_down 49 nsdServerWaitTimeWindowOnMount
warning event changing 380
nfs_unresponsive 452 NT STATUS LOGON FAILURE
NFS client SMB client on Linux fails 455
with stale inode data 443
NFS client cannot mount exports O
mount exports, NFS client cannot mount 444
NFS error events 447 object
NFS error scenarios 449 logs 266
NFS errors Object 303
Ganesha NFSD process not running (nfsd_down) 447 object IDs
NFS is not active (nfs_not_active) 447 SNMP 218
nfs_not_active 447 object metrics
nfsd_down 447 proxy server 106, 114, 152
Portmapper port 111 is not active (portmapper_down) Object metrics
447 Performance monitoring 150
portmapper_down 447 open source tool
NFS logs 263 Grafana 158
NFS mount on client times out OpenSSH connection delays
NFS mount on server fails 444 Windows 375
Permission denied 444 orphaned file 398
time out error 444
NFS mount on server fails
access type is one 444
P
NFS mount on server fails partitioning information, viewing 326
no such file or directory 444 password must change 456
protocol version not supported by server 444
performance permission denied failure (key rewrap) 436
monitoring 55, 59 Persistent Reserve
network checking 426
Remote Procedure Calls (RPCs) 55 clearing a leftover reservation 426
performance issues errors 425
caused by the low-level system components 479 manually enabling or disabling 428
due to high utilization of the system-level components understanding 425
479 ping command 354
due to improper system level settings 481 PMR 557
due to long waiters 479 policies
due to networking issues 480 DEFAULT clause 395
due to suboptimal setup or configuration 481 deleting referenced objects 395
performance monitoring errors 396
API keys 153 file placement 395
Grafana 158, 176 incorrect file placement 396
GUI performance monitoring issues 504 LIMIT clause 395
log 278 long runtime 396
metrics 117, 120, 136, 140, 143, 144, 146, 148 MIGRATE rule 395
mmperfmon query 158 problems 393
performance monitoring through GUI 158 rule evaluation 395
queries 171 usage errors 395
Performance monitoring verifying 319
Object metrics 150 policy file
pmsensor node 111 detecting errors 320
singleton node size limit 395
Automatic update 111 totals 321
performance monitoring bridge 176 policy rules
performance monitoring sensors 112 runtime problems 396
performance monitoring tool POOL_NAME attribute 324
AFM metrics 136 possible GPFS problems 339, 377, 407
cloud service metrics 146 predicted pool utilization
configuring incorrect 395
Automated configuration 107 primary NSD server 380
Manual configuration 109 problem
configuring the sensor locating a snapshot 400
Automated configuration 107 not directly related to snapshot 400
File-managed configuration 107 snapshot 400
GPFS-managed configuration 107 snapshot directory name 402
Manual configuration 107 snapshot restore 402
cross protocol metrics 146 snapshot status 401
CTDB metrics 144 snapshot usage 400
GPFS metrics 120 snapshot usage errors 400
GPFSFCM metrics 148 problem determination
Linux metrics 117 cluster state information 312
manual restart 154 remote file system I/O fails with the "Function not
metrics implemented" error message when UID mapping is
defining 154 enabled 387
NFS metrics 140 tools 317
object metrics 140 tracing 283
overview 105 Problem Management Record 557
pmcollector problems
migrating 116 configuration 350
protocol metrics 140 installation 350
queries 171 mmbackup 403
SMB metrics 143 problems running as administrator, Windows 367
start 153, 154 protocol (CIFS serving), Windows SMB2 367
stop 153, 154 protocol authentication log 270
Performance monitoring tool database protocol service logs
cleanup 154 NFS logs 263
expired keys 154 object logs 266
view expired keys 154 SMB logs 263
performance problems investigation 170 winbind logs 269
permission denied Protocols 436
remote mounts failure 391 proxies
proxies (continued) recovery
performance monitoring tool 105 cluster configuration data 355
proxy server recovery log 429
object metrics 106, 114, 152 recovery procedure
python 235 restore data and system configuration 543
recovery procedures 543
recreation of GPFS storage file
Q mmchcluster -p LATEST 355
QoSIO operation classes remote command problems 353
low values 483 remote file copy command
queries default 353
performance monitoring 171 remote file system
quorum mount 388
disk 364 remote file system I/O fails with "Function not implemented"
loss 364 error 387
quorum node 248 remote mounts fail with permission denied 391
quota remote node
cannot write to quota file 383 expelled 382
denied 365 remote node expelled 382
error number 357 Remote Procedure Calls (RPCs)
quota files 325 network performance 55
quota problems 276 remote shell
default 353
remotely mounted file systems 524
R Removing a sensor 110
removing the setuid bit 362
RAID controller 412
replica
raise custom events 19
mismatch 419
RAS events
mismatches 416–420
AFM events 559, 615
replica mismatches 416, 418–420
authentication events 565
replicated
Call Home events 570
metadata 423
CES Network events 573
replicated data 422
CESIP events 578
replicated metadata 422
cluster state events 578
replication
disk events 582
of data 412
Enclosure events 584
replication setting 494
Encryption events 593
replication, none 423
File audit logging events 597
report problems 247
file system events 602
reporting a problem to IBM 312
Filesysmgr events 613
request returns the current values for all latency ranges
GPFS events 615
which have a nonzero count.IBM Storage Scale 87
GUI events 634
resetting of setuid/setgids at AFM home 251
hadoop connector events 645
resolve events 246
HDFS data node events 645
restore data and system configuration 543
HDFS name node events 646
restricted mode mount 318
keystone events 648
return codes, mmpmon 104
Local cache events 650
RPC
network events 652
method 371
NFS events 660
RPC statistics
Nvme events 671
aggregation of execution time 89
NvmeoF events 674
RPC execution time 91
object events 675
rpc_s size 88
performance events 685
RPCs (Remote Procedure Calls)
Server RAID events 687
network performance 55
SMB events 687
rpm command 556
stretch cluster events 691
rsh
TCT events 693
problems using 353
Threshold events 708
rsh command 353, 363
Watchfolder events 710
rsyslog 236
rcp command 353
RDMA atomic operation issues
RDMA issue 371 S
read-only mode mount 318
Samba
Samba (continued) SNMP (continued)
client failure 454 management and monitoring subagent 216
scp command 353 MIB objects 218
Secure Hash Algorithm digest 329 Net-SNMP traps 225
selinux 236 node configuration information 220
sensors node status information 220
performance monitoring tool object IDs 218
configuring 106, 114 starting and stopping the SNMP subagent 216
serving (CIFS), Windows SMB2 protocol 367 storage pool information 223
set up SNMP IBM Storage Scale 218
core dumps 281 spectrumscale installation toolkit
Setting up Ubuntu for capturing crash files 283 configuration for debugging 282
setuid bit, removing 362 core dump data 282
setuid/setgid bits at AFM home, resetting of 251 failure analysis 339
severity tags prerequisites 339
messages 728 supported configurations 343, 346
SHA digest 329, 388 supported setups 343
shared segments upgrade support 346
problems 360 ssh command 353
singleton node statistics
Automatic update 111 network performance 55
SLES upgrade status
file conflict 370 mmwatch 238
slow SMB access due to contended access to same files or status description
directories 460 Cloud services 713
SMB steps to follow
logs 263 GPFS daemon does not come up 358
warning storage pool information, SNMP 223
smb_exported_fs_down 49 storage pools
smb_exports_down 49 deleting 395, 399
SMB access issues 459, 460 errors 399
SMB client on Linux fails 456 failure groups 398
SMB error events 458 problems 393
SMB errors 458 slow access time 399
SMB on Linux 303 usage errors 398
SMB server 452 strict replication 423
SMB2 protocol (CIFS serving), Windows 367 subnets attribute 382
snapshot suboptimal performance 481–486, 492, 494
directory name conflict 402 suboptimal system performance 486–488, 490, 492–495
error messages 400, 401 support for troubleshooting
invalid state 401 call home 558
restoring 402 contacting IBM support center
status error 401 how to contact IBM support center 557
usage error 400 information to be collected before contacting IBM
valid 400 support center 555
snapshot problems 400 support notifications 246
SNMP syslog 258
cluster configuration information 219 syslog facility
cluster status information 219 Linux 274
collector node administration 215 syslogd 387
configuring Net-SNMP to work with IBM Storage Scale system health
212 GUI
configuring SNMP-based management applications 213 overview 1
disk configuration information 224 system load 249
disk performance information 224 system snapshots 295
disk status information 223 system storage pool 395, 398
file system performance information 222
file system status information 221
GPFS support 211
T
IBM Storage Scale support 211 tail 233
installing MIB files on the collector and management The swift-object-info output does not display 471
node 214 threads
installing Net-SNMP on the collector node of the IBM tuning 352
Storage Scale 211 waiting 497
threshold monitoring trace (continued)
active threshold monitor 24 SMB locks 286
predefined thresholds 22 SP message handling 286
prerequisites 22 super operations 286
user-defined thresholds 22 tasking system 286
Threshold monitoring token manager 286
use case 29, 33, 36, 37, 39, 43 ts commands 284
Threshold rules vdisk 286
Create 29, 33, 36, 37, 39, 43 vdisk debugger 286
tiering vdisk hospital 286
audit events 726 vnode layer 286
time stamps 255 trace classes 284
tip events 539 trace facility
trace mmfsadm command 312
active file management 284 trace level 287
allocation manager 284 trace reports, generating 283
basic classes 284 tracing
behaviorals 286 active 494
byte range locks 284 transparent cloud tiering
call to routines in SharkMsg.h 285 common issues and workarounds 519
checksum services 284 troubleshooting 519
cleanup routines 284 transparent cloud tiering logs
cluster configuration repository 284 collecting 278
cluster security 286 traps, Net-SNMP 225
concise vnop description 286 traversing a directory that has not been cached 251
daemon routine entry/exit 284 troubleshooting
daemon specific code 286 AFM DR issues 515
data shipping 285 best practices
defragmentation 284 report problems 247
dentry operations 284 resolve events 246
disk lease 285 support notifications 246
disk space allocation 284 update software 246
DMAPI 285 capacity information is not available in GUI pages 509
error logging 285 CES 272
events exporter 285 CES NFS core dump 283
file operations 285 Cloud services 713
file system 285 collecting issue details 255
generic kernel vfs information 285 disaster recovery issues
inode allocation 285 setup problems 477
interprocess locking 285 getting started 245
kernel operations 285 GPUDirect Storage 431
kernel routine entry/exit 285 GUI fails to restart 502
low-level vfs locking 285 GUI fails to start 501, 503
mailbox message handling 285 GUI is displaying outdated information 506
malloc/free in shared segment 285 GUI issues 501
miscellaneous tracing and debugging 286 GUI login page does not open 503
mmpmon 285 GUI logs 294
mnode operations 285 GUI performance monitoring issues 504
mutexes and condition variables 285 logs
network shared disk 285 GPFS log 256, 258
online multinode fsck 285 syslog 258
operations in Thread class 286 mmwatch 531
page allocator 286 performance issues
parallel inode tracing 286 caused by the low-level system components 479
performance monitors 285 due to high utilization of the system-level
physical disk I/O 285 components 479
physical I/O 285 due to improper system level settings 481
pinning to real memory 286 due to long waiters 479
quota management 286 due to networking issues 480
rdma 286 due to suboptimal setup or configuration 481
recovery log 285 recovery procedures 543
SANergy 286 server was unable to process the request 506
scsi services 286 support for troubleshooting
shared segments 286 call home 558
troubleshooting (continued)
transparent cloud tiering 519
warranty and maintenance 247
troubleshooting errors 366
troubleshooting Windows errors 366
tuning 352

U
UID mapping 387
umount command 382, 383
unable to start GPFS 360
underlying multipath device 428
understanding, Persistent Reserve 425
unsuccessful GPFS commands 362
upgrade
NSD nodes not connecting 370
regular expression evaluation 370
usage errors
policies 395
useNSDserver attribute 412
USER_ID attribute 324
using the gpfs.snap command 295

V
v 353
value too large failure 435
varyon problems 414
varyonvg command 415
VERBS RDMA
inactive 490
viewing disks and partitioning information 326
volume group 414

W
warranty and maintenance 247
Webhook
JSON 47
Winbind
logs 269
Windows
data always gathered 301
file system mounted on the wrong drive letter 250
gpfs.snap 301
Home and .ssh directory ownership and permissions
366
mounted file systems, Windows 249
OpenSSH connection delays 375
problem seeing newly mounted file systems 249
problem seeing newly mounted Windows file systems
249
problems running as administrator 367
Windows 250
Windows issues 366
Windows SMB2 protocol (CIFS serving) 367

IBM®

Product Number: 5641-DM1


5641-DM3
5641-DM5
5641-DA1
5641-DA3
5641-DA5
5737-F34
5737-I39
5765-DME
5765-DAE

SC28-3476-02
