Slow Drain Detection in Cisco
Contents
Scope
Introduction
Understanding Storage Area Network Congestion
Congestion from End Devices
Congestion Between Switches
Congestion in Switches
Introduction to Slow Drain
Cisco Solution
Background: Flow Control in Fibre Channel
Example: Slow Drain
Slow-Drain Detection
Credit Unavailability at Microseconds: TxWait Period for Frames
Credit Unavailability at Milliseconds: slowport-monitor
Credit Unavailability at 100 ms
Link Event LR Rcvd B2B on Fibre Channel Ports
Credits and Remaining Credits
Credit Transition to Zero
Defining Slow Port
Defining Stuck Port
Automatic Recovery from Slow Drain
Virtual Output Queues
SNMP Trap
congestion-drop Timeout
no-credit-drop Timeout
Recommended Timeout Values for congestion-drop and no-credit-drop
Credit Loss Recovery
Port Flap or Error-Disable
Slow-Drain Detection and Automatic Recovery Advantage of 16-Gbps Platforms
Troubleshooting Slow Drain
Information About Dropped Frames
Display Frame Queued on Ingress Ports
Display Arbitration Timeouts
Display Timeout Discards
Onboard Failure Logging
TxWait History Graph
Slow-Drain Troubleshooting Methodology
Levels of Performance Degradation
Finding Congestion Sources
Generic Guidelines
Detecting and Troubleshooting Slow Drain with Cisco Data Center Network Manager
Using DCNM to Detect Slow-Drain Devices
Summary
Conclusion
Appendix A: Slow-Drain Detection and Automatic Recovery with Port Monitor
Configuration Example
Appendix B: Difference Between TxWait, slowport-monitor, and Credit Unavailability at 100 ms
Appendix C: Cisco MDS 9000 Family Slow-Drain detection and Troubleshooting Commands
TxWait Period for Frames
slowport-monitor
Credit Unavailability at 100 ms
LR Rcvd B2B
Credits and Remaining Credits
Credit Transition to Zero
Dropped-Frame Information
Display Frames in Ingress Queue
Arbitration Timeouts
Check for Transmit Frame Drops (Timeout Discard)
Credit Loss Recovery
Appendix D: Cisco MDS 9000 Family Slow-Drain-Specific SNMP MIBs
Appendix E: Cisco MDS 9000 Family Slow-Drain Feature Support Matrix
Appendix F: Cisco MDS 9000 Family Counter Names and Descriptions
The concept of congestion in SANs, especially slow drain, the most severe type of congestion.
Understanding the basic concepts helps you effectively solve the problem.
The architectural benefits of Cisco MDS 9000 Family switches. These are the only Fibre Channel switches
in the industry that provide consistent and predictable performance and prevent microcongestion problems,
such as head-of-line blocking. Cisco MDS 9000 Family switches build robust Fibre Channel networks.
Cisco MDS 9000 Family switches are the only Fibre Channel switches in the industry that provide automatic
recovery from slow drain even in large environments. You learn the holistic approach taken by Cisco to
detect, troubleshoot, and automatically recover from slow drain.
Troubleshooting methodology developed by Cisco while working on large SAN environments for more than a decade.
Enhancements to Cisco Data Center Network Manager (DCNM) for fabricwide detection and
troubleshooting of slow drain to help you find and fix problems within minutes using an intuitive web
interface.
Scope
This document covers all 16-Gbps Fibre Channel products in the Cisco MDS 9000 Family. Advanced 8-Gbps and 8-Gbps line cards on Cisco MDS 9500 directors are also covered. Details are listed in Table 1. Appendix C contains various commands that can be used on these platforms. A feature support matrix across platforms and Cisco NX-OS Software releases is available in Appendix E. At the time of writing, this document recommends NX-OS Release 6.2(13) or later, and DCNM Release 7.2(1) or later.
Table 1.
Platforms Discussed and Supported in This Document Under Cisco MDS 9000 Family Switches
Model
16-Gbps Platforms
Cisco MDS 9700 Series Multilayer Directors with DS-X9448-768K9 line card
MDS 9396S
MDS 9148S
MDS 9250i
Cisco MDS 9500 Series Multilayer Directors with DS-X92xx-256K9 line cards
8-Gbps Platforms
Cisco MDS 9500 Series Multilayer Directors with DS-X9248-48K9 and DS-X92xx-96K9 line cards
Introduction
We are in the era of digitalization, mobility, social networking, and Internet of Everything (IoE). More and more
applications are being developed to support businesses. Many of the newer generation organizations are
functioning only through applications. These applications must perform at highest capacity, all the time. Data
(processing, reading, and writing) is the most important attribute of an application. Applications are hosted on
servers. Data is stored on storage arrays. Connectivity between servers and storage arrays is provided by SANs.
Fibre Channel (FC) is the most commonly deployed technology to build SANs. FC SANs must be free of
congestion so that application performance is at peak. If not, businesses are prone to huge revenue risk due to
stalled or poor-performing applications.
Congestion in FC SANs has always been the highest concern for SAN administrators. The concern has become
even more severe due to the following reasons:
Adoption of 16-Gbps FC leading to heterogeneous speeds: The last few years have seen increased adoption of 16-Gbps FC. While newer devices at 16 Gbps are connected, older devices at 1-, 2-, 4-, or 8-Gbps FC still remain part of the same fabric. Servers and storage ports at different speeds sending data to each other tend to congest network links.
Data explosion leading to scaled out architectures: Application and data explosion is resulting in more
servers and storage ports. FC SANs are being scaled out. Collapsed core architectures are being scaled to
edge-core architectures. Edge-core architectures are being scaled to edge-core-edge architectures. Larger
networks have more applications that are impacted due to SAN congestion.
Legacy application and infrastructure: While newer high-performing applications and servers are being deployed, older and slower servers running legacy applications are still in use. This results in a common network being shared by fast and slow applications. SAN performance acceptable to a slower application may be completely unacceptable to a faster application.
Increased pressure on operating expenses (OpEx): Businesses are trying to find ways to increase their bottom lines. The pressure on OpEx has never been greater, and the push to fully utilize existing infrastructure keeps increasing. SANs must be free of congestion to keep application performance at peak.
Adoption of flash storage: More and more businesses are deploying flash storage for better application performance. Flash storage is several times faster than spinning disk and is pushing SANs to their limits. The existing links may not be capable of sustaining the bandwidth.
Cisco MDS 9000 Family switches have purposefully designed architecture, hardware, and software to keep FC SANs free of congestion. High performance is delivered by integrating the features directly with the switch port application-specific integrated circuit (ASIC). Operational simplicity is achieved by software enhancements made in Cisco NX-OS Software. Problems can be solved within minutes through the web-based, fabricwide, single-pane-of-glass visibility of Cisco Data Center Network Manager (DCNM). Overall, Cisco has taken a holistic approach to building robust and self-healing FC SANs. The details are provided in the following sections.
port channel. To alleviate such problems, Cisco recommends that ISLs always have bandwidth equal to or higher than that of the links connected to the end devices.
Lack of B2B Credits for the Length of the ISL
The number of buffer-to-buffer (B2B) credits should be carefully accounted for on long-distance ISLs.
Note:
B2B credits and Fibre Channel flow control are described in the following section, Flow Control in Fibre Channel.
The requirement of B2B credits on a Fibre Channel link increases with:
Increase in distance
Increase in speed
Table 2 provides the number of B2B credits required per kilometer of ISL length at different speeds and frame sizes; a worked example follows the table. Notice that one B2B credit is needed per frame irrespective of the frame size. Too few B2B credits can cause a performance impact if the receive B2B credits available on a port are close to the value calculated from Table 2. Consider using extended B2B credits or moving long-distance ISLs to other platforms in the Cisco MDS 9000 Family that have more B2B credits.
Table 2.       B2B Credits Required per Kilometer of ISL at Different Speeds and Frame Sizes

Frame Size | 1 Gbps    | 2 Gbps  | 4 Gbps  | 8 Gbps   | 10 Gbps  | 16 Gbps
512 bytes  | 2 BB/km   | 4 BB/km | 8 BB/km | 16 BB/km | 24 BB/km | 32 BB/km
1024 bytes | 1 BB/km   | 2 BB/km | 4 BB/km | 8 BB/km  | 12 BB/km | 16 BB/km
2112 bytes | 0.5 BB/km | 1 BB/km | 2 BB/km | 4 BB/km  | 6 BB/km  | 8 BB/km
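As a worked illustration using the values in Table 2 (the 10-km distance is hypothetical): a 10-km ISL running at 16 Gbps and carrying full-size 2112-byte frames needs roughly 10 km x 8 BB/km = 80 B2B credits to remain fully utilized, whereas the same link carrying 512-byte frames needs about 10 km x 32 BB/km = 320 B2B credits. If a port advertises fewer receive B2B credits than the value calculated this way, the link cannot be kept full even though bandwidth is available.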
Congestion in Switches
Fibre Channel switches available on the market today have hundreds of ports. These switches are expected to receive, process, and forward frames from any port to any port at line rate. Different vendors have different architectures. Some vendor switches have physical ports at 16 Gbps but can't switch frames at that speed on all ports and at all frame sizes. This results in severe performance degradation of applications. SAN administrators must understand the internal architecture before making a buying decision. They must ensure that the switches have been architected for:
Predictable performance between all ports, irrespective of what features are enabled
No head-of-line blocking
Centralized coordinated forwarding between all ports, rather than each port acting on its own
If these factors are not considered well in advance, SAN administrators risk their networks to severe congestion
within a switch. Such problems cannot be solved in a production network. The only solution would be the
expensive approach of buying more switches or contracting professional services.
Cisco MDS 9000 Family switches have been architected to provide these benefits. Following are the unique
advantages that ensure that switches are always free of congestion.
No limitations on per-slot bandwidth: The Cisco MDS 9700 Series Multilayer Directors support up to 1.5-Tbps per-slot bandwidth today. This is twice the capacity required to support 48 line-rate ports at 16 Gbps. All ports on all slots are capable of sending line-rate traffic to all other ports using a nonblocking, non-oversubscribed design.
Centralized coordinated forwarding: Cisco MDS 9000 Family switches use a centrally arbitrated crossbar
architecture for frame forwarding between ports. Central arbitration ensures that frames are handed over to
an egress port only when it has enough transmit buffers available. After the arbiter grants the request, a
crossbar provides a dedicated data link between ingress and egress ports. There is never a situation when
frames are unexpectedly dropped in the switch.
Consistent and predictable performance: All frames from all ports are subject to central arbitrated
crossbar forwarding. This ensures that the end applications receive consistent performance irrespective of
where the server and storage ports are connected on a switch. There is no limitation of connecting the ports
to the same port group on the same module to receive lower latency. Also, performance is not degraded if
more features are enabled. Consistency and predictability lead to better designed and operated networks.
Store-and-forward architecture: Frames are received and stored completely in the ingress port buffers
before they are transmitted. This enables Cisco MDS 9000 Family switches to inspect the cyclic redundancy
check (CRC) field of a Fibre Channel frame and eventually drop them if the frames are corrupted. This
intrinsic behavior limits the failure domain to a port. Corrupt frames are not spread over the fabric. End
devices are not bombarded with corrupt frames.
Virtual output queues (VOQs): VOQs are the mechanism that prevents head-of-line blocking inside a Cisco MDS 9000 Family switch. Head-of-line blocking occurs when the frame at the head of a queue cannot be sent because of congestion at its output port, and the frames behind it are blocked from being sent to their destinations even though their respective output ports are not congested. Instead of a single queue, separate VOQs are maintained at all ingress ports. Frames destined to different ports are queued to separate VOQs. Individual VOQs can be blocked, but traffic queued for different (nonblocked) destinations can continue to flow without being delayed behind frames waiting for the blocking to clear on a congested output port (Figure 2). Cisco MDS 9000 Family switches support up to 4096 VOQs per port, allowing up to 1024 destination ports per chassis to be addressed, with 4 QoS levels.
Figure 2.
All these attributes are unique to the architecture of Cisco MDS 9000 Family switches.
The architecture of Cisco MDS 9000 Family switches is explained in detail in a white paper available on cisco.com: Cisco MDS 9000 Family Switch Architecture. This document focuses on congestion from end devices, especially slow drain.
Nongraceful virtual machine exit on a virtualized server, resulting in frames held in HBA buffers.
ISLs
Cisco Solution
Cisco has taken a holistic approach by providing features to detect, troubleshoot, and automatically recover from
slow drain situations. Detecting a slow-drain device is the first step, followed by troubleshooting, which enables
SAN administrators to take manual action of disconnecting an offending device. However, manual actions are
cumbersome and involve delay. To alleviate this limitation, Cisco MDS 9000 Family switches have intelligence to
constantly monitor the network for symptoms of slow drain and send alerts or take automatic recovery actions.
These actions include:
All the 16-Gbps MDS platforms (MDS 9700, MDS 9396S, MDS 9148S, and MDS 9250i) provide hardware-enhanced slow-drain features. These enhanced features are a direct benefit of the advanced capabilities of the port ASIC. Following is a summary of the advantages of the hardware-enhanced slow-drain features:
New feature called slowport-monitor, which maintains history of transmit credit unavailability duration on all
the ports at as low as 1 millisecond (ms)
Graphical display of credit unavailability duration on all the ports on a switch over last 60 seconds, 60
minutes, and 72 hours
Immediate automatic recovery from a slow-drain situation without any software delay
In addition to the hardware-enhanced slow-drain features on Cisco MDS 9000 Family switches, Cisco DCNM
provides slow-drain diagnostics from Release 7.1(1) and later. DCNM automates the monitoring of thousands of
ports in a fabric in a single pane of glass and provides visual representation in form of graphs showing fluctuation
in counters. This feature leads to faster detection of slow-drain devices, reduced false positives, and reduced
troubleshooting time from weeks to minutes.
When a link is established between two Fibre Channel devices, both neighbors inform each other about the number of receive buffers they have available. An N_Port connected to an F_Port exchanges B2B credit information through Fabric Login (FLOGI). An E_Port connected to another E_Port exchanges B2B credit information through Exchange Link Parameters (ELP). Before transmitting data frames, the transmitter sets its transmit (Tx) B2B credits equal to the receive (Rx) B2B credits advertised by the neighbor. This mechanism ensures that the transmitter never overruns the receive buffers of the receiver. For every transmitted frame, the remaining Tx B2B credits decrement by one. The receiver, after receiving the frame, is expected to return an explicit B2B credit in the form of R_RDY (Receiver_Ready, a Fibre Channel primitive) to the transmitter. A receiver typically does that after it has processed the frame and the receive buffer is available for reuse. The transmitter increments the remaining Tx B2B credits by one after receiving R_RDY. The transmitter does not increment the remaining Tx B2B credits if R_RDY is not received, either because the receiver did not send R_RDY or because the R_RDY was lost on the link. Multiple occurrences of such events eventually lead to a situation where the remaining Tx B2B credits on a port reach zero. As a result, no further frames can be transmitted. The Tx port resumes sending an additional frame only after receiving an R_RDY. This strategy prevents frames from being lost when the Rx port runs out of buffers (Rx B2B credits) and ensures that the receiver is always in control (Figure 4).
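As a simple illustration of this accounting (the credit count of 32 is only an example): if a receiver advertises 32 Rx B2B credits during FLOGI or ELP, the transmitter starts with 32 Tx B2B credits. After sending 32 frames without receiving any R_RDY, the remaining Tx B2B credits reach zero and transmission pauses. Each R_RDY that subsequently arrives increments the remaining Tx B2B credits by one and allows exactly one more frame to be sent.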
Figure 4.
Frames Not Transmitted in Fibre Channel if the Receiver Does Not Have Enough Buffers
Note:
The terms credit unavailability and zero remaining Tx/Rx B2B credits signify the same situation. This is also represented by the term delay on a port (meaning a delay in receiving R_RDY on a port or a delay in forwarding frames out of a port). These terms are used interchangeably in this document to convey the same meaning.
Fibre Channel defines two types of flow control (Figure 5):
Figure 5.
End-to-end flow control was never widely implemented. Buffer-to-buffer (B2B) flow control between every pair of neighbors ensures an end-to-end lossless fabric.
The remaining Tx B2B credits eventually fall to zero. As a result, Switch 2 does not send any further frames out of
Port F2. The data frames occupy all the egress buffers on Port F2. This generates an internal backpressure
towards Port E2 on switch 2. The data frames occupy all the ingress buffers on Port E2, and soon there are no
remaining Rx B2B credits available with Port E2 on switch 2 (Figure 7).
Figure 7.
Port E2 does not send R_RDY to Port E1 on switch 1. Data frames start occupying the egress buffers on Port E1,
which generates an internal backpressure towards Port F1 on switch 1. Data frames consume all the ingress
buffers on Port F1 leading to zero remaining Rx B2B credits. Port F1 stops sending R_RDY to Target 1 (Figure 8).
Figure 8.
Overall, the R_RDY flow control and the internal backpressure of the switches have signaled Target 1 to slow down. This is desirable behavior in a Fibre Channel network to make it lossless. However, it brings a side effect. In this example, when the remaining Tx B2B credits fall to zero on Port E1 on switch 1, backpressure is generated toward Port F1 as well as Port F11. Hence, not only does the Target 1 to Host 1 flow slow down, the Target 2 to Host 2 flow also slows down (Figure 9).
Figure 9.
Slow-Drain Situation
As a final result, just because one end device in the fabric became slow, all the flows sharing the same switches and ISLs were impacted. This situation is known as slow drain. Host 1 in the shown topology is called a slow-drain device.
A slow-drain situation can be compared to a traffic jam on a freeway caused by an internal jam in an adjacent city. Consider a freeway that connects multiple cities. If one of the adjacent cities has an internal traffic jam that is not resolved quickly enough, the traffic soon creates congestion on the freeway, consuming all the available lanes. The obvious effect of this jam is seen on the traffic going to and coming from the congested city. However, because the freeway is jammed, the effect is also seen on the traffic going to and coming from all other cities that use the same freeway, even though they may not be internally congested.
Slow-Drain Detection
Cisco MDS 9000 Family switches provide numerous features to detect slow-drain devices. This section explains
these features along with Cisco slow-drain terminology. Figure 10 provides summary of slow-drain detection
capabilities on Cisco MDS 9000 Family switches.
Figure 10.
Percentage of Tx B2B credit unavailability over the last 1 second, 1 minute, 1 hour, and 72 hours.
A history of TxWait is maintained for a longer duration in On-Board Failure Logging (OBFL) with time stamps.
Slowport-monitor and TxWait are new hardware-assisted features that are available on Cisco MDS 9000
Family switches. Both features are extremely powerful and should be preferred over other detection features.
TxWait, slowport-monitor, and credit unavailability at 100 ms are complementary features. All of them
should be used together for best results. Detailed comparison of these features is available in Appendix B.
However, if at least one frame is still queued (Figure 12), the Cisco MDS 9000 Family switch starts a 90 ms LR
Rcvd B2B timer. If the Fibre Channel frames can be transmitted to the egress port, then the LR Rcvd B2B timer is
canceled and an LRR message is sent back to the adjacent Fibre Channel device.
Figure 12.
Exchange of LR and LRR Primitives When at Least One Frame Is Still Occupying the Ingress Buffer
However, if the egress port remains congested and Fibre Channel frames are still queued at the ingress port, the
LR Rcvd B2B timer expires (Figure 13). No LRR is transmitted back to the adjacent Fibre Channel device, and both
the ingress port and the adjacent Fibre Channel device initiate a link failure by transmitting a Not Operational
Sequence (NOS) (a type of primitive sequence in Fibre Channel).
Figure 13.
This event is logged as LR Rcvd B2B or Link Reset failed nonempty recv queue. This event indicates severe slow-drain congestion, but the cause is not the port that failed. The potential problem lies with the port to which this port is switching frames on the same switch. If multiple ports on an MDS switch display this behavior, then most likely all of them are switching frames to the same egress port that is facing severe congestion. It can be an F_port or an E_port in a multiswitch environment.
SNMP Trap
Cisco MDS 9000 Family switches provide a Port Monitor feature, which monitors multiple counters at low
granularity. An SNMP trap is generated if any of these counters exceeds configured thresholds over a specified
duration. A Simple Network Management Protocol (SNMP) trap can trigger an external network management
system (NMS) to take an action or inform a SAN administrator for manual recovery. For more details about Port
Monitor, see Appendix A.
congestion-drop Timeout
A congested Fibre Channel fabric cannot deliver frames to the destination in a timely fashion. In this situation the
time spent by a frame in a Cisco MDS 9000 Family switch is much longer than the usual switching latency.
However, frames do not remain in a switch forever. Cisco MDS 9000 Family switches drop frames that have not
been delivered to their egress ports within a congestion-drop timeout. By default, congestion-drop timeout is
enabled and the value is set to 500 ms. Changing the congestion-drop timeout to a lower value can help drop
frames that have been stuck in the system more quickly. This action frees up the buffers faster in the presence of a
slow-drain device.
This value can be set at the switch level for port types E and F as described here:
MDS9700(config)# system timeout congestion-drop <value> mode {F | E}
MDS9700(config)# system timeout congestion-drop default mode {F | E}
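For example, a hypothetical configuration that lowers the congestion-drop timeout to 200 ms for F_ports (the value is only an illustration, not a recommendation) would look like this:
MDS9700(config)# system timeout congestion-drop 200 mode F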
Congestion-drop timeout is a switchwide recovery feature with the following attributes:
There is no differentiation between frames destined to slow devices and frames destined to healthy devices that are impacted by the congestion. Both kinds of frames may not be delivered to the destination in a timely manner and are subject to the congestion-drop timeout. The next level of recovery (provided by the no-credit-drop timeout) drops only the frames destined to a slow-drain device.
Dropping frames at the congestion-drop timeout is a reactive approach. The frames must reside in the switch for a duration longer than the configured congestion-drop timeout value. The next level of recovery (provided by the no-credit-drop timeout) makes the frame-drop decision based on Tx B2B credit unavailability on a port (which in turn leads to frames residing in the switch for a longer duration) instead of waiting for a timeout.
no-credit-drop Timeout
No-credit-drop timeout is a proactive mechanism available on Cisco MDS 9000 Family switches to automatically recover from slow drain. If Tx B2B credits are continuously unavailable on a port for a duration longer than the configured no-credit-drop timeout value, three actions are taken: all frames consuming the egress buffers of the port are dropped immediately; all frames queued at ingress ports that are destined for the port are dropped immediately; and, while the port remains at zero Tx B2B credits, any new frames received by other ports on the switch to be transmitted out of this port are dropped.
These three actions free buffer resources more quickly than in the normal congestion-drop timeout scenarios and
alleviate the problem on an ISL in the presence of a slow-drain device. Transmission of data frames resumes on
the port when Tx B2B credits are available.
The efficiency of automatic recovery due to the no-credit-drop timeout depends on the following factors:
How early can it be detected that the Tx B2B credits have been unavailable on a port for a duration longer than the no-credit-drop timeout? Only after detection can action be taken. In other words, how soon is the action (dropping of frames) triggered after detection?
What is the minimum timeout value that can be detected? At higher Fibre Channel speeds, even 100 ms is a long duration. In other words, how soon can the Tx B2B credit unavailability be detected?
How soon can it be detected that Tx B2B credits are available again on a port after a period of unavailability? This determines how soon data traffic can resume on the port.
Table 3 shows details of these factors on different platforms on Cisco MDS 9000 Family switches.
Table 3.
Factor | MDS 9500 | 16-Gbps Platforms
How soon is the action (dropping of frames) triggered after detection? | Up to 99 ms | Immediate
What is the minimum timeout value that can be configured? | 100 ms | 1 ms
How soon can Tx B2B credit unavailability be detected? | 100 ms | 1 ms
How early can it be detected that the Tx B2B credits are available on a port after a period of unavailability? | Up to 99 ms | Immediate
On MDS 9700, MDS 9396S, MDS 9148S, and MDS 9250i (16-Gbps platforms), no-credit-drop timeout functionality
has been enhanced by special hardware capabilities on port ASICs. No-credit-drop timeout can be configured at as
low as 1 ms. The timeout value can be increased up to 500 ms with granularity of 1 ms. If no-credit-drop timeout is
configured, the drop action is taken immediately by port ASIC without any software delay. These advanced
hardware assisted capabilities on Cisco MDS 16-Gbps platforms fully recover from slow-drain situations by
pinpointing and limiting the effect only to the flows that are destined to slow-drain devices.
By default, no-credit-drop timeout is off. It can be configured at the switch level for all F_ports.
switch(config)# system timeout no-credit-drop <value> mode F
switch(config)# system timeout no-credit-drop default mode F
credit-drop timeout. For example, benchmarking of a healthy fabric over the last 3 months provides a slowport-monitor operational delay of less than 20 ms across all ports on a Cisco MDS 9700 Multilayer Director. Sometimes, a few of the ports display a delay value of 30 ms. Serious performance degradation is observed when any port has zero remaining Tx B2B credits continuously for 50 ms or more. For this particular Cisco MDS 9700 switch, a no-credit-drop timeout of 50 ms can be used. Note that these values are used only as an illustration to explain the process of using slowport-monitor to find a no-credit-drop timeout; the exact timeout values can differ in different fabrics.
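A hypothetical command sequence that applies this process, carrying over the illustrative 20-ms benchmark and 50-ms timeout values from the example above, might look like this:
MDS9700(config)# system timeout slowport-monitor 20 mode F
MDS9700(config)# system timeout no-credit-drop 50 mode F
With these settings, slowport-monitor keeps logging ports whose credit unavailability exceeds the benchmarked norm, while the no-credit-drop timeout drops frames only for ports that cross the level at which serious performance degradation was observed.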
Credit loss recovery can fail if LRR is not received within 100 ms of transmitting the link reset. This leads
to port flap. See the Link Event LR Rcvd B2B on Fibre Channel Ports section on page 16 for more details.
Note:
Credit loss recovery is automatic and does not require any configuration by the user.
This approach gives a good snapshot of what the system is currently experiencing. However, this approach uses additional system resources and imposes a limit on the frequency of problem detection:
The supervisor needs to constantly make a decision about whether to trigger an action or recovery on the basis of a predefined policy.
Because this feature is a snapshot mechanism, the software can miss some transient conditions in corner cases. In rare situations when the control-plane CPU is busy, the software poll can be delayed. Due to the delay, the problematic condition may not be detected at the exact time interval, or the associated action might be delayed.
The 16-Gbps platforms in the Cisco MDS 9000 Family (MDS 9700, MDS 9396S, MDS 9148S, MDS 9250i) use a
hardware-based slow-drain detection and automatic recovery algorithm. In this approach, slow-drain detection and
recovery is built in to the port ASIC, and instead of relying on the software for polling, the hardware can
automatically detect every time a credit is not available and take appropriate action without any intervention from
the supervisor (Figure 16).
Figure 16.
Here are some of the benefits of hardware-based slow-drain detection and automatic recovery algorithm:
New feature called Slowport-monitor, which maintains history of Tx credit unavailability duration on all ports
at as low as 1 ms
Immediate automatic recovery from slow-drain situation by port ASIC without any software delay
Note:
A few of the enhanced features (such as slowport-monitor) have been enabled on the Cisco MDS 9500 (8-Gbps and advanced 8-Gbps platforms) from NX-OS Release 6.2(13) and later. However, the functionality is limited by the hardware capability of the MDS 9500.
Note:
The software process (Creditmon) responsible for polling ports on the MDS 9500 still exists on 16-Gbps
platforms. However, the process has been optimized by offloading most of the functionality to port ASICs.
ingress port considers the request timed out after a few milliseconds (the exact value depends on the platform). The number of such request timeouts is displayed on Cisco MDS 9000 Family switches.
The egress port should be investigated because this behavior is an indication of transmission delays. These arbitration-request timeout events can be viewed on a per-egress-port basis along with the ingress port numbers and a time stamp of the earliest and latest events.
Note:
Arbitration timeouts are not frame drops. Arbitration requests are retried and, if the request is granted, the frame can be transmitted to the egress port successfully. If a frame never receives a grant, it is eventually dropped at the congestion-drop timeout and counted as a timeout discard.
Error-stats: contains information on timeout discards, credit loss recovery, link failures, and other errors.
Level | Symptoms
Level 1: Latency | Frame queuing
Level 2: Retransmission | SCSI retransmission, frame dropping
Level 3: Extreme delay | Link reset
The Cisco recommended order of troubleshooting is extreme delay (Level 3), followed by retransmission (Level 2), followed by latency (Level 1), as shown in Figure 18. Only after the entire extreme-delay situation is resolved should troubleshooting focus on retransmission. Troubleshooting a high-latency situation should be the final step.
Figure 18.
Level 1: Latency
High latency in the fabric means SCSI exchanges are taking longer than normal. There may not be any errors or retransmissions involved. High-latency situations are subtle and more difficult to detect.
Table 5 provides a quick reference of features along with NX-OS show commands that can be used to troubleshoot
different levels of performance degradation. The details of the features have been explained in previous sections,
and NX-OS show command details are in Appendix C. See Appendix E for more about the supported platforms
and Cisco NX-OS Software versions on which these commands are available.
Table 5.
Level
Level 1: Latency
2.
Check the VSAN zoning database to see which devices the adjacent Fibre Channel device is zoned with. Map these to egress E_ports or F_ports. To map to egress E_ports, use the show fspf internal route vsan <vsan> domain <dom> command. To map to local F_ports, use the show flogi database vsan <vsan> command (illustrative invocations of both commands follow step 5). If more than one link is failing and displays LR Rcvd B2B, then combine the egress E_ports or F_ports that are found and check for overlap. The overlapping port is likely the port that caused the link failures.
3.
Check the ports found in step 2 for indications of Tx B2B credit unavailability. Examples are credit loss
recovery, (<component_id>_CNTR_CREDIT_LOSS), 100 ms Tx B2B zero
(<component_id>_CNTR_TX_WT_AVG_B2B_ZERO), TxWait, slowport-monitor, and timeout discard
(<component_id>_TIMEOUT_DROP_CNT).
4.
If the failure port is determined to be an E_port, then continue slow-drain troubleshooting on the adjacent
switch indicated by the Fabric Shortest Path First (FSPF) next-hop interface.
5.
If the port is determined to be a Fibre Channel over IP (FCIP) link, then check the FCIP links for signs of TCP
retransmission or other problems, such as link failures. The command show ips stats all can be used to
check for problems.
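As a purely illustrative invocation of the mapping commands from step 2 (VSAN 10 and domain 35 are hypothetical values):
show fspf internal route vsan 10 domain 35
show flogi database vsan 10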
Figure 19 provides a process flow chart to follow congestion to source with slow-drain situations.
Figure 19.
Table 6 provides a list of features that can be used to troubleshoot Rx congestion and Tx congestion, and to find the Tx port from the Rx port on a Cisco MDS 9000 Family switch.
Table 6.
Troubleshooting Rx Congestion | Troubleshooting Tx Congestion | Linking Rx to Tx Ports
LR Rcvd B2B                   | Slowport-monitor              | Arbitration timeout
Generic Guidelines
In identifying a slow-drain device, be aware of the following:
Logs are detailed and can roll over on an active port. Though events are stored in OBFL, troubleshooting should begin quickly after slow-drain problems are detected.
If credit loss recovery and/or transmit frame drops occur on an ISL, then traffic destined to any egress port
on the switch can be affected, and so multiple edge devices may report errors. If either condition is seen on
ISLs, then the investigation should continue on the ISL peer switch. If an edge switch shows signs of frame
drops, then each edge port on the edge switch should be checked.
Detecting and Troubleshooting Slow Drain with Cisco Data Center Network Manager
Cisco Data Center Network Manager (DCNM) provides slow-drain diagnostics in Release 7.1(1) and later. DCNM provides a fabricwide view of all switches, ISLs, and end devices. DCNM can watch thousands of ports in a fabric for slow-drain situations. The health of these ports is displayed in a single pane of glass to easily identify the top offenders. DCNM slow-drain diagnostics reduces troubleshooting time from days to minutes by pinpointing the problem. Often, SAN administrators struggle to find a starting point to locate a slow-drain device in large and complex topologies. Cisco recommends using DCNM slow-drain diagnostics as the very first troubleshooting step when a slow-drain device is suspected in a fabric.
Following is a description of using slow-drain diagnostics on Cisco DCNM, as shown in Figure 20.
Figure 20.
1.
To access the diagnostics, choose Health > Diagnostics > Slow Drain Analysis.
2.
Define the scope of analysis by selecting a fabric from the drop-down list.
3.
Choose a duration for the slow-drain analysis. From DCNM Release 7.2(1) and later, the duration can be extended up to 150 hours. The analysis can be started or stopped using the buttons provided.
4.
A history of previous analyses or the status of a currently running analysis can be viewed by choosing from the Current Jobs drop-down menu. Analyses can be run in parallel for different fabrics at the same time by using two browser sessions. Counters are polled every 10 seconds for the duration specified. The analysis can be stopped manually by clicking the stop button.
5.
To display the output of the analysis while it is running, click the Show Results icon under the Current Jobs drop-down menu.
The change of multiple values is displayed for the complete fabric. For detailed analysis, counters can be
zoomed to 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or the maximum duration.
6.
To display finer granularity, use the time slider, or select From and To time stamps.
7.
Use multiple options to pinpoint slow-drain devices from thousands of ports in a large and complex fabric in
minutes:
a. The counters are color coded. Counters in red indicate a drastic change, counters in orange indicate a moderate change, and so on. Troubleshooting should be started on the ports showing counters in red.
b. The display can be limited to ports with nonzero counter values by selecting the Data Rows Only radio button.
c. You can filter ports further with the filtering options available on all fields. For example, if one of the switches is suspect, it can be inspected by filtering under the Switch Name column. If one of the end devices is suspected, it can be inspected by filtering under the ConnectTo column. Switch ports can be filtered based on their connectivity to end devices (F_port) or ISLs (E_port). F_ports can further be filtered by connection to host or storage. Specific counters can be filtered by providing a minimum threshold value in the text box. Only ports with counter values higher than the provided value are displayed.
8.
To open the end-to-end topology showing the end device in the fabric, click the icon of the connected device
just before the device name under the ConnectTo column.
9.
To display a graph of a counter over the analyzed period, click the graph icon just before the interface name. To display the counter value at a particular time stamp, position the cursor over the graph.
A graphical representation of the counters is an extremely powerful feature that enables locating an abnormal condition and reducing false positives. Large values of many counters (such as RxB2Bto0, TxB2Bto0, or TxWait) may be acceptable in a fabric. However, an unexpected sudden change in these counters can indicate a problem. For example, if TxWait on a stable fabric rises by 1000 every 5 minutes (the numbers are just for illustration), this change can be treated as a typical expected value. However, a problem may exist if TxWait increments by millions over the next 5-minute interval. Locating such sudden spikes becomes extremely intuitive and fast using the graph.
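To put the illustrative numbers above in perspective (using the 2.5 µs tick size described in the TxWait sections of this document): an increment of 1000 TxWait ticks corresponds to 1000 x 2.5 µs = 2.5 ms of credit unavailability within a 5-minute window, whereas an increment of 1,000,000 ticks corresponds to 2.5 seconds. The jump from milliseconds to whole seconds of waiting in the same window is what the graph makes visually obvious.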
The output of a job on a fabric can be exported to Microsoft Excel format for sharing, deeper inspection, and archiving.
As of DCNM Release 7.2(1), the counters listed in Table 7 can be monitored.
Table 7.
Counter | Description | Reference Section
TxCreditLoss | |
TxLinkReset | |
RxLinkReset | |
TxTimeoutDiscard | | Timeout Discards
TxDiscard | | Timeout Discards
TxWtAvg100ms | |
RxB2Bto0 | |
TxB2Bto0 | Number of times the remaining Tx B2B credits fall to zero, even for an instant | Credit Transition to Zero
TxWait (2.5 us) | |
Note:
DCNM slow-drain diagnostics uses SNMP object identifiers (OIDs) for analysis. The actual counters must
be supported by the managed switch. See the support matrix in Appendix E for supported features across various
platforms and NX-OS releases.
Note:
For DCNM Release 7.2(1) and later, the displayed value of a counter is the collective change in value across all previous jobs on a given fabric. Counters for a particular job can be seen by zooming to a particular time window.
Smaller jobs of 10 or 30 minutes duration should be run. If a longer job is in progress, live counters should be monitored.
Before starting a job, consider deleting previous collections from DCNM. This deletion helps to reduce the
time slider by removing old data. Exporting data in Microsoft Excel format is a good practice before deleting
any data. If deleting previous collections is not desired then use the zoom functionality or time slider to
monitor only the latest collected data.
Select the Data Rows Only radio button as a very first filtering step.
Look for counters in the same order as the levels of congestion they represent: TxCreditLoss, then TxLinkReset and RxLinkReset, then TxDiscard and TxTimeoutDiscard, then TxWtAvg100ms, then TxWait, and finally TxB2Bto0 and RxB2Bto0.
Look for counters in red. Click the Show Filter icon and enter a large number to enable filtering. A large
number should be chosen so you can display only a handful of devices. If not enough devices are shown
after applying the filter or displayed ports are not suspect, consider reducing the filter value so that more
ports can be displayed.
Filter ports based on their connectivity to host, storage, or switch. Ports connected to hosts should be analyzed before ports connected to storage and switches. Troubleshooting ISL ports (ports connected to a switch) should be the last step.
Display the graph for filtered ports. Watch for ports with high values and spikes.
Display the topology and pay special attention to degraded data rate on the suspected ports.
A port displaying low-transmit data rate and increments in slow-drain counters at high rate is a suspect port
that might be connected to a slow-drain device.
Proactive Approach
Slow-drain diagnostics enables prevention of slow drain, a complementary function on top of the detection, troubleshooting, and automatic recovery features described throughout this document. A well-performing fabric can face a slow-drain situation if even one end device malfunctions. The malfunctioning slow-drain end device can display patterns or interim intervals of poor performance before it starts causing severe performance degradation to an application.
To understand this better, let's take the example of an HBA with an average TxWait of 1 second over an interval of 5 minutes. This means that in a window of 5 minutes, frames had to wait for 1 second because Tx B2B credits were unavailable. The TxWait of 1 second is a cumulative value spread across 5 minutes with a quantum of 2.5 µs. If the application on the host with this HBA has been performing very well over the last 3 months (or any other long duration), it is safe to assume that TxWait of 1 second in a 5-minute window is a typical expected value for the F_port. This is called slow-drain port profiling or benchmarking.
If the HBA malfunctions and becomes a slow-drain device, TxWait can increase to a value where application performance is severely impacted. The impact is also seen on other hosts sharing the same switches and ISLs. Let's assume that this increased TxWait value is 20 seconds in a window of 5 minutes. Any such spike in the counters in DCNM needs attention and should be carefully analyzed, even before the application end users complain about performance degradation. In other situations, the delay value may not rise to 20 seconds and stay there permanently. There may be interim 5-minute windows in which the delay value rises well above the typical 1 second, while in other windows it stays near the expected value. Any such random spike in the delay value might be a peek into the future that the HBA is about to malfunction. Fabricwide benchmarking of all F_ports enables a SAN administrator to maintain a history of acceptable delay values. Ports with random spikes in delay values above the acceptable value should be kept on a watch list. If the number of TxWait spikes on a port is increasing, the connected end device is probably about to malfunction completely.
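Using the numbers from this example: a TxWait of 1 second in a 5-minute (300-second) window means the port was unable to transmit for 1/300, or roughly 0.33 percent, of the time, while a TxWait of 20 seconds in the same window corresponds to about 6.7 percent. Benchmarking makes such a jump easy to recognize even though both absolute values may look small at first glance.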
This proactive approach of fabric profiling or benchmarking can be performed natively on Cisco MDS 9000 Family switches using slowport-monitor and TxWait. The centralized fabricwide visibility of DCNM makes this exercise simpler and faster.
Summary
Congestion in SANs cripples application performance instantly. Problems such as slow drain originate from one
misbehaving end device but can impact all other end devices that are sharing the same pair of switches and ISLs.
Following is the summary of recommendations to detect, troubleshoot, and automatically recover from slow drain
on Cisco MDS 9000 16-Gbps platforms.
Always do the following:
Configure slowport-monitor at 10-25 ms for both E_ports and F_ports. The value can be further reduced without any side effects.
Configure the no-credit-drop timeout on F_ports at 50 ms. If the SAN administrator finds this value to be aggressive, 100 ms is a good start. Use the output of slowport-monitor to refine the value of the no-credit-drop timeout.
2.
Use the TxWait counter effectively by monitoring port health through its graphical display and the percentage of congestion available in the output of Cisco NX-OS show commands.
3.
4.
Follow the Cisco recommended methodology of moving towards the source of congestion.
5.
Last but not least, it is highly recommended to benchmark the credit unavailability duration on all fabric ports to forecast such events.
Conclusion
Cisco MDS 9000 Family switches have been architected to build robust and self-healing SANs. All ports function at full line rate with predictable and consistent performance. The holistic approach taken by Cisco to detect, troubleshoot, and automatically recover from slow drain keeps the performance of your SAN at its peak. Slow-drain diagnostics on Cisco DCNM reduces troubleshooting time from weeks to minutes. Overall, Cisco MDS 9000 Family switches are the best choice for building SANs to support the most critical business applications.
Port flap: Leads to link down followed by link up, similar to using the NX-OS shutdown command followed by no shutdown on an interface.
Error-disable: Leads to link down until manual user intervention brings the link back up, similar to using the NX-OS shutdown command on an interface. The user must manually execute the no shutdown command on the interface to bring the link back up, as shown in the example after this list.
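For reference, a port placed in the error-disabled state can be brought back with the shutdown/no shutdown sequence mentioned above (fc1/11 is only an example interface):
switch(config)# interface fc1/11
switch(config-if)# shutdown
switch(config-if)# no shutdown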
Consider a port with TxWait of 100 ms over a poll interval of 1 second. This means the Tx B2B credits on the port were unavailable for 100 ms in the monitored duration of 1 second. An administrator may want to receive an automated alert, or even shut down the port, if the TxWait exceeds 300 ms over the poll interval of 1 second. The administrator can configure a Port Monitor policy to achieve this (Figure 21); a configuration sketch follows the figure.
Figure 21.
Port Monitor Functionality Using TxWait on Cisco MDS 9000 Family Switches
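A minimal sketch of such a policy, reusing the txwait counter syntax shown in the configuration example later in this appendix (the policy name is hypothetical, and the 30 percent rising threshold corresponds to 300 ms of a 1-second poll interval; a portguard action could be added to shut the port instead of only alerting):
port-monitor name TxWait_Alert
port-type all
counter txwait poll-interval 1 delta rising-threshold 30 event 4 falling-threshold 0 event 4
port-monitor activate TxWait_Alert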
An advantage of Port Monitor is its unique ability to monitor hardware-based counters at extremely granular time intervals. For example, an SNMP trap can be generated for as little as 1 ms of credit unavailability in a span of 1 second using the slowport-monitor counters under Port Monitor. The Port Monitor feature provides more than 19 different counters. However, the scope of this document is limited to the counters that are specific to slow drain. Table 8 lists the counters that apply to the slow-drain solution.
Table 8.
Port Monitor Counters Applicable for Slow-Drain Detection and Automatic Recovery
Counter | Description | Reference Section
credit-loss-reco | |
lr-tx | |
lr-rx | |
timeout-discards | | Timeout Discards
tx-discard | | Timeout Discards
tx-credit-not-available | |
tx-wait | |
tx-slowport-oper-delay | | Slowport-monitor
tx-slowport-count | | Slowport-monitor
Note:
These counters can also be monitored using SNMP OIDs. See Appendix D for details.
A slow-drain Port Monitor policy can be created for access ports, trunk ports, or all ports. Only one policy can be active for each port type at a time. If the port type is all ports, then there can be only one active policy. A brief introduction to the Port Monitor configuration is provided using the example of tx-credit-not-available (credit unavailability at 100 ms). The configuration of other counters is similar.
Configuration Example
A Port Monitor policy can be configured at the switch level or port level (F_port, E_port, or all ports):
switch(config)# port-monitor name Cisco
switch(config-port-monitor)# port-type <access-port | all | trunks>
port-type: Allows users to customize the specific policy to access or trunk ports or all ports.
poll-interval: Indicates the polling interval in which slow-drain statistics are collected; measured in seconds
(configured to 1 second in this example).
Threshold type: Determines the method for comparing the counter with the threshold. If the type is set to absolute, the value at the end of the interval is compared to the threshold. If the type is set to delta, the change in the value of the counter during the polling interval is compared to the threshold. For tx-credit-not-available, delta should be used.
rising-threshold: Generates an alert if the counter value was lower than the rising-threshold value in the last polling interval and is greater than or equal to this threshold at this interval. Another alert is not generated until the counter is less than or equal to the falling threshold at the end of another polling interval.
event: Indicates the event number to be included when the rising-threshold value is reached. The event can be a syslog message or an SNMP trap. Counters can be assigned different event numbers to indicate different severity levels.
falling-threshold: Generates an alert if the counter was higher than the rising-threshold value in a prior polling interval and is lower than or equal to the falling-threshold value at this interval.
portguard: Advanced option that can be set to apply error-disable or flap the affected port.
For example, in the sample command for the counter tx-credit-not-available, the poll interval is 1 second and the rising threshold is set to 10 percent (which translates to 100 ms). The rising-threshold event is triggered if Tx B2B credits are unavailable for a continuous duration of 100 ms in a polling interval of 1 second. This event results in an SNMP trap, and the port is put into the error-disabled state. It remains in that state until someone manually issues the shutdown and no shutdown commands on that port.
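Putting the pieces together, the configuration implied by this example might look as follows (a sketch only; the portguard errordisable keyword follows the portguard option described above and should be verified for the NX-OS release in use):
switch(config)# port-monitor name Cisco
switch(config-port-monitor)# counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4 portguard errordisable
switch(config-port-monitor)# exit
switch(config)# port-monitor activate Cisco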
Table 9 provides a support matrix of various slow-drain-specific counters in Port Monitor. The recommended values can be configured as starting threshold values. Monitoring over weeks or months provides more information to further refine the thresholds.
Table 9.
Counter Name | Supported Platforms | Minimum NX-OS Version | Threshold Type | Interval (seconds) | Rising Threshold | Falling Threshold
credit-loss-reco | ALL | 5.x.x or 6.x.x | Delta | 60 | 1 | 0
lr-rx | ALL | 5.x.x or 6.x.x | Delta | 60 | 5 | 1
lr-tx | ALL | 5.x.x or 6.x.x | Delta | 60 | 5 | 1
timeout-discards | ALL | 5.x.x or 6.x.x | Delta | 60 | 50 | 10
tx-credit-not-available | ALL | 5.x.x or 6.x.x | Delta | 1 | 10% | 0%
tx-discards | ALL | 5.x.x or 6.x.x | Delta | 60 | 50 | 10
tx-slowport-count (note 1) | MDS 9500 only with DS-X9248-48K9, DS-X9224-96K9, or DS-X9248-96K9 line cards | 6.2(13) | Delta | 1 | 5 | 0
tx-slowport-oper-delay (note 2) | MDS 9700, MDS 9396S, MDS 9148S, MDS 9250i, and MDS 9500 only with DS-X9232-256K9 or DS-X9248-256K9 line cards | 6.2(13) | Absolute | 1 | 50 ms (for 16-Gbps platforms), 80 ms (for other platforms) | 0
txwait | MDS 9700, MDS 9396S, MDS 9148S, MDS 9250i, and MDS 9500 only with DS-X9232-256K9 or DS-X9248-256K9 line cards | 6.2(13) | Delta | 1 | 20% | 0%
1: Slowport-monitor must be enabled for this counter to work. The counter increments only if the slowport-monitor admin delay (configured value) is less than the duration for which the remaining Tx B2B credits stay at zero.
2: Slowport-monitor must be enabled for this counter to work. The threshold is exceeded only if the slowport-monitor admin delay (configured value) is less than, and the reported operational delay (oper-delay) is more than, the configured rising threshold.
Following is the NX-OS configuration based on the recommended thresholds from Table 9 with an alert-only action.
SAN administrators can make changes according to their requirements.
port-monitor name Custom_SlowDrain_AllPorts
port-type all
counter tx-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
counter lr-rx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
counter lr-tx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
counter timeout-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
counter credit-loss-reco poll-interval 60 delta rising-threshold 1 event 2 falling-threshold 0 event 2
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 4 falling-threshold 0 event 4
counter txwait poll-interval 1 delta rising-threshold 20 event 4 falling-threshold 0 event 4
port-monitor activate Custom_SlowDrain_AllPorts
Figure 22.
Consider the credit unavailability scenario in Figure 23. The remaining Tx B2B credits fall to 0 at 250 ms and do not recover until 390 ms. The overall duration is more than 100 ms, but tx-credit-not-available does not flag this event. Software polls are executed every 100 ms, and there were no two consecutive polls at which the remaining Tx B2B credits were zero: credits are at zero at the 300-ms poll but have recovered by the 400-ms poll. The hardware-based implementation of tx-slowport-oper-delay and txwait helps to flag this condition.
Figure 23.
Consider the credit unavailability scenario in Figure 24. The remaining Tx B2B credits fall to 0 multiple times within a poll interval of 1 second. None of the credit unavailability durations is longer than 100 ms on its own, but the sum of these durations is longer than 100 ms. TxWait is the only counter that flags such a condition.
Figure 24.
As shown in Figure 24, it is clear that txwait helps to find transient conditions of credit unavailability. This is an
added advantage over tx-slowport-oper-delay, which finds continuous duration of credit unavailability.
Table 10 compares the three counters as monitored by the Port Monitor feature.
Table 10.
Attribute | txwait | tx-slowport-oper-delay | tx-credit-not-available
Monitored by default | Yes | No | Yes
Supported actions | syslog, trap | |
Poll interval | 1 second | 1 second | 1 second
Threshold unit | | Delay in ms |
 | Yes | Yes | Yes
 | Yes | No | No
Minimum granularity | 10 ms | 1 ms | 100 ms
Implemented in software or hardware | Hardware | Hardware | Software
Appendix C: Cisco MDS 9000 Family Slow-Drain detection and Troubleshooting Commands
TxWait Period for Frames
TxWait is a counter that increments when a port has zero remaining Tx B2B credits for 2.5 microseconds. This counter reports the credit unavailability duration on a port in multiple intuitive ways.
Note:
TxWait is reported only on 16-Gbps and advanced 8-Gbps platforms; the remaining platforms in the Cisco MDS 9000 Family do not report it.
Apply the previous formula to the TxWait value displayed in the previous output:
TxWait in seconds = (6,252,650 ticks x 2.5 microseconds) / 1,000,000 = 15.63
This indicates that MDS was not able to transmit frames for more than 15 seconds since the counters were last
cleared.
Low-level troubleshooting may require clearing the counters (use the clear counter NX-OS command) followed by
displaying the TxWait value multiple times.
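As a quick sanity check of the arithmetic above, the tick-to-seconds conversion can be scripted. This is only a helper illustrating the 2.5-microsecond tick size, not a Cisco utility.

TICK_MICROSECONDS = 2.5          # each TxWait increment represents 2.5 microseconds

def txwait_seconds(ticks):
    """Convert a raw TxWait counter value to seconds of zero-credit time."""
    return ticks * TICK_MICROSECONDS / 1_000_000

print(txwait_seconds(6_252_650))   # ~15.63 seconds, matching the example above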
Displaying Percentage Tx Credits Not Available
Low-level troubleshooting has been further simplified by displaying the percentage of time for which the remaining Tx B2B credits were zero over the last 1 second, 1 minute, 1 hour, and 72 hours. The percentage value is derived from the TxWait counter. This information is an added advantage over the raw TxWait counter because it provides a longer-term snapshot of the health of the port.
-----------------------------------------------------------------------------
| Interface | Delta TxWait   | Delta TxWait | Congestion | Timestamp         |
|           | (2.5-us ticks) | (seconds)    |            |                   |
-----------------------------------------------------------------------------
| fc1/11    | 3435973        | 08           | 42%        |                   |
| fc1/11    | 6871947        | 17           | 85%        |                   |
-----------------------------------------------------------------------------
The Delta TxWait column displays the increment in the TxWait counter over a 20-second window. The value is also displayed in seconds, and the duration for which the remaining Tx B2B credits were zero over the span of 20 seconds is displayed as a percentage. For example, the first row shows TxWait incremented by 3435973 over the last 20 seconds on Sunday, Sep 30, 2015 at 05:23:05. 3435973 ticks at 2.5 microseconds per tick translates to 8.6 seconds, which is about 42 percent of 20 seconds.
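The congestion percentage in that output follows directly from the delta. A short illustrative calculation (not Cisco code) reproduces the first row:

TICK_MICROSECONDS = 2.5                                    # TxWait tick size
WINDOW_SECONDS = 20                                        # logging interval

delta_ticks = 3_435_973                                    # increment from the first row above
seconds = delta_ticks * TICK_MICROSECONDS / 1_000_000      # ~8.6 s of zero-credit time
congestion_pct = 100 * seconds / WINDOW_SECONDS            # ~43%, displayed as 42% above
print(f"{seconds:.1f} s, {congestion_pct:.0f}%")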
[TxWait history graph output: columns of '#' characters plot the TxWait value recorded for each of the last 60 seconds (horizontal axis 0 to 60), against a vertical axis labeled from 100 to 1000.]
slowport-monitor
Slowport-monitor can be enabled by the system timeout slowport-monitor config level command.
MDS9700(config)# system timeout slowport-monitor ?
<1-500>
default
mode
mode
After this command is enabled, events are logged whenever a port has zero remaining Tx B2B credits for a duration longer than the configured value. The configured value is called the admin delay. The default admin delay is 50 ms. The command allows admin delay values from 1 to 500 ms with 1-ms granularity. These values help capture slow-drain devices with enhanced precision. In addition, separate admin delay values can be configured for F_ports and E_ports. There is no extra control-plane CPU overhead due to slowport-monitor; hence, an admin delay value as low as 1 ms can be configured safely. However, this may result in a flood of information, depending on the health of the fabric. A good starting admin delay value is 10 ms. If many ports show delay values higher than 10 ms, troubleshooting should be performed on those ports. If only a few ports show delay values higher than 10 ms, the admin delay value can be reduced further.
Note:
The admin delay value in slowport-monitor must not exceed the no-credit-drop timeout value. If the no-credit-drop timeout is 1 ms, the slowport-monitor admin delay should also be 1 ms. If the no-credit-drop timeout is 50 ms, the slowport-monitor admin delay can be 50 ms or less.
show process creditmon slowport-monitor-events
Use show process creditmon slowport-monitor-events to display the last 10 events per port.
MDS9700# show process creditmon slowport-monitor-events
 Module: 01
=========================================================================
 Interface = fc1/13
----------------------------------------------------------------
| admin | slowport  | oper  |
| delay | detection | delay |        Timestamp
| (ms)  | count     | (ms)  |
----------------------------------------------------------------
|       |   1300    |  20   | 1. 04/01/15 23:03:38.823
|       |   1296    |  19   | 2. 04/01/15 23:03:38.724
|       |   1291    |  19   | 3. 04/01/15 23:03:38.623
----------------------------------------------------------------
Admin delay is the value configured by the system timeout slowport-monitor command. Though the events are captured by the port ASIC at a granularity as low as 1 ms, a window of 100 ms is used to display the output. The slowport detection count is the number of times the remaining Tx B2B credits were zero for a duration longer than the configured admin delay value. It is an absolute counter; increments in the last 100 ms can be obtained by subtracting the value displayed in the row below. The operational delay (oper delay) is the actual duration for which the remaining Tx B2B credits on the port were zero at the displayed time stamp. If there was more than one event in the 100-ms period, it is the average of all events. In the previous example, at time 23:03:38.823, Tx B2B credits were unavailable continuously for 20 ms. This event occurred 4 times (1300 - 1296), so there was a total of approximately 80 ms (20 ms x 4) in the previous 100-ms interval when credits were not available.
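The subtraction described above can be automated. The following sketch (an illustration built from the sample output values, not a Cisco tool) derives the per-interval event count and the approximate zero-credit time for each 100-ms display window.

# (timestamp, absolute slowport detection count, oper delay in ms), newest first,
# taken from the sample 'show process creditmon slowport-monitor-events' rows.
rows = [
    ("23:03:38.823", 1300, 20),
    ("23:03:38.724", 1296, 19),
    ("23:03:38.623", 1291, 19),
]

for newer, older in zip(rows, rows[1:]):
    stamp, count, oper_delay_ms = newer
    events = count - older[1]                  # increments in the last 100 ms
    zero_credit_ms = events * oper_delay_ms    # approximate zero-credit time
    print(f"{stamp}: {events} events x ~{oper_delay_ms} ms "
          f"= ~{zero_credit_ms} ms of the last 100 ms without Tx credits")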
Difference Between 16-Gbps and Advanced 8-Gbps Platforms
Note:
On 16-Gbps platforms, oper delay is displayed only if the remaining Tx B2B credits are zero for a continuous span longer than the admin delay value within the 100-ms window. On advanced 8-Gbps platforms, oper delay is the cumulative duration within the 100-ms window for which the remaining Tx B2B credits are zero. The difference is the continuity of the duration: for example, if the admin delay is configured to be 5 ms and the remaining Tx B2B credits are zero for 4 ms twice in a window of 100 ms, 16-Gbps platforms do not display this event in the oper delay value, whereas advanced 8-Gbps platforms display an oper delay of 8 ms. Because of this difference in capability, the following applies to the output of show process creditmon slowport-monitor-events on advanced 8-Gbps platforms:
The slowport-monitor detection count increments by only 1 in a window of 100 ms. On 16-Gbps platforms, the count can increment by more than 1.
TxWait oper delay is used in place of oper delay in the output of the show command. This signifies the true meaning of the delay value, which is actually the cumulative duration.
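A hedged sketch of the two behaviors (illustrative only): with an admin delay of 5 ms and two separate 4-ms zero-credit gaps in a 100-ms window, the 16-Gbps rule (only a continuous gap longer than the admin delay is reported) shows nothing, while the advanced 8-Gbps rule (cumulative zero-credit time in the window) reports 8 ms.

admin_delay_ms = 5
gaps_ms = [4, 4]          # two separate zero-credit gaps in one 100-ms window

# 16-Gbps platforms: report only a continuous gap longer than the admin delay.
oper_delay_16g = max((g for g in gaps_ms if g > admin_delay_ms), default=None)

# Advanced 8-Gbps platforms: report the cumulative zero-credit time in the window.
total_ms = sum(gaps_ms)
oper_delay_adv8g = total_ms if total_ms > admin_delay_ms else None

print("16-Gbps oper delay        :", oper_delay_16g)     # None (no event displayed)
print("advanced 8-Gbps oper delay:", oper_delay_adv8g)   # 8 (ms)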
8-Gbps platforms have basic slowport-monitor capability. In a span of 100 ms, it can only be determined whether the remaining Tx B2B credits were zero for a duration longer than the configured admin delay value; the exact duration and the number of occurrences cannot be determined. Following is the output of show process creditmon slowport-monitor-events on 8-Gbps platforms:
MDS9500# show process creditmon slowport-monitor-events module 2
 Module: 02
=========================================================================
 Interface = fc2/1
--------------------------------------------------------
| admin | slowport  |
| delay | detection |        Timestamp
| (ms)  | count     |
--------------------------------------------------------
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
|  10   |           |
--------------------------------------------------------
Notice that the oper delay column is not displayed. Also, the slowport detection count can increment only by 1 in the 100-ms time stamp window.
OBFL slowport-monitor-events
Only the last 10 events are displayed under show process creditmon slowport-monitor-events. More
events are captured at OBFL and can be displayed using show logging onboard slowport-monitor-events.
MDS9700# show logging onboard slowport-monitor-events
---------------------------------
 Module: 1 slowport-monitor-events
---------------------------------
--------------------------------------------------------------------------
| admin | slowport  | oper  |                       |
| delay | detection | delay |       Timestamp       | Interface
| (ms)  | count     | (ms)  |                       |
--------------------------------------------------------------------------
|  20   |    49     |  489  | 05/11/15 21:04:46.779 | fc1/13
|  20   |    48     |  489  | 05/11/15 21:04:46.272 | fc1/13
|  20   |    47     |  489  | 05/11/15 21:04:45.779 | fc1/13
|  20   |    46     |  489  | 05/11/15 21:04:45.272 | fc1/13
--------------------------------------------------------------------------
------------------------------------------------------------------------------------------
 Port   | Threshold | Interval(s) | Event Time | Type           | Duration of time
        |           |             |            | Rising/Falling | not available
------------------------------------------------------------------------------------------
 fc1/1  | 10/0(%)   |             |            | Falling        | 0%
 fc1/1  | 10/0(%)   |             |            | Rising         | 10%
 fc1/1  | 10/0(%)   |             |            | Falling        | 0%
 fc1/1  | 10/0(%)   |             |            | Rising         | 10%
 fc1/1  | 10/0(%)   |             |            | Falling        | 0%
 fc1/1  | 10/0(%)   |             |            | Rising         | 10%
 fc1/1  | 10/0(%)   |             |            | Falling        | 0%
 fc1/1  | 10/0(%)   |             |            | Rising         | 20%
 fc1/1  | 10/0(%)   |             |            | Falling        | 0%
 fc1/1  | 10/0(%)   |             |            | Rising         | 10%
------------------------------------------------------------------------------------------
The output displays the percentage of a 1-second interval for which Tx B2B credits were not available at the given time stamp. In the previous output, Tx B2B credits were unavailable for 100 ms (which is 10 percent of 1 second) on Tuesday, May 19, 2015 at 16:13:19. Similarly, Tx B2B credits were unavailable for 200 ms (which is 20 percent of 1 second) on Tuesday, May 19, 2015 at 16:15:12.
The interval(s) column displays the number of times the threshold of 10 percent or 0 percent was crossed during
the 1 second interval. Consider the scenario in Figure 25. The red line shows Tx B2B credit availability on a port
plotted against time. The port is observing variable delay in receiving R_RDY. At the start of a second, the
remaining Tx B2B credits on a port fall to zero. At the 150th millisecond, R_RDY is returned, resulting in nonzero
remaining Tx B2B credits. At the 500th millisecond, the remaining Tx B2B credits, again, fall to zero. At the 600th
millisecond, R_RDY is returned, resulting in nonzero remaining Tx B2B credits. This sequence of events displays 2
under the Interval(s) counter for that particular second.
Figure 25.
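Recreating the Figure 25 scenario numerically (an illustrative sketch, not Cisco code): two separate zero-credit intervals within one second give an Interval(s) value of 2 and a credit-unavailable duration of 250 ms, which is 25 percent of that second.

# Zero-credit intervals within one second, from the Figure 25 scenario:
# credits reach zero at 0 ms and recover at 150 ms, then reach zero again
# at 500 ms and recover at 600 ms.
zero_intervals_ms = [(0, 150), (500, 600)]

intervals = len(zero_intervals_ms)                               # Interval(s) column
unavailable_ms = sum(end - start for start, end in zero_intervals_ms)
percent = 100 * unavailable_ms / 1000

print("Interval(s) =", intervals)                                # 2
print(f"Credits unavailable for {unavailable_ms} ms ({percent:.0f}% of the second)")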
--------------------------------------------------------------------------------
 Interface |                                  |          |    Time Stamp
 Range     | Error Stat Counter Name          |  Count   | MM/DD/YY HH:MM:SS
--------------------------------------------------------------------------------
 fc1/13    | F16_TMM_TOLB_TIMEOUT_DROP_CNT    | 1496855  | 04/07/15 22:44:23
 fc1/13    | FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO   | 217      | 04/07/15 22:44:23
 fc1/13    | FCP_SW_CNTR_CREDIT_LOSS          | 19       | 04/07/15 22:44:23
 fc1/13    | F16_TMM_TOLB_TIMEOUT_DROP_CNT    | 1486654  | 04/07/15 22:44:03
 fc1/13    | FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO   | 108      | 04/07/15 22:44:03
 fc1/13    | FCP_SW_CNTR_CREDIT_LOSS          | 9        | 04/07/15 22:44:03
--------------------------------------------------------------------------------
LR Rcvd B2B
Show Logging Log File
Display the logs generated and watch for the message, Link Reset failed nonempty recv queue.
MDS9710-1# show logging logfile
%PORT-2-IF_DOWN_LINK_FAILURE: %$VSAN 100%$ Interface fc5/32 is down (Link
failure)
%PORT-5-IF_DOWN_LINK_FAILURE: %$VSAN 100%$ Interface fc5/32 is down (Link failure
Link Reset failed nonempty recv queue)
This command is available on all Cisco MDS 9000 Family switches.
Command: show port-config internal link-events
The show port-config internal link-events command is available after attaching to a module. Watch
for LR Rcvd B2B events.
MDS9700# attach module 1
Attaching to module 1 ...
To exit type 'exit', to abort type '$.'
Wind River Linux glibc_small (standard) 3.0
module-1# show port-config internal link-events
*************** Port Config Link Events Log ***************
----------------------  --------  --------  ------  ------------
Time                    PortNo    Speed     Event   Reason
----------------------  --------  --------  ------  ------------
...
Jul 28 00:46:39 2012    fc1/25    00670297  DOWN    LR Rcvd B2B
----------------------  --------  --------  ------  ------------
Dropped-Frame Information
The last 32 frames dropped by MDS 9700 and MDS 9396S can be displayed by the show hardware internal
fcmac inst <#> tmm_timeout_stat_buffer command. This is a module-level command which can be
executed by attaching to that particular module with the attach module NX-OS exec command.
MDS9700# attach module 1
Attaching to module 1 ...
To exit type 'exit', to abort type '$.'
Wind River Linux glibc_small (standard) 3.0
module-1#
module-1# show hardware internal fcmac inst 0 tmm_timeout_stat_buffer
Port Group num: 0 TMM TIMEOUT BUFFERS
---------------------------------------------
TO_RD:22 TO_WR:6 NUM PKTS:32
--------------------------------------------------------------
TMM TIMEOUT Packet :0
CHIPTIME :14227(0x3793)
SID:330040
ZERO:0
DID:170040
TSTMP_VALID:1 HDRTSTMP:14176(0x3760)
DI:2
AT:0
FCTYPE:0
RCTL:0
HDRCTL:6144
SI:12
PORTNUM:1
ZERO:0
DID:170040
TSTMP_VALID:1 HDRTSTMP:14176(0x3760)
DI:2
AT:0
FCTYPE:0
RCTL:0
HDRCTL:6144
SI:12
PORTNUM:1
<output truncated>
module-1# exit
rlogin: connection closed.
MDS9700# show system internal fcfwd idxmap port-to-interface
Port to Interface Table:(All values in hex)
--------------------------------------------------------------------------------
glob|                          |  |idx|type|    |    | node| mode|
idx |                          |  |   |    |    |    |     |     |
-----|--------------------------|--|---|----|----|----|-----|-----|------------
   1| 01001000 fc1/2           | 0| 00| 01 | 00 | 00 | 0102| 08  | 00
   2| 01002000 fc1/3           | 0| 01| 01 | 00 | 01 | 0102| 00  | 00
    |                          | 0| 02| 01 | 00 | 02 | 0102| 00  | 00
    |                          | 0| 12| 01 | 00 | 12 | 0102| 00  | 00
<output truncated>
  12| 01012000 fc1/13
The first command (show hardware internal fcmac inst <#> tmm_timeout_stat_buffer) shows the
last 32 dropped frames starting from 0. SID and DID fields show the source FCID and destination FCID,
respectively, of the frame. SI and DI stand for Source Index and Destination Index, respectively. SI and DI
represent the ingress and the egress port on the switch. The mapping of SI and DI can be obtained by the show
system internal fcfwd idxmap port-to-interface NX-OS exec level command. In the previous output,
SI of 12 represents fc1/13 while DI of 2 represents fc1/3.
These values help trace the end-to-end path of a frame. The frame might be destined to a slow-drain device, or it might just be a victim. The information about the dropped frames should be collected multiple times; a DID that appears repeatedly may represent a slow-drain device.
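When the timeout buffer is collected several times, a quick tally of destination FCIDs can point to the likely slow-drain device. The following is an illustrative helper only; the FCID values are placeholders patterned on the sample output above.

from collections import Counter

# (SID, DID) pairs gathered from repeated dumps of the timeout buffer.
dropped_frames = [
    ("330040", "170040"),
    ("330040", "170040"),
    ("4501c0", "170040"),
    ("330040", "2a0000"),
]

did_counts = Counter(did for _sid, did in dropped_frames)
for did, count in did_counts.most_common():
    print(f"DID 0x{did}: {count} dropped frames")
# A DID that dominates the tally is a candidate slow-drain device (or the
# device behind the most congested path).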
[Ingress-queue display output truncated; the annotation in the original output marks the ingress port column.]
This CLI output captures the mapping of the destination index to the global index (DI-to-GI mapping), revealing where frames that have been queued for an extended time are destined. In this case, the mapping indicates that one or more frames are queued to destination index 0xC, which is associated with port fc4/13. The mapping can be determined using the command shown here:
MDS9500# show system internal fcfwd idxmap port-to-interface
Port to Interface Table:(All values in hex)
--------------------------------------------------------------------------------
glob|                          |  |idx|type|    |    | node| mode|
idx |                          |  |   |    |    |    |     |     |
-----|--------------------------|--|---|----|----|----|-----|-----|------------
   b| 0100b000 fc1/12          | 0| 0b| 01 | 00 | 0b | 0102| 00  | 00
    |                          | 0| 13| 01 | 0c | 13 | 0d02| 00  | 00
    |                          | 0| 0c| 01 | 00 | 0c | 0102| 00  | 00
    |                          | 0| 14| 01 | 0c | 14 | 0d02| 00  | 00
   c| 0100c000 fc4/13
 195| 01614000 fc13/21
Note:
The information displayed is real-time data, not historical data. Consequently, the command should be run while the congestion is occurring.
Arbitration Timeouts
Use show logging onboard flow-control request-timeout to display arbitration timeouts.
module# show logging onboard flow-control request-timeout
----------------------------
 Module: 1
----------------------------
--------------------------------------------------------------------------------
|  Dest   |  Source   |Events|      Timestamp       |      Timestamp       |
|  Intf   |  Intf     | Count|       Latest         |       Earliest       |
--------------------------------------------------------------------------------
| fc3/1   | fc1/10,   |      |                      |                      |
|         | fc1/11,   |      |                      |                      |
|         | fc1/12,   |      |                      |                      |
|         | fc1/22,   |      |                      |                      |
|         | fc1/23,   |      |                      |                      |
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 Interface |                                  |          |    Time Stamp
 Range     | Error Stat Counter Name          |  Count   | MM/DD/YY HH:MM:SS
--------------------------------------------------------------------------------
 fc1/13    | F16_TMM_TOLB_TIMEOUT_DROP_CNT    | 1496855  | 04/07/15 22:44:23
 fc1/13    | FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO   | 217      | 04/07/15 22:44:23
 fc1/13    | FCP_SW_CNTR_CREDIT_LOSS          | 19       | 04/07/15 22:44:23
 fc1/13    | F16_TMM_TOLB_TIMEOUT_DROP_CNT    | 1486654  | 04/07/15 22:44:03
 fc1/13    | FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO   | 108      | 04/07/15 22:44:03
 fc1/13    | FCP_SW_CNTR_CREDIT_LOSS          | 9        | 04/07/15 22:44:03
--------------------------------------------------------------------------------
----------------------------------------------------
| Interface |  Total  |        Timestamp
|           |  Events |
----------------------------------------------------
| fc1/13    |         |
----------------------------------------------------
OBFL error-stats
Use show logging onboard error-stats to list when the remaining Tx B2B credits were zero for 1 second on an F_port or 1.5 seconds on an E_port.
MDS9700# show logging onboard error-stats
--------------------------------------------------------------------------------
 ERROR STATISTICS INFORMATION FOR DEVICE: FCMAC
--------------------------------------------------------------------------------
 Interface |                                  |          |    Time Stamp
 Range     | Error Stat Counter Name          |  Count   | MM/DD/YY HH:MM:SS
--------------------------------------------------------------------------------
 fc1/13    | F16_TMM_TOLB_TIMEOUT_DROP_CNT    | 1496855  | 04/07/15 22:44:23
 fc1/13    | FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO   | 217      | 04/07/15 22:44:23
 fc1/13    | FCP_SW_CNTR_CREDIT_LOSS          | 19       | 04/07/15 22:44:23
 fc1/13    | F16_TMM_TOLB_TIMEOUT_DROP_CNT    | 1486654  | 04/07/15 22:44:03
 fc1/13    | FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO   | 108      | 04/07/15 22:44:03
 fc1/13    | FCP_SW_CNTR_CREDIT_LOSS          | 9        | 04/07/15 22:44:03
--------------------------------------------------------------------------------
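Because OBFL keeps appending rows over time, it is often useful to see how much each counter increased between successive samples. The following Python sketch (an illustration, not a Cisco utility) parses rows in the pipe-separated format shown above and prints the per-counter delta for each interface.

import re
from collections import defaultdict

# Sample rows in the 'show logging onboard error-stats' format:
# interface | counter name | count | timestamp (newest first).
sample = """\
fc1/13 |F16_TMM_TOLB_TIMEOUT_DROP_CNT |1496855 |04/07/15 22:44:23
fc1/13 |FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO |217 |04/07/15 22:44:23
fc1/13 |F16_TMM_TOLB_TIMEOUT_DROP_CNT |1486654 |04/07/15 22:44:03
fc1/13 |FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO |108 |04/07/15 22:44:03
"""

row = re.compile(r"(\S+)\s*\|\s*(\S+)\s*\|\s*(\d+)\s*\|\s*(.+)")
history = defaultdict(list)      # (interface, counter) -> [count, ...] newest first

for line in sample.splitlines():
    match = row.match(line.strip())
    if match:
        intf, counter, count, _stamp = match.groups()
        history[(intf, counter)].append(int(count))

for (intf, counter), counts in history.items():
    if len(counts) >= 2:
        print(f"{intf} {counter}: +{counts[0] - counts[1]} between the two samples")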
 SNMP Object                                                           | Description
 ----------------------------------------------------------------------+------------------------------------------------------------
 fcIfTxWaitCount (1.3.6.1.4.1.9.9.289.1.2.1.1.15)                      | TxWait counter
 fcHCIfBBCreditTransistionFromZero (1.3.6.1.4.1.9.9.289.1.2.1.1.40)    |
 fcIfBBCreditTransistionToZero (1.3.6.1.4.1.9.9.289.1.2.1.1.41)        |
 fcIfTxWtAvgBBCreditTransitionToZero (1.3.6.1.4.1.9.9.289.1.2.1.1.38)  |
 fcIfCreditLoss (1.3.6.1.4.1.9.9.289.1.2.1.1.37)                       |
 fcIfTimeOutDiscards (1.3.6.1.4.1.9.9.289.1.2.1.1.35)                  | Timeout discards
 fcIfOutDiscard (1.3.6.1.4.1.9.9.289.1.2.1.1.36)                       | Total number of frames discarded in egress direction, which includes timeout discards
 fcIfLinkResetIns (1.3.6.1.4.1.9.9.289.1.2.1.1.9)                      | Number of link reset protocol errors received by the FC port from the attached FC port
 fcIfLinkResetOuts (1.3.6.1.4.1.9.9.289.1.2.1.1.10)                    | Number of link reset protocol errors issued by the FC port to the attached FC port
 fcIfSlowportCount (1.3.6.1.4.1.9.9.289.1.2.1.1.44)                    | Number of times for which Tx B2B credits were unavailable on a port for a duration longer than the configured admin-delay value in slowport-monitor
 fcIfSlowportOperDelay (1.3.6.1.4.1.9.9.289.1.2.1.1.45)                |
For more information about each OID, such as the object name and MIB, use the Cisco SNMP Object Navigator.
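These objects can be polled from any SNMP management tool. As a rough illustration (not part of the original document), the following Python sketch reads fcIfTxWaitCount twice using the net-snmp snmpget utility and converts the delta into a congestion percentage; the hostname, community string, and interface index are placeholder assumptions, and the table is assumed to be indexed by ifIndex.

import subprocess
import time

HOST = "mds-switch.example.com"          # placeholder hostname
COMMUNITY = "public"                     # placeholder SNMP community
IF_INDEX = "16777216"                    # placeholder ifIndex of the FC port
TXWAIT_OID = "1.3.6.1.4.1.9.9.289.1.2.1.1.15." + IF_INDEX   # fcIfTxWaitCount

def poll_txwait():
    """Return fcIfTxWaitCount as an integer using the snmpget CLI."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, TXWAIT_OID])
    return int(out.split()[-1])

INTERVAL = 20                            # seconds between the two polls
first = poll_txwait()
time.sleep(INTERVAL)
second = poll_txwait()

delta_ticks = second - first             # TxWait increments of 2.5 microseconds
congested_seconds = delta_ticks * 2.5 / 1_000_000
print(f"Port could not transmit for {congested_seconds:.2f} s "
      f"({100 * congested_seconds / INTERVAL:.0f}% of the last {INTERVAL} s)")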
 Feature Name                    | Supported Platforms | Minimum NX-OS Release
 TxWait (2.5 microseconds)       |                     | 6.2(9); extended to MDS 9396S and MDS 9500 only with DS-X9232-256K9, DS-X9248-256K9, DS-X9248-48K9, DS-X9224-96K9, and DS-X9248-96K9 line cards in 6.2(13)
 Slowport-monitor (1 ms)         |                     |
 Credit unavailability at 100 ms | All                 | 5.x.x or 6.x.x
 LR Rcvd B2B                     | All                 | 5.x.x or 6.x.x
                                 | All                 | 5.x.x or 6.x.x
 Credit transition to zero       | All                 | 5.x.x or 6.x.x
Table 13.
 Feature Name                     | Supported Platforms | Minimum NX-OS Release
 Information about dropped frames |                     | 6.x.x
 Display frames in ingress Q      |                     | 5.x.x or 6.x.x
 Display arbitration timeout      | All                 | 5.x.x or 6.x.x
 Timeout discards                 | MDS 9700, MDS 9396S, MDS 9148S, MDS 9250i, and MDS 9500 only with DS-X9232-256K9 and DS-X9248-256K9 line cards | 6.2(13)
                                  | All                 |
                                  | All                 | 6.x.x
 show tech-support slowdrain      | All                 | 6.2(13)
Table 14.
 Feature Name             | Supported Platforms | Minimum NX-OS Release
 congestion-drop timeout  | All                 | 5.x.x or 6.x.x
 no-credit-drop timeout   |                     | 5.x.x or 6.x.x
                          |                     | 6.2(9)
                          | MDS 9396S           | 6.2(13)
                          | All                 | 5.x.x or 6.x.x
 credit-loss-reco         | All                 | 5.x.x or 6.x.x
 lr-rx                    | All                 | 5.x.x or 6.x.x
 lr-tx                    | All                 | 5.x.x or 6.x.x
 timeout-discards         | All                 | 5.x.x or 6.x.x
 tx-credit-not-available  | All                 | 5.x.x or 6.x.x
 tx-discards              | All                 | 5.x.x or 6.x.x
 slowport-count           |                     | 6.2(13)
                          |                     | 6.2(13)
                          |                     | 6.2(13)
Counters

FCP_CNTR_RCM_CH0_LACK_OF_CREDIT (2)
AK_FCP_CNTR_RCM_CH0_LACK_OF_CREDIT (3)
THB_RCM_RCP0_RBBZ_CH0 (4)
F16_RCM_RCP0_RBBZ_CH0 (5)
  Total count of transitions to zero for Rx B2B credits on ch0. These transitions typically indicate that the switch is applying back pressure to the attached device because of perceived congestion, and this perceived congestion can be the result of a lack of Tx B2B credits being returned on an interface over which this device is communicating.

FCP_CNTR_LAF_TOTAL_TIMEOUT_FRAMES (2)
AK_FCP_CNTR_LAF_TOTAL_TIMEOUT_FRAMES (3)
THB_TMM_TOLB_TIMEOUT_DROP_CNT (4)
F16_TMM_TOLB_TIMEOUT_DROP_CNT (5)

FCP_CNTR_QMM_CH0_LACK_OF_TRANSMIT_CREDIT (2)
AK_FCP_CNTR_QMM_CH0_LACK_OF_TRANSMIT_CREDIT (3)
THB_TMM_PORT_TBBZ_CH0 (4)
F16_RCM_RCP0_TBBZ_CH0 (5)
  Total count of transitions to zero for Tx B2B credits on ch0. These transitions are typically the result of the attached device withholding R_RDY primitives from the switch due to congestion in that device.

None (2)
None (3)
THB_TMM_PORT_FRM_DROP_CNT (4)
F16_TMM_PORT_FRM_DROP_CNT (5)
  Number of frames dropped in tolb_path or np path. These drops include all types of frame drops: timeout, offline, abort drops at egress, and so on.

None (2)
None (3)
THB_TMM_PORT_TWAIT_CNT (4)
F16_TMM_PORT_TWAIT_CNT (5)

FCP_CNTR_LAF_C3_TIMEOUT_FRAMES_DISCARD (2)
AK_FCP_CNTR_LAF_C3_TIMEOUT_FRAMES_DISCARD (3)
THB_TMM_TO_CNT_CLASS_3 (4)
F16_TMM_TO_CNT_CLASS_3 (5)

FCP_CNTR_RX_WT_AVG_B2B_ZERO (2)
AK_FCP_CNTR_RX_WT_AVG_B2B_ZERO (3)
FCP_SW_CNTR_RX_WT_AVG_B2B_ZERO (4)
FCP_SW_CNTR_RX_WT_AVG_B2B_ZERO (5) (unable to generate)
  Count of the number of times an interface was at zero Rx B2B credits for 100 ms. This status typically indicates that the switch is withholding R_RDY primitives from the device attached on that interface due to congestion in the path to the devices with which it is communicating.

FCP_CNTR_TX_WT_AVG_B2B_ZERO (2)
AK_FCP_CNTR_TX_WT_AVG_B2B_ZERO (3)
FCP_SW_CNTR_TX_WT_AVG_B2B_ZERO (4, 5)
  Count of the number of times that an interface was at zero Tx B2B credits for 100 ms. This status typically indicates congestion at the device attached on that interface.

FCP_CNTR_FORCE_TIMEOUT_ON (2)
AK_FCP_CNTR_FORCE_TIMEOUT_ON (3)
FCP_SW_CNTR_FORCE_TIMEOUT_ON (4, 5)
  Count of the number of times the system timeout no-credit-drop threshold has been reached by this port. When a port is at zero Tx B2B credits for the time specified, the port starts to drop frames at line rate.

FCP_CNTR_FORCE_TIMEOUT_OFF (2)
AK_FCP_CNTR_FORCE_TIMEOUT_OFF (3)
FCP_SW_CNTR_FORCE_TIMEOUT_OFF (4, 5)
  Count of the number of times that the port has recovered from the system timeout no-credit-drop condition. This status typically means that an R_RDY primitive has been returned or possibly that an LR and LRR event has occurred.

FCP_CNTR_LAF_CF_TIMEOUT_FRAMES_DISCARD (2)
AK_FCP_CNTR_LAF_CF_TIMEOUT_FRAMES_DISCARD (3)
THB_TMM_TO_CNT_CLASS_F (4)
F16_TMM_TO_CNT_CLASS_F (5)

FCP_CNTR_CREDIT_LOSS (2)
AK_FCP_CNTR_CREDIT_LOSS (3)
FCP_SW_CNTR_CREDIT_LOSS (4, 5)
  Count of the number of times that creditmon credit loss recovery has been invoked on a port.

Generation 1 modules are no longer supported by NX-OS 5.0 and later and are not covered by this white paper.
DS-X9112, DS-X9124, DS-X9148, and DS-X9304-18K9 modules are not covered by this document.
Printed in USA
2016 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.
C11-737315-00
07/16