CCNA Data Center Storage Networking Protocol
Cisco Press
800 East 96th Street
Indianapolis, Indiana 46240 USA
May 2006
Trademark Acknowledgments
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Cisco Press or Cisco Systems, Inc. cannot attest to the accuracy of this information. Use of a term in this book
should not be regarded as affecting the validity of any trademark or service mark.
Feedback Information
At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted
with care and precision, undergoing rigorous development that involves the unique expertise of members from the
professional technical community.
Readers' feedback is a natural continuation of this process. If you have any comments regarding how we could
improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through email at
[email protected]. Please make sure to include the book title and ISBN in your message.
We greatly appreciate your assistance.
Publisher: Paul Boger
Cisco Representative: Anthony Wolfenden
Cisco Press Program Manager: Jeff Brady
Executive Editor: Mary Beth Ray
Production Manager: Patrick Kanouse
Development Editor: Andrew Cupp
Project Editor: Interactive Composition Corporation
Copy Editor: Interactive Composition Corporation
Technical Editors: Philip Lowden, Thomas Nosella, Rob Peglar
Book and Cover Designer: Louisa Adair
Composition: Interactive Composition Corporation
Indexer: Tim Wright
Dedication
This book is posthumously dedicated to Don Jones. Don was a good man, a good friend, and a good mentor.
Acknowledgments
The quality of this book is directly attributable to the many people who assisted during the writing process. In particular, I would like to thank Mike Blair for his contribution to the SBCCS/ESCON/FICON section, Tom Burgee for
his contribution to the optical section, Joel Christner for his contribution to the file-level protocols section, and Alan
Conley for his contribution to the management protocols chapter. Additionally, I would like to thank Tuqiang Cao,
Mike Frase, and Mark Bakke for their support. A special thank you goes to Tom Nosella, Phil Lowden, and Robert
Peglar for serving as technical reviewers. Finally, I am very grateful to Henry White for hiring me at Cisco. Without
Henry's confidence in my potential, this book would not have been possible.
Contents at a Glance
Foreword
Introduction
Part I (Chapters 1 through 4)
Part II: OSI Layers (Chapters 5 through 8)
Part III (Chapters 9 through 14; Chapter 11 covers Load Balancing)
Part IV: Appendixes (Appendix A, Appendix B, Appendix C)
Glossary
Index
Table of Contents
Foreword
Introduction
Part I
Chapter 3: Overview of Network Operating Principles
  Conceptual Underpinnings
    Throughput
    Topologies
    Service and Device Discovery
  Ethernet
    Low Overhead Paradigm
    Ethernet Throughput
    Ethernet Topologies
    Ethernet Service and Device Discovery
  TCP/IP Suite
    Value of Ubiquitous Connectivity
    TCP/IP Throughput
    TCP/IP Topologies
    TCP/IP Service and Device Discovery
    Discovery Contexts
  Fibre Channel
    Merger of Channels and Packet Switching
    FC Throughput
    FC Topologies
    FC Service and Device Discovery
  Summary
  Review Questions
Chapter 4: Overview of Modern SCSI Networking Protocols
  iSCSI
    iSCSI Functional Overview
    iSCSI Procedural Overview
  FCP
    FCP Functional Overview
    FCP Procedural Overview
  FCIP
    FCIP Functional Overview
    FCIP Procedural Overview
  iFCP
    iFCP Functional Overview
    iFCP Procedural Overview
  Summary
Part II: OSI Layers
Part III
  TCP Security
  iSCSI Security
  FCIP Security
  Summary
  Conceptual Underpinnings of Storage Management Protocols
  TCP/IP Management
  FC Management
  SCSI Management
  Summary
Appendixes
  Appendix A
  Appendix B
  Appendix C
Glossary
Index
[Pages xviii through xx contained the book's icon legend and example figures: icons for IP and FC networks, iSCSI HBAs, TOEs, FC and iSCSI storage arrays and tape libraries, iFCP gateways, Cisco MDS 9500 switches in FICON, FC, iSCSI, and FCIP roles, FC-BB bridges, NAS and combined NAS/iSCSI/FC devices, iSNS servers, mainframes, and disk control units, plus sample diagrams showing SAM devices and ports, device and port names, port IDs, and LUNs within hosts, storage arrays, and a tape library.]
Foreword
It is a great pleasure to write the foreword for this book. Storage networking technologies have been used for
computer data storage in almost every major corporation worldwide since the late 1990s. Banks, hospitals, credit
card companies, airlines, and universities are just a few examples of the organizations that use these technologies.
Storage networking gave birth to the concept of the Storage Area Network (SAN), a concept based fundamentally
on separating storage and processing resources, with the ability to provision and manage storage as a separate
service to the processing resources.
A SAN consists of not only the storage subsystems, but also the interconnection infrastructure, the data protection subsystems, migration and virtualization technologies, and more. A SAN can have very different characteristics; for example, it can be contained entirely in a single rack, or it might span an entire continent in order to create
the most dependable disaster-tolerant configuration. In all cases, a SAN is meant to be more flexible than direct
attached storage in accommodating the growth and the changes in an organization.
Several products, based on diverse technologies, are available today as building blocks to design Storage Area
Networks. These technologies are quite different and each has its own distinctive properties, advantages, and
disadvantages. It is important for an information technology professional to be able to evaluate them and pick the
most appropriate for the various customer or organization needs. This is not an easy task, because each technology
has a different history, sees the problems from its own point of view, and uses a unique terminology. The associated
complex standards documentation, designed for product developers, usually produces more confusion than
illumination for the casual reader.
In this book, James takes on the challenge of comparing today's most deployed storage networking architectures.
To perform his analysis, he uses a powerful tool, the OSI reference model. By comparing each architecture with the
OSI model, James conducts the reader through the nuances of each technology, layer by layer. An appropriate set of
parameters is introduced for each layer, and used across the presented technologies to analyze and compare them.
In this way, readers familiar with networking have a way to understand the world of storage controllers, while
people familiar with controllers will find a different perspective on what they already know.
The first part of the book introduces the world of storage and storage networking. The basics of storage technology,
including block storage protocols and file access protocols, are presented, followed by a history of how
they evolved to their current status. The seven OSI layers are also introduced, giving the reader the tools for the
subsequent analysis.
The second part of the book is a deep comparative analysis of today's technologies for networked storage, including
iSCSI and Fibre Channel. Each protocol suite is analyzed at the physical and data-link layers; at the network layer;
at the transport layer; and finally at the session, presentation, and application layers.
The third and final part of the book relates to advanced functionalities of these technologies, such as quality of
service, load-balancing functions, security, and management. In particular, security is an element of continuously
increasing importance for storage networking. Because more and more digital data are vital to businesses, keeping
these data secure from unauthorized access is crucial. At the same time, this growing mass of data needs to be
properly managed, but managing a heterogeneous set of devices is not an easy task. Several underlying protocols
for storage management have been defined or are being defined.
Storage networking is a critical concept for today's businesses, and this book provides a unique and helpful way to
better understand it. Storage networking is also continuously evolving, and as such this book may be seen as an
introduction to the information technology infrastructures of the future.
Claudio DeSanti
Technical Leader Data Center BU, Cisco Systems
Vice-Chairman of the ANSI INCITS T11 Technical Committee
Introduction
The modern business environment is characterized by pervasive use of computer and communication
technologies. Corporations increasingly depend on such technologies to remain competitive in the
global economy. Customer relationship management, enterprise resource planning, and electronic
mail are just a few of the many applications that generate new data every day. All that data must
be stored, managed, and accessed effectively if a business is to survive. This is one of the primary
business challenges in the information age, and storage networking is a crucial component of the
solution.
Objectives
This book has four objectives: document details, explain concepts, dispel misconceptions, and compare protocols. The details of the major protocol suites are documented for reference. To that end, this
book aims primarily to disclose rather than assess, and extra effort is given to objectivity.
Additionally, I attempt to explain how each of the major protocol suites operates, and to identify
common understandings. Discussions of how the protocols work are included, but you are encouraged to reference the original standards and specifications for a complete understanding of each protocol. This recommendation also ensures you have the latest information. Because many of the standards
and specifications referenced in this book are draft versions, they are subject to change. Thus, it is
reasonable to expect some of the content in this book to become inaccurate as in-progress specifications are finalized. Finally, comparisons are drawn between the major protocol suites to help you
understand the implications of your network design choices. To achieve these objectives, a large
amount of reference data must be included. However, this book is written so that you can read it from
cover to cover. We have tried to integrate reference data so it is easily and quickly accessible. For this
reason, I use tables and bulleted lists extensively.
In support of the stated objectives, we have made every effort to improve clarity. Colloquialisms are
avoided throughout the book. Moreover, special attention is paid to the use of the words may, might,
must, and should. The word may implies permissibility. The word might implies possibility. The word
must imposes a requirement. The word should implies a desirable behavior but does not impose a
requirement.
Intended Audiences
This book has two primary audiences. The first audience includes storage administrators who need to
learn more about networking. We have included much networking history, and have explained many
networking concepts to help acclimate storage administrators to the world of networking. The second
audience includes network administrators who need to learn more about storage. This book examines
networking technologies in the context of SCSI so that network administrators can fully understand the
network requirements imposed by open systems storage applications. Many storage concepts, terms,
and technologies exist that network administrators need to know and understand. Although this book
provides some storage knowledge for network administrators, other resources should be consulted for a
full understanding of storage. One such resource is Storage Networking Fundamentals: An Introduction
to Storage Devices, Subsystems, Applications, Management, and File Systems (by Marc Farley,
ISBN: 1-58705-162-1, Cisco Press).
Organization
This book discusses and compares the networking protocols that underlie modern open systems,
block-oriented storage networks. To facilitate a methodical analysis, the book is divided into three parts.
The first part introduces readers to the field of storage networking and the Open Systems Interconnection (OSI) reference model. The second part examines in detail each of the major protocol suites layer by layer, beginning with the lowest layer of the OSI reference model. The third part introduces readers
to several advanced networking topics. As the book progresses, each chapter builds upon the previous
chapters. Thus, you will benefit most by reading this book from front to back. However, all chapters can
be leveraged in any order for reference material. Some of the content in this book is based upon emerging standards and in-progress specifications. Thus, you are encouraged to consult the latest version of
in-progress specifications for recent updates and changes.
- Recount some of the key historical events and current business drivers in the storage market
- Recognize the key characteristics and major benefits of each type of storage network
- Describe how the technologies discussed in this chapter relate to one another
- Locate additional information on the major topics of this chapter
Some of these features were supported by other storage media, but the hard disk was the
first medium to support all of these features. Consequently, the hard disk helped make many
Storage networks provide the potential to realize numerous competitive advantages such as
the following:
Improved flexibility: Adds, moves, and changes are a fact of life in the operation
of any computing or communication infrastructure. Storage networks allow storage
administrators to make adds, moves, and changes more easily and with less downtime
than otherwise possible in the DAS model.
Data mobility: Storage networks increase the mobility of stored data for various
purposes, such as migration to new hardware and replication to a secondary data
center.
Another driver for change is the evolution of relevant standards. An understanding of the
standards bodies that represent all aspects of storage functionality enables better understanding of historical storage designs and limitations, and of future directions. In addition
to the aforementioned SNIA, other standards bodies are important to storage networking.
The American National Standards Institute (ANSI) oversees the activities of the InterNational
Committee for Information Technology Standards (INCITS). Two subcommittees of the
INCITS are responsible for standards that are central to understanding storage networks.
The INCITS T10 subcommittee owns SCSI standards, and the INCITS T11 subcommittee
owns Fibre Channel standards. The Internet Engineering Task Force (IETF) has recently
emerged as another driving force in storage. The IETF created the IP Storage Working
Group (IPS-WG), which owns all IP-based storage standards. Standards bodies are
discussed in more detail in Chapter 2, OSI Reference Model Versus Other Network
Models.
NOTE
The ANSI X3 committee was known officially as the Accredited Standards Committee X3
from 1961 to 1996. The committee was renamed the INCITS in 1996.
[Figure: an FC-SAN in which hosts with FC HBAs connect through a Fibre Channel switch, shown alongside a TCP/IP LAN built on an Ethernet switch.]
Common use of the phrase storage area network and the acronym SAN historically has
referred to the Fibre Channel SAN model. However, these terms are somewhat ambiguous
in light of recent developments such as the ratification of the Internet SCSI (iSCSI)
protocol. The iSCSI protocol enables the use of the Transmission Control Protocol (TCP)
on the Internet Protocol (IP) on Ethernet in place of Fibre Channel to transport SCSI traffic.
An IP network dedicated to the transport of iSCSI traffic is commonly referred to as an
IP-SAN. Note that any IP network can transport SCSI traffic; however, a multipurpose IP
network that carries SCSI traffic is not called an IP-SAN. Likewise, the acronym FC-SAN
is becoming common for the Fibre Channel SAN model. The unqualified term SAN is
increasingly used to refer generically to both IP-SANs and FC-SANs. Figure 1-2 illustrates
a simple IP-SAN.
Figure 1-2 Simple IP-SAN
[Figure: iSCSI-attached hosts using TOEs connect through Ethernet switches on a TCP/IP LAN to an iSCSI storage array.]
NOTE
There have long been discussions of so-called Internet attached storage (IAS) devices
characterized by the use of HTTP to retrieve files from and store files to the storage devices,
which are directly attached to an IP network. No standards or other specifications have been
produced to define how the operation of these devices would differ from the operation of a
normal web server. Because most NAS filers support the use of HTTP for file-level access,
IAS can be considered another name for NAS.
Figure 1-3 Filer Network
[Figure: NAS filers connect through Ethernet switches on a TCP/IP LAN.]
One loose definition for a storage network is any network that transports any file-level or
block-level storage protocol and associated data. Although valid, this definition is of little
practical use because it essentially includes all the preceding definitions and every traditional
computer network on the planet. For example, virtually every Windows-based network
transports CIFS, which is the file-level storage protocol native to Microsoft operating
systems. Likewise, virtually every UNIX-based network transports NFS, which is the
file-level storage protocol native to UNIX operating systems. Even the Internet is included
in this definition because the Internet transports FTP, HTTP, and various other protocols
designed to transfer files. These points notwithstanding, it might be useful to have a name
for traditional data networks that also transport block-level storage protocols. Successfully
integrating block-level storage protocols with traditional data and real-time data, such as
voice over IP (VoIP) and video conferencing, can require special design considerations for
timeslot and wavelength allocations, topology modifications, and quality of service (QoS)
mechanisms, in addition to hardware and software upgrades for IP and optical networking
devices and network management tools. Such changes might be significant enough to
warrant a distinction between traditional data networks and these new enhanced networks.
A seemingly appropriate, albeit generic, name is storage-enabled network; however, the
industry hasn't adopted this or any other name yet.
Block-oriented protocols (also known as block-level protocols) read and write individual
fixed-length blocks of data. For example, when a client computer uses a file-level storage
protocol to write a file to a disk contained in a server, the server first receives the file via
the file-level protocol and then invokes the block-level protocol to segment the file into
blocks and write the blocks to disk. File-level protocols are discussed in more detail in a
subsequent section of this chapter. The three principal block-level storage protocols in use
today are advanced technology attachment (ATA), small computer system interface (SCSI),
and single-byte command code set (SBCCS).
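The segmentation step described above can be sketched in a few lines. This is an illustrative model only; the 512-byte block size and the zero-padding behavior are assumptions for the example, not requirements of any protocol discussed here:

```python
BLOCK_SIZE = 512  # a common fixed block length for disk devices (assumed for illustration)

def file_to_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Segment a byte stream into fixed-length blocks, zero-padding the final block."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if len(block) < block_size:
            block = block.ljust(block_size, b"\x00")  # pad the partial tail block
        blocks.append(block)
    return blocks

# A 1200-byte "file" becomes three fixed-length blocks: 512 + 512 + 176 (padded to 512).
blocks = file_to_blocks(b"x" * 1200)
```

A file-level protocol delivers the whole byte stream; the block-level protocol sees only the resulting fixed-length blocks and the addresses to which they are written.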
ATA
ATA is an open-systems standard originally started by the Common Access Method (CAM)
committee and later standardized by the ANSI X3 committee in 1994. Several subsequent
ATA standards have been published by ANSI. The Small Form Factor (SFF) Committee has
published several enhancements, which have been included in subsequent ANSI standards.
Each ANSI ATA standard specifies a block-level protocol, a parallel electrical interface, and
a parallel physical interface. ATA operates as a bus topology and allows up to two devices
per bus. Many computers contain two or more ATA buses. The first ANSI ATA standard is
sometimes referred to as ATA-1 or Integrated Drive Electronics (IDE). Many updates to
ATA-1 have focused on electrical and physical interface enhancements. The ANSI ATA
standards include ATA-1, ATA-2 (also known as enhanced IDE [EIDE]), ATA-3, ATA/
ATAPI-4, ATA/ATAPI-5, and ATA/ATAPI-6. The current work in progress is ATA/ATAPI-7.
Early ATA standards only supported hard disk commands, but the ATA Packet Interface
(ATAPI) introduced SCSI-like commands that allow CD-ROM and tape drives to operate
on an ATA electrical interface. Sometimes we refer to these standards collectively as
parallel ATA (PATA).
The serial ATA (SATA) Working Group, an industry consortium, published the SATA 1.0
specication in 2001. ANSI is incorporating SATA 1.0 into ATA/ATAPI-7. The SATA
Working Group continues other efforts by including minor enhancements to SATA 1.0 that
might not be included in ATA/ATAPI-7, the development of SATA II, and, most notably, a
collaborative effort with the ANSI T10 subcommittee to align SATA II with the Serial
Attached SCSI (SAS) specication. The future of serial ATA standards is unclear in light
of so many efforts, but it is clear that serial ATA technologies will proliferate.
ATA devices have integrated controller functionality. Computers that contain ATA devices
communicate with the devices via an unintelligent electrical interface (sometimes mistakenly
called a controller) that essentially converts electrical signals between the system bus and
the ATA bus. The ATA/ATAPI command set is implemented in software. This means that the
host central processing unit (CPU) shoulders the processing burden associated with storage
I/O. The hard disks in most desktop and laptop computers implement ATA. ATA does not
support as many device types or as many advanced communication features as SCSI.
The ATA protocol typically is not used in storage networks, but ATA disks often are used
in storage subsystems that connect to storage networks. These storage devices act as storage
protocol converters by speaking SCSI on their SAN interfaces and ATA on their internal
ATA bus interfaces. The primary benet of using ATA disks in storage subsystems is cost
savings. ATA disk drives historically have cost less than SCSI disk drives for several
reasons. SCSI disk drives typically have a higher mean time between failures (MTBF)
rating, which means that they are more reliable. Also, SCSI disk drives historically have
provided higher performance and higher capacity. ATA disk drives have gained significant
ground in these areas, but still tend to lag behind SCSI disk drives. Of course, these features
drive the cost of SCSI disk drives higher. Because the value of data varies from one
application to the next, it makes good business sense to store less valuable data on less
costly storage devices. Thus, ATA disks increasingly are deployed in SAN environments to
provide primary storage to comparatively low-value applications. ATA disks also are being
used as first-tier media in new backup/restore solutions, whereby tapes are used as second-tier media for long-term archival or off-site storage. The Enhanced Backup Solutions
Initiative (EBSI) is an industry effort to develop advanced backup techniques that leverage
the relatively low cost of ATA disks. Figure 1-4 illustrates an ATA-based storage subsystem
connected to an FC-SAN.
Figure 1-4 FC-Attached ATA Subsystem
[Figure: an FC-attached server connects through a Fibre Channel switch to an FC-attached ATA subsystem; SCSI is spoken on the FC interfaces.]
SCSI
SCSI is an open-systems standard originally started by Shugart Associates as the Shugart
Associates Systems Interface (SASI) in 1981 and later standardized by the ANSI X3
committee in 1986. Each early SCSI standard specified a block-level protocol, a parallel
electrical interface, and a parallel physical interface. These standards are known as SCSI-1
and SCSI-2. Each operates as a bus topology capable of connecting 8 and 16 devices,
respectively.
The SCSI-3 family of standards separated the physical interface, electrical interface,
and protocol into separate specications. The protocol commands are separated into two
categories: primary and device-specic. Primary commands are common to all types of
devices, whereas device-specic commands enable operations unique to each type of
device. The SCSI-3 protocol supports a wide variety of device types and transmission
technologies. The supported transmission technologies include updated versions of the
SCSI-2 parallel electrical and physical interfaces in addition to many serial interfaces.
Even though most of the mapping specications for transport of SCSI-3 over a given
transmission technology are included in the SCSI-3 family of standards, some are not. An
example is the iSCSI protocol, which is specified by the IETF.
Most server and workstation computers that employ the DAS model contain either SCSI
devices attached via an internal SCSI bus, or they access SCSI devices contained in
specialized external enclosures. In the case of external DAS, SCSI bus and Fibre Channel
point-to-point connections are common. Computers that access SCSI devices typically
implement the SCSI protocol in specialized hardware generically referred to as a storage
controller. When the SCSI protocol is transported over a traditional parallel SCSI bus, a
storage controller is called a SCSI adapter or SCSI controller. If a SCSI adapter has an
onboard CPU and memory, it can control the system bus temporarily, and is called a SCSI
host bus adapter (HBA). When the SCSI protocol is transported over a Fibre Channel
connection, the storage controller always has a CPU and memory, and is called a Fibre
Channel HBA. When a SCSI HBA or Fibre Channel HBA is used, most storage I/O
processing is offloaded from the host CPU. When the SCSI protocol is transported over
TCP/IP, the storage controller may be implemented via software drivers using a standard
network interface card (NIC) or a specialized NIC called a TCP offload engine (TOE),
which has a CPU and memory. As its name implies, a TOE offloads TCP processing from
the host CPU. Some TOEs also implement iSCSI logic to offload storage I/O processing
from the host CPU.
NOTE
All SCSI devices are intelligent, but SCSI operates as a master/slave model. One SCSI
device (the initiator) initiates communication with another SCSI device (the target) by
issuing a command, to which a response is expected. Thus, the SCSI protocol is half-duplex
by design and is considered a command/response protocol. The initiating device is usually
a SCSI controller, so SCSI controllers typically are called initiators. SCSI storage devices
typically are called targets. That said, a SCSI controller in a modern storage array acts
as a target externally and as an initiator internally. Also note that array-based replication
software requires a storage controller in the initiating storage array to act as initiator both
externally and internally. So it is important to consider the context when discussing SCSI
controllers.
The SCSI parallel bus topology is a shared medium implementation, so only one initiator/
target session can use the bus at any one time. Separate sessions must alternate accessing
the bus. This limitation is removed by newer serial transmission facilities that employ
[Figure: an FC-attached server connects through a Fibre Channel switch to an FC-attached SCSI subsystem.]
SCSI parallel bus interfaces have one important characteristic: their ability to operate
asynchronously or synchronously. Asynchronous mode requires an acknowledgment for
each outstanding command before another command can be sent. Synchronous mode
allows multiple commands to be issued before receiving an acknowledgment for the first
command issued. The maximum number of outstanding commands is negotiated between
the initiator and the target. Synchronous mode allows much higher throughput. Despite the
similarity of this mechanism to the windowing mechanism of TCP, this mechanism is
implemented by the SCSI electrical interface (not the SCSI protocol).
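The effect of the negotiated outstanding-command limit can be modeled with a toy counter. This is a conceptual sketch, not SCSI signaling; the function name and the idea of counting stalls are invented for illustration:

```python
def stalls(commands: int, max_outstanding: int) -> int:
    """Count how often the initiator must stop and wait for an acknowledgment.

    max_outstanding = 1 models asynchronous mode (each command must be
    acknowledged before the next is sent); a larger negotiated value models
    synchronous mode, which is why synchronous mode yields higher throughput.
    """
    outstanding = 0
    waits = 0
    for _ in range(commands):
        if outstanding == max_outstanding:
            waits += 1        # bus idles until the target acknowledges a command
            outstanding -= 1
        outstanding += 1      # issue the next command
    return waits

async_waits = stalls(8, 1)    # asynchronous mode: must wait before 7 of 8 commands
sync_waits = stalls(8, 4)     # synchronous mode with a window of 4: waits only 4 times
```

The larger the negotiated window, the less often the initiator idles, which is the throughput advantage the text describes.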
Another important point is the contrasting meaning of the word synchronous in the context of the SCSI parallel bus versus its meaning in the context of long-distance storage
replication. In the latter context, synchronous refers not to the mode of communication, but
to the states of the primary and secondary disk images. The states are guaranteed to be
synchronized when the replication software is operating in synchronous mode. When a host
(acting as SCSI initiator) sends a write command to the primary storage device, the primary
storage device (acting as SCSI target) caches the data. The primary storage device (acting
as SCSI initiator) then forwards the data to the secondary storage device at another site. The
secondary storage device (acting as SCSI target) writes the data to disk and
then sends acknowledgment to the primary storage device, indicating that the command
completed successfully. Only after receiving acknowledgment does the primary storage
device (acting as SCSI target) write the data to disk and send acknowledgment of successful
completion to the initiating host. Because packets can be lost or damaged in transit over
long distances, the best way to ensure that both disk images are synchronized is to expect
an acknowledgment for each request before sending another request. Using this method, the
two disk images can never be more than one request out of sync at any point in time.
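The event ordering described above can be captured in a short sketch. All names here are illustrative; real replication software exposes nothing like this interface:

```python
def synchronous_write(data: bytes, primary: dict, secondary: dict, log: list) -> None:
    """Model the event ordering of one synchronously replicated write."""
    log.append("host -> primary: write command")
    primary["cache"] = data                  # primary (acting as target) caches the data
    log.append("primary -> secondary: forward data")
    secondary["disk"] = data                 # secondary (acting as target) writes to disk
    log.append("secondary -> primary: ack")
    primary["disk"] = data                   # only now does the primary commit to disk
    log.append("primary -> host: ack")       # host learns of success last

log = []
primary, secondary = {}, {}
synchronous_write(b"payload", primary, secondary, log)
```

Because the host's acknowledgment is withheld until the secondary has committed, the two disk images can never diverge by more than the single outstanding request.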
TIP
The terms synchronous and asynchronous should always be interpreted in context. The
meanings of these terms often reverse from one context to another.
SBCCS
SBCCS is a generic term describing the mechanism by which IBM mainframe computers
perform I/O using single-byte commands. IBM mainframes conduct I/O via a channel
architecture. A channel architecture comprises many hardware and software components
including channel adapters, adapter cables, interface assemblies, device drivers, I/O
programming interfaces, I/O units, channel protocols, and so on. IBM channels come in
two flavors: byte multiplexer and block multiplexer. Channels used for storage employ
block multiplexer communication. I/O units used for storage are called disk control units
(CU). Mainframes communicate with CUs via the channel protocol, and CUs translate
channel protocol commands into storage I/O commands that the storage device (for
example, disk or tape drive) can understand. This contrasts with ATA and SCSI operations,
wherein the host initiates the storage I/O commands understood by the storage devices.
Figure 1-6 illustrates this contrast.
Figure 1-6 Channel Protocol
[Figure: a mainframe connects through a FICON director to a disk control unit; the mainframe speaks the channel protocol, and the CU translates it into storage I/O commands.]
Direct-coupled interlock (DCI) requires a response for each outstanding command before
another command can be sent. This is conceptually analogous to asynchronous mode on a
SCSI parallel bus interface. Data streaming (DS) can issue multiple commands while waiting
for responses. This is conceptually analogous to synchronous mode on a SCSI parallel bus
interface. A block channel protocol consists of command, control, and status frames.
Commands are known as channel command words (CCW), and each CCW contains a single-byte
command code. Supported CCWs vary depending on which CU hardware model is used.
Some configurations allow an AIX host to appear as a CU to the mainframe. Control frames
are exchanged between the channel adapter and the CU during the execution of each
command. Upon completion of data transfer, the CU sends ending status to the channel
adapter in the mainframe. Command and status
frames are acknowledged. Applications must inform the operating system of the I/O request
details via operation request blocks (ORB). ORBs contain the memory address where
the CCWs to be executed have been stored. The operating system then initiates each I/O
operation by invoking the channel subsystem (CSS). The CSS determines which channel
adapter to use for communication to the designated CU and then transmits CCWs to the
designated CU.
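The ORB/CCW flow described above can be summarized in a short conceptual sketch. The classes below are illustrative stand-ins, not real mainframe interfaces: an ORB points at a stored CCW chain, and the channel subsystem selects a path to the designated CU and transmits the command codes, collecting ending status.

```python
# Conceptual sketch of mainframe channel I/O; all names are illustrative,
# not real z/OS or channel-subsystem interfaces.

class ORB:
    """Operation request block: points at the CCW chain to execute."""
    def __init__(self, ccw_address, ccws):
        self.ccw_address = ccw_address  # memory address of the stored CCWs
        self.ccws = ccws                # the chain itself (simplified)

class ControlUnit:
    def execute(self, ccw):
        # The CU translates each channel command into a storage I/O command
        # and returns ending status when the transfer completes.
        return f"ending status for command 0x{ccw:02x}"

class ChannelSubsystem:
    def __init__(self, channels):
        self.channels = channels        # CU identifier -> reachable CU

    def start_io(self, orb, cu_id):
        cu = self.channels[cu_id]       # choose the path to the designated CU
        # Transmit each single-byte command code; collect ending status.
        return [cu.execute(ccw) for ccw in orb.ccws]
```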
SBCCS is a published protocol that may be implemented without paying royalties to IBM.
However, IBM might not support SBCCS if implemented without the control unit port
(CUP) feature, and CUP is not available for royalty-free implementation. Furthermore,
only IBM and IBM-compatible mainframes, peripherals, and channel extension devices
implement SBCCS. So, the open nature of SBCCS is cloudy, and the term pseudo-proprietary
seems appropriate. There are two versions of SBCCS: one for enterprise
systems connection (ESCON) and one for Fibre Channel connection (FICON).
Mainframes have always played a significant role in the computing industry. In response to
advances made in open computing systems and to customer demands, mainframes have
evolved by incorporating new hardware, software, and networking technologies. Because
this book does not cover mainframes in detail, the following section provides a brief
overview of mainframe storage networking for the sake of completeness.
ESCON
ESCON is a proprietary IBM technology introduced in 1990 to overcome the limitations of
the Bus-and-Tag parallel channel architecture. ESCON converters were made available to
preserve Bus-and-Tag investments by bridging between the two architectures. Today, very
little Bus-and-Tag remains in the market. Whereas Bus-and-Tag employs copper cabling
and parallel transmission, the ESCON architecture employs optical cabling and serial
transmission. The ESCON architecture is roughly equivalent to Layers 1 and 2 of the Open
Systems Interconnection (OSI) reference model published by the International Organization
for Standardization. We discuss the OSI reference model in detail in Chapter 2, "OSI
Reference Model Versus Other Network Models."
In addition to transporting channel protocol (SBCCS) frames, ESCON defines a new frame
type at the link level for controlling and maintaining the transmission facilities. ESCON
operates as half-duplex communication in a point-to-point topology and supports the
FICON
FICON was introduced in 1998 to overcome the limitations of the ESCON architecture. FICON
is the term given to the pseudo-proprietary IBM SBCCS operating on a standard Fibre Channel
infrastructure. The version of SBCCS used in FICON is less chatty than the version used in
ESCON. The ANSI FC-SB specification series maps the newer version of SBCCS to Fibre
Channel. All aspects of FICON infrastructure are based on ANSI FC standards. FICON
operates in two modes: bridged (also known as FCV mode) and native (also known as FC
mode). Some FICON hardware supports SCSI (instead of SBCCS) on Fibre Channel. This
is sometimes called FICON Fibre Channel Protocol (FCP) mode, but there is nothing about
it that warrants use of the term FICON. FICON FCP mode is just mainstream open-systems
storage networking applied to mainframes.
In bridged mode, hosts use FICON channel adapters to connect to ESCON directors. A FICON
bridge adapter is installed in an ESCON director to facilitate communication. A FICON bridge
adapter time-division multiplexes up to eight ESCON signals onto a single FICON signal.
Investments in late-model ESCON directors and CUs are preserved, allowing a phased migration path to FICON over time. Early ESCON directors do not support the FICON bridge adapter.
The ESCON SBCCS is used for storage I/O operations in bridged mode.
In native mode, FICON uses a modified version of the SBCCS and replaces the ESCON
transmission facilities with the Fibre Channel transmission facilities. FICON native mode
resembles ESCON in several respects.
Unlike ESCON, FICON native mode operates in full-duplex mode. Unlike ESCON
directors, Fibre Channel switches employ packet switching to create connections between
hosts and CUs. FICON native mode takes advantage of the packet-switching nature of Fibre
Channel to allow up to 32 simultaneously active logical connections per physical channel
adapter. Like ESCON, FICON is limited to 256 ports per director, but virtual fabrics can
extend this scale limitation. FICON native mode retains the ESCON limit of two directors
cascaded between a host channel adapter and a CU, though some mainframes support only
a single intermediate director.
Fibre Channel transmissions are encoded via 8b/10b signaling. Operating at 1.0625
gigabits per second (Gbps), the maximum supported distance per Fibre Channel link is
500 meters using 50 micron MMF or 10 km using SMF, and the maximum link level
throughput is 106.25 MBps. Operating at 2.125 Gbps, the maximum supported distance per
Fibre Channel link is 300 meters using 50 micron MMF or 10 km using SMF, and the
maximum link-level throughput is 212.5 MBps. By optionally using IBM's mode-conditioning
patch (MCP) cable, the maximum distance using 50 micron MMF is extended
to 550 meters operating at 1.0625 Gbps. The MCP cable transparently mates the transmit
SMF strand to an MMF strand. Both ends of the link must use an MCP cable. This provides
a migration path from multi-mode adapters to single-mode adapters prior to cable plant
conversion from MMF to SMF. The MCP cable is not currently supported operating at
2.125 Gbps. By using an SMF cable plant end-to-end and two FICON directors, the
maximum distance from host channel adapter to CU can be extended up to 30 km. Link-level
buffering in FICON equipment (not found in ESCON equipment) enables maximum
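The throughput figures quoted above follow directly from 8b/10b encoding: every data byte is transmitted as 10 bits, so the usable byte rate is simply the line rate divided by 10. A quick check:

```python
def link_throughput_mbps(line_rate_gbps):
    """Usable MBps on an 8b/10b-encoded link: each byte costs 10 bits."""
    return line_rate_gbps * 1e9 / 10 / 1e6

print(link_throughput_mbps(1.0625))  # 106.25 MBps
print(link_throughput_mbps(2.125))   # 212.5 MBps
```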
CIFS
In Microsoft Windows environments, clients historically requested files from servers via
the Server Message Block (SMB) protocol. In 1984, IBM published the basis for the SMB
protocol. Based on IBM's work, Microsoft and Intel subsequently published the OpenNET
File Sharing Protocol. As the protocol evolved, Intel withdrew from the effort, and the
protocol was renamed the SMB File Sharing Protocol; however, SMB provides more than
just file-sharing services.
SMB relies upon the services of the Network Basic Input Output System (NetBIOS) rather
than Windows Sockets (Winsock) services. NetBIOS on IP networks was standardized by the
IETF via request for comment (RFC) 1001 and RFC 1002 in 1987. Those RFCs enabled
the use of SMB over IP networks and greatly expanded the market for SMB as IP networks
began to rapidly proliferate in the 1990s. SMB remained proprietary until 1992, when the
X/Open committee (now known as the Open Group) standardized SMB via the common
application environment (CAE) specification (document 209), enabling interoperability
with UNIX computers. Even though SMB is supported on various UNIX and Linux operating
systems via the open-source software package known as Samba, it is used predominantly
by Windows clients to gain access to data on UNIX and Linux servers. Even with the
X/Open standardization effort, SMB can be considered proprietary because Microsoft has
continued developing the protocol independent of the Open Group's efforts. SMB eventually
evolved enough for Microsoft to rename it again. SMB's new name is the CIFS file sharing
protocol (commonly called CIFS).
Microsoft originally published the CIFS specification in 1996. With the release of Windows
2000, CIFS replaced SMB in Microsoft operating systems. CIFS typically operates on
NetBIOS, but CIFS also can operate directly on TCP. CIFS is proprietary to the extent that
Microsoft retains all rights to the CIFS specification. CIFS is open to the extent that Microsoft
has published the specification and permits other for-profit companies to implement CIFS
without paying royalties to Microsoft. This royalty-free licensing agreement allows NAS
vendors to implement CIFS economically in their own products. CIFS integration enables
NAS devices to serve Windows clients without requiring new client software. Unfortunately,
Microsoft has not extended its royalty-free CIFS license to the open-source community.
To the contrary, open-source implementations are strictly prohibited. This somewhat negates
the status of CIFS as an open protocol.
Note that open should not be confused with standard. Heterogeneous NAS implementations
of CIFS, combined with Microsoft's claims that CIFS is a standard protocol, have led to
confusion about the true status of the CIFS specification. Microsoft submitted CIFS to the
IETF in 1997 and again in 1998, but the CIFS specification was never published as an RFC
(not even as informational). The SNIA formed a CIFS documentation working group in
2001 to ensure interoperability among heterogeneous vendor implementations. However,
the working group's charter does not include standardization efforts. The working group
published a CIFS technical reference, which documents existing CIFS implementations.
The SNIA CIFS technical reference serves an equivalent function to an IETF informational
RFC and is not a standard. Even though CIFS is not a de jure standard, it clearly is a de
facto standard by virtue of its ubiquity in the marketplace.
A CIFS server makes a local file system, or some portion of a local file system, available to
clients by sharing it. A client accesses a remote file system by mapping a drive letter to the
share or by browsing to the share. When browsing with the Windows Explorer application, a
uniform naming convention (UNC) address specifies the location of the share. The UNC
address includes the name of the server and the name of the share. CIFS supports file and
folder change notification, file and record locking, read-ahead and write-behind caching, and
many other functions.
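As a small illustration, a UNC address of the form \\server\share can be split into its server and share components with a few lines of Python (the server and share names below are hypothetical):

```python
def parse_unc(unc):
    """Split a UNC address of the form \\\\server\\share into its parts."""
    if not unc.startswith("\\\\"):
        raise ValueError("not a UNC address")
    server, _, share = unc[2:].partition("\\")
    return server, share

# Hypothetical server and share names:
print(parse_unc(r"\\fileserver1\projects"))  # ('fileserver1', 'projects')
```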
NFS
Sun Microsystems created NFS in 1984 to allow UNIX operating systems to share files.
Sun immediately made NFS available to the computer industry at large via a royalty-free
license. In 1986, Sun introduced PC-NFS to extend NFS functionality to PC operating
systems. The IETF first published the NFS v2 specification in 1989 as an informational
RFC (1094). NFS v3 was published via an informational RFC (1813) in 1995. Both NFS
v2 and NFS v3 were widely regarded as standards even though NFS was not published via
a standards-track RFC until 2000, when NFS v4 was introduced in RFC 3010. RFC 3530
is the latest specification of NFS v4, which appears to be gaining momentum in the
marketplace. NFS v4 improves upon earlier NFS versions in several different areas
including security, caching, locking, and message communication efficiency. Even though
NFS is available for PC operating systems, it always has been and continues to be most
widely used by UNIX and Linux operating systems.
An NFS server makes a local file system, or some portion of a local file system, available
to clients by exporting it. A client accesses a remote file system by mounting it into the local
file system at a client-specified mount point. All versions of NFS employ the remote-procedure
call (RPC) protocol and a data abstraction mechanism known as external data
representation (XDR). Both the RPC interface and the XDR originally were developed by
Sun and later published as standards-track RFCs.
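XDR's canonical encoding (RFC 4506) uses big-endian, 4-byte-aligned fields; for example, a string is encoded as a 4-byte length followed by its bytes, zero-padded to a 4-byte boundary. A minimal sketch:

```python
import struct

def xdr_uint(value):
    """Encode an unsigned int as XDR: 4 bytes, big-endian."""
    return struct.pack(">I", value)

def xdr_string(s):
    """Encode a string as XDR: length prefix, bytes, zero-padded to 4 bytes."""
    data = s.encode("ascii")
    pad = (4 - len(data) % 4) % 4
    return xdr_uint(len(data)) + data + b"\x00" * pad

encoded = xdr_string("export")  # 4-byte length + 6 bytes + 2 pad bytes
print(len(encoded))             # 12
```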
DAFS
DAFS partially derives from NFS v4. The DAFS protocol was created by a computer
industry consortium known as the DAFS Collaborative and first published in 2001. The
DAFS protocol was submitted to the IETF in 2001 but was never published as an RFC.
The DAFS protocol is not meant to be used in wide-area networks. DAFS is designed to
optimize shared le access in low-latency environments such as computer clusters. To
accomplish this goal, the DAFS protocol employs remote direct memory access (RDMA).
RDMA allows an application running on one host to access memory directly in another
host with minimal consumption of operating system resources in either host. The DAFS
protocol can be implemented in user mode or kernel mode, but kernel mode negates some
of the benets of RDMA.
RDMA is made possible by a class of high-speed, low-latency, high-reliability interconnect
technologies. These technologies are referred to as direct access transports (DAT), and the
most popular are the virtual interface (VI) architecture, the Sockets direct protocol (SDP),
iSCSI extensions for RDMA (iSER), the Datamover architecture for iSCSI (DA), and the
InfiniBand (IB) architecture. RDMA requires modification of applications that were written
to use traditional network le system protocols such as CIFS and NFS. For this reason, the
DAFS Collaborative published a new application programming interface (API) in 2001.
The DAFS API simplifies application modifications by hiding the complexities of the
DAFS protocol. The SNIA formed the DAFS Implementers Forum in 2001 to facilitate the
development of interoperable products based on the DAFS protocol.
Another computer industry consortium known as the Direct Access Transport (DAT)
Collaborative developed two APIs in 2002. One API is used by kernel mode processes, and
the other is used by user mode processes. These APIs provide a consistent interface to DAT
services regardless of the underlying DAT. The DAFS protocol can use either of these APIs
to avoid grappling with DAT-specific APIs.
NDMP
NDMP is a standard protocol for network-based backup of file servers. There is some
confusion about this, as some people believe NDMP is intended strictly for NAS devices.
This is true only to the extent that NAS devices are, in fact, highly specialized file servers.
But there is nothing about NDMP that limits its use to NAS devices. Network Appliance
is the leading vendor in the NAS market. Because Network Appliance was one of two
companies responsible for the creation of NDMP, its proprietary NAS filer operating system
has supported NDMP since before it became a standard. This has fueled the misconception
that NDMP is designed specically for NAS devices.
The purpose of NDMP is to provide a common interface for backup applications. This
allows backup software vendors to concentrate on their core competencies instead of
wasting development resources on the never-ending task of agent software maintenance.
File server and NAS filer operating systems that implement NDMP can be backed up using
third-party software. The third-party backup software vendor does not need to explicitly
support each operating system with custom agent software. This makes NDMP an important
aspect of heterogeneous data backup.
NDMP separates control traffic from data traffic, which allows centralized control. A central
console initiates and controls backup and restore operations by signaling to servers and
filers. The source host then dumps data to a locally attached tape drive or to another NDMP-enabled
host with an attached tape drive. Control traffic flows between the console and the
source/destination hosts. Data traffic flows within a host from disk drive to tape drive or
between the source host and destination host to which the tape drive is attached. For large-scale
environments, centralized backup and restore operations are easier and more cost-effective
to plan, implement, and operate than distributed backup solutions. Figure 1-7
shows the NDMP control and data traffic flows.
Figure 1-7 [figure: a backup server exchanges NDMP control traffic through an Ethernet switch with the source and destination servers]
Figure 1-8 [figure: the NDMP data path; the source and destination hosts use FC-HBAs through a Fibre Channel switch for data traffic, with NICs on an Ethernet switch for control traffic]
SONET/SDH
SONET is the transmission standard for long-haul and metropolitan carrier networks in
North America. There are approximately 135,000 metropolitan area SONET rings deployed
in North America today. SDH is the equivalent standard used throughout the rest of the
world. Both are time division multiplexing (TDM) schemes designed to operate on fiber-optic
cables, and both provide highly reliable transport services over very long distances.
Most storage replication solutions traverse SONET or SDH circuits, though the underlying
SONET/SDH infrastructure may be hidden by a network protocol such as Point-to-Point
Protocol (PPP) or Asynchronous Transfer Mode (ATM). SONET and SDH are often
collectively called SONET/SDH because they are nearly identical. This overview discusses
only SONET on the basis that SDH is not significantly different in the context of storage
networking.
SONET evolved (among other reasons) as a means of extending the North American
Digital Signal Services (DS) hierarchy, which is also TDM oriented. The DS hierarchy uses
one 64 kilobits per second (kbps) timeslot as the base unit. This base unit is known as digital
signal 0 (DS-0). The base unit derives from the original pulse code modulation (PCM)
method of encoding analog human voice signals into digital format.
Twenty-four DS-0s can be multiplexed into a DS-1 (also known as T1) with some framing
overhead. Likewise, DS-1s can be multiplexed into a DS-2, DS-2s into a DS-3 (also known
as T3) and DS-3s into a DS-4. Additional framing overhead is incurred at each level in the
hierarchy. There are several limitations to this hierarchy. One of the main problems is nonlinear framing overhead, wherein the percentage of bandwidth wasted on protocol overhead
increases as bandwidth increases. This can be somewhat mitigated by concatenation
techniques, but cannot be completely avoided. Other problems such as electromagnetic
interference (EMI) and grounding requirements stem from the electrical nature of the
transmission facilities. SONET was designed to overcome these and other DS hierarchy
limitations while providing a multi-gigabit wide area data transport infrastructure.
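The DS-1 figures above can be verified with simple arithmetic: 24 DS-0 timeslots carry 1.536 Mbps of payload, and the 1.544-Mbps T1 line rate adds 8 kbps of framing, or about 0.52 percent overhead:

```python
ds0 = 64_000                   # bps per DS-0 timeslot
ds1_payload = 24 * ds0         # 1,536,000 bps of multiplexed payload
ds1_line_rate = 1_544_000      # T1 line rate in bps
framing = ds1_line_rate - ds1_payload
overhead_pct = 100 * framing / ds1_line_rate

print(framing)                 # 8000 bps of framing
print(round(overhead_pct, 2))  # 0.52 percent, matching Table 1-1
```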
SONET can transport many data network protocols in addition to voice trafc originating
from the DS hierarchy. The base unit in SONET is the synchronous transport signal level 1
(STS-1), which refers to the electrical signal generated within the SONET equipment.
STS-1 operates at 51.84 Mbps. Each STS level has an associated optical carrier (OC) level,
which refers to the optical waveform generated by the optical transceiver. STS-1 and
OC-1 both refer to 51.84 Mbps transmission, but OC terminology is more common. OC-3
operates at three times the transmission rate of OC-1. Most SONET implementations are
OC-3 or higher. This supports transport of common LAN protocols (such as Fast Ethernet)
and enables aggregation of lower-speed interfaces (such as DS-3) for more efficient use of
long distance fiber infrastructure. The most popular SONET transmission rates are OC-3
(155.52 Mbps), OC-12 (622.08 Mbps), OC-48 (2.49 Gbps), and OC-192 (9.95 Gbps).
Many other OC signal levels are defined, but the communications industry generally
produces SONET products based on OC multiples of four.
The OC-3 frame structure is a modified version of the OC-1 frame. In fact, framing
differences exist between all SONET transmission rates. These differences permit
consistent framing overhead at all transmission rates. SONET has a framing overhead
penalty of 3.45 percent regardless of the transmission rate. At low rates, this overhead
seems excessive. For example, a T1 operates at 1.544 Mbps and has approximately 0.52
percent framing overhead. However, at higher rates, the efficiencies of SONET framing
become clear. For example, the framing overhead of a DS-3 is 3.36 percent. Considering
that the DS-3 transmission rate of 44.736 Mbps is approximately 14 percent slower than the
OC-1 transmission rate, it is obvious that the DS framing technique cannot efficiently scale
to gigabit speeds. Table 1-1 summarizes common DS hierarchy and SONET transmission
rates and framing overhead.
Table 1-1  Common DS Hierarchy and SONET Transmission Rates and Framing Overhead

Signal    Transmission Rate    Framing Overhead
DS-1      1.544 Mbps           0.52%
DS-3      44.736 Mbps          3.36%
OC-3      155.52 Mbps          3.45%
OC-12     622.08 Mbps          3.45%
OC-48     2.49 Gbps            3.45%
OC-192    9.95 Gbps            3.45%
For a signal to traverse great distances, the signal must be amplied periodically. For optical
signals, there are two general categories of amplifiers: in-fiber optical amplifiers (IOA) and
semiconductor optical amplifiers (SOA). The problem with amplification is that noise gets
amplified along with the signal. Part of the beauty of digital signal processing is that a
signal can be repeated instead of amplified. Noise is eliminated at each repeater, which
preserves the signal-to-noise ratio (SNR) and allows the signal to traverse extended
distances. However, to take advantage of signal repeating techniques, an optical signal must
be converted to an electrical signal, digitally interpreted, regenerated, and then converted
back to an optical signal for retransmission. An electro-optical repeater (EOR), which is a
type of SOA, is required for this. SONET uses EORs to extend the distance between
SONET equipment installations. A connection between two SONET equipment
installations is called a line in SONET terminology and a span in common terminology.
Using EORs, a SONET span can cover any distance. However, some data network
protocols have timeout values that limit the practical distance of a SONET span. The
connection between two EORs is called a section in SONET terminology and a link in
common terminology. Each SONET link is typically 50 km or less.
DWDM/CWDM
WDM refers to the process of multiplexing optical signals onto a single fiber. Each optical
signal is called a lambda (λ). It typically falls into the 1500–1600 nanometer (nm) range.
This range is called the WDM window. WDM allows existing networks to scale in
bandwidth without requiring additional fiber pairs. This can reduce the recurring cost of
operations for metropolitan- and wide-area networks significantly by deferring fiber
installation costs. WDM can also enable solutions otherwise impossible to implement in
situations where additional fiber installation is not possible.
Wavelength and frequency are bound by the following formula:
c = wavelength * frequency
where c is a constant (the speed of light in a vacuum); therefore, wavelength
cannot be changed without also changing frequency. Because of this, many people confuse
WDM with frequency division multiplexing (FDM). Two factors distinguish WDM from
FDM. First, FDM generally describes older multiplexing systems that process electrical
signals. WDM refers to newer multiplexing systems that process optical signals. Second,
each frequency multiplexed in an FDM system represents a single transmission source. By
contrast, one of the primary WDM applications is the multiplexing of SONET signals, each
of which may carry multiple transmissions from multiple sources via TDM. So, WDM
combines TDM and FDM techniques to achieve higher bandwidth utilization.
DWDM refers to closely spaced wavelengths; the closer the spacing, the higher the number
of channels (bandwidth) per fiber. The International Telecommunication Union (ITU)
G.694.1 standard establishes nominal wavelength spacing for DWDM systems. Spacing
options are specified via a frequency grid ranging from 12.5 gigahertz (GHz), which
equates to approximately 0.1 nm, to 100 GHz, which is approximately 0.8 nm. Many
DWDM systems historically have supported only 100 GHz spacing (or a multiple of
100 GHz) because of technical challenges associated with closer spacing. Newer DWDM
systems support spacing closer than 100 GHz. Current products typically support
transmission rates of 2.5-10 Gbps, and the 40-Gbps market is expected to emerge in 2006.
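The wavelength/frequency relationship given earlier lets you convert the ITU grid spacing between gigahertz and nanometers. Near 1550 nm, the approximation Δλ ≈ λ²Δf/c reproduces the figures above:

```python
C = 299_792_458  # speed of light in a vacuum, m/s

def spacing_nm(center_nm, spacing_ghz):
    """Approximate wavelength spacing for a given frequency spacing."""
    center_m = center_nm * 1e-9
    return (center_m ** 2) * (spacing_ghz * 1e9) / C * 1e9  # back to nm

print(round(spacing_nm(1550, 100), 2))   # ~0.8 nm for 100 GHz spacing
print(round(spacing_nm(1550, 12.5), 2))  # ~0.1 nm for 12.5 GHz spacing
```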
You can use two methods to transmit through a DWDM system. One of the methods is
transparent. This means that the DWDM system will accept any client signal without
special protocol mappings or frame encapsulation techniques. Using this method, a client
device is connected to a transparent interface in the DWDM equipment. The DWDM
devices accept the client's optical signal and shift the wavelength into the WDM window.
The shifted optical signal is then multiplexed with other shifted signals onto a DWDM
trunk. Some DWDM-transparent interfaces can accept a broad range of optical signals,
whereas others can accept only a narrow range. Some DWDM-transparent interfaces are
protocol aware, meaning that the interface understands the client protocol and can monitor
the client signal. When using the transparent method, the entire end-to-end DWDM
infrastructure is invisible to the client. All link-level operations are conducted end-to-end
through the DWDM infrastructure.
Using the second method, a client device is connected to a native interface in the DWDM
equipment. For example, a Fibre Channel switch port is connected to a Fibre Channel port
on a line card in a DWDM chassis. The DWDM device terminates the incoming client
signal by supporting the client's protocol and actively participating as an end node on the
client's network. For example, a Fibre Channel port in a DWDM device would exchange
low-level Fibre Channel signals with a Fibre Channel switch and would appear as a bridge
port (B_Port) to the Fibre Channel switch. This non-transparent DWDM transport service
has the benefit of localizing some or all link-level operations on each side of the DWDM
infrastructure. Non-transparent DWDM service also permits aggregation at the point of
ingress into the DWDM network. For example, eight 1-Gbps Ethernet (GE) ports could be
aggregated onto a single 10-Gbps lambda. The DWDM device must generate a new optical
signal for each client signal that it terminates. The newly generated optical signals are in
the WDM window and are multiplexed onto a DWDM trunk. Non-transparent DWDM
service also supports monitoring of the client protocol signals.
DWDM systems often employ IOAs. IOAs operate on the analog signal (that is, the optical
waveform) carried within the fiber. IOAs generally operate on signals in the 1530–1570 nm
range, which overlaps the WDM window. As the name suggests, amplification occurs
within the fiber. A typical IOA is a box containing a length of special fiber that has been
doped during manufacture with a rare earth element. The most common type of IOA is the
erbium-doped fiber amplifier (EDFA). The normal fiber is spliced into the special fiber on
each side of the EDFA. Contained within the EDFA is an optical carrier generator that
operates at 980 nm or 1480 nm. This carrier is injected into the erbium-doped fiber, which
excites the erbium. The erbium transfers its energy to optical signals in the 1530–1570 nm
range as they pass through the fiber, thus amplifying signals in the center of the WDM
window. IOAs can enable analog signals to travel longer distances than unamplified signals,
but noise is amplified along with the signal. The noise accumulates and eventually reaches
an SNR at which the signal is no longer recognizable. This limits the total distance per span
that can be traversed using IOAs. Fortunately, advancements in optical fibers, lasers, and
filters (also known as gratings) have made IOAs feasible for much longer distances than
previously possible. Unfortunately, much of the world's metropolitan and long-haul fiber
infrastructure was installed before these advancements were commercially viable. So, real-world
DWDM spans often are shorter than the theoretical distances supported by EDFA
technology. DWDM distances typically are grouped into three categories: inter-office
(0–300 km), long-haul (300–600 km), and extended long-haul (600–2000 km). Figure 1-9
shows a metropolitan area DWDM ring.
The operating principles of CWDM are essentially the same as DWDM, but the two are
quite different from an implementation perspective. CWDM spaces wavelengths farther
apart than DWDM. This characteristic leads to many factors (discussed in the following
paragraphs) that lower CWDM costs by an order of magnitude. CWDM requires no special
skill sets for deployment, operation, or support. Although some CWDM devices support
non-transparent service, transparent CWDM devices are more common.
Transparent CWDM involves the use of specialized gigabit interface converters (GBIC)
or small form-factor pluggable GBICs (SFP). These are called colored GBICs and SFPs
because each lambda represents a different color in the spectrum. The native GBIC or SFP
in the client device is replaced with a colored GBIC or SFP. The electrical interface in the
client passes signals to the colored GBIC/SFP in the usual manner. The colored GBIC/SFP
converts the electrical signal to an optical wavelength in the WDM window instead of the
optical wavelength natively associated with the client protocol (typically 850 nm or 1310 nm).
The client device is connected to a transparent interface in the CWDM device, and the
optical signal is multiplexed without being shifted. The colored GBIC/SFP negates the need
to perform wavelength shifting in the CWDM device. The network administrator must plan
the optical wavelength grid manually before procuring the colored GBICs/SFPs, and the
colored GBICs/SFPs must be installed according to the wavelength plan to avoid conflicts
in the CWDM device.
Figure 1-9 [figure: a DWDM metropolitan-area network ring connects Company A Site 1 and Site 2 via Gigabit Ethernet and carries telco voice circuits via SONET]
To the extent that client devices are unaware of the CWDM system, and all link-level
operations are conducted end-to-end, transparent CWDM service is essentially the same
as transparent DWDM service. Transparent CWDM mux/demux equipment is typically
passive (not powered). Passive devices cannot generate or repeat optical signals. Additionally,
IOAs operate in a small wavelength range that overlaps only three CWDM signals. Some
CWDM signals are unaffected by IOAs, so each CWDM span must terminate at a distance
determined by the unamplied signals. Therefore, no benet is realized by amplifying any
of the CWDM signals. This means that all optical signal loss introduced by CWDM mux/
demux equipment, splices, connectors, and the fiber must be subtracted from the launch
power of the colored GBIC/SFP installed in the client. Thus, the client GBIC/SFP determines
the theoretical maximum distance that can be traversed. Colored GBICs/SFPs typically are
TIP
IOAs may be used with CWDM if only three signals (1530 nm, 1550 nm, and 1570 nm) are
multiplexed onto the fiber.
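Because transparent CWDM components are passive and unamplified, the reach calculation described above is a simple loss budget: launch power minus receiver sensitivity, less every fixed loss, divided by per-kilometer fiber attenuation. The numbers below are illustrative assumptions, not vendor specifications:

```python
def max_span_km(launch_dbm, sensitivity_dbm, fixed_losses_db, fiber_db_per_km):
    """Unamplified optical reach from a simple loss budget."""
    budget_db = launch_dbm - sensitivity_dbm - fixed_losses_db
    return budget_db / fiber_db_per_km

# Illustrative figures only: 0 dBm launch, -28 dBm receiver sensitivity,
# 6 dB of mux/demux, splice, and connector loss, 0.25 dB/km attenuation.
print(max_span_km(0, -28, 6, 0.25))  # 88.0 km
```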
Most CWDM devices operate in the 1470–1610 nm range. The ITU G.694.2 standard
specifies the wavelength grid for CWDM systems. Spacing is given in nanometers, not
gigahertz. The nominal spacing is 20 nm. The sparse wavelength spacing in CWDM
systems enables lower product-development costs. Providing such a wide spacing grid
enables relaxation of laser tolerances, which lowers laser fabrication costs. Temperature
changes in a laser can change the wavelength of a signal passing through the laser. So, lasers
must be cooled in DWDM systems. The grid spacing in CWDM systems allows uncooled
lasers to be used because a wavelength can change moderately without being confused with
a neighbor wavelength. Uncooled lasers are less costly to fabricate. Last, optical filters can
be less discerning and still be effective with the wide spacing of the CWDM grid. This
lowers the cost of CWDM mux/demux equipment.
RPR/802.17
Cisco Systems originally developed dynamic packet transport (DPT) in 1999. The spatial
reuse protocol (SRP) is the basis of DPT. Cisco submitted SRP to the IETF, and it was
published as an informational RFC in 2000. SRP was submitted to the IEEE for consideration
as the basis of the 802.17 specification. The IEEE decided to combine components of SRP
with components of a competing proposal to create 802.17. Final IEEE ratification occurred
in 2004. The IETF has formed a working group to produce an RFC for IP over 802.17.
Technologies such as DPT and 802.17 are commonly known as RPRs.
Ethernet and SONET interworking is possible in several ways. The IEEE 802.17 standard
attempts to take interworking one step further by merging key characteristics of Ethernet
and SONET. Traditional service provider technologies (that is, the DS hierarchy and
SONET/SDH) are TDM-based and do not provide bandwidth efficiency for data networking.
Traditional LAN transport technologies are well suited to data networking, but lack some
of the resiliency features required for metropolitan area transport. Traditional LAN transport
technologies also tend to be suboptimal for multiservice traffic. The 802.17 standard attempts
to resolve these issues by combining aspects of each technology to provide a highly
resilient, data-friendly, multiservice transport mechanism that mimics LAN behavior
across metropolitan areas. Previous attempts to make SONET more LAN-friendly have
been less successful than anticipated. For example, ATM and packet over SONET (POS)
both bring data-friendly transport mechanisms to SONET by employing row-oriented
synchronous payload envelopes (SPE) that take full advantage of concatenated SONET
frame formats. However, each of these protocols has drawbacks. ATM is complex and has
excessive overhead, which fueled the development of POS. POS is simple and has low
overhead, but supports only the point-to-point topology. The 802.17 standard attempts to
preserve the low overhead of POS while supporting the ring topology typically used in
metropolitan-area networks (MAN) without introducing excessive complexity.
The IEEE 802.17 standard partially derives from Ethernet and employs statistical
multiplexing rather than TDM. The 802.17 data frame replaces the traditional Ethernet II
and IEEE 802.3 data frame formats, but is essentially an Ethernet II data frame with
additional header fields. The maximum size of an 802.17 data frame is 9218 bytes, which
enables efficient interoperability with Ethernet II and 802.3 LANs that support jumbo
frames. The 802.17 standard also defines two new frame formats for control and fairness.
The 802.17 standard supports transport of TDM traffic (for example, voice and video) via
a new fairness algorithm based on queues and frame prioritization. Optional forwarding of
frames with bad cyclic redundancy checks (CRC) augments voice and video playback
quality. The IEEE 802.17 standard employs a dual counter-rotating ring topology and
supports optical link protection switching via wrapping and steering. Wrapping provides
sub-50 millisecond (ms) failover if a fiber is cut, or if an RPR node fails. Unlike SONET,
the 802.17 standard suffers degraded performance during failure conditions because it uses
both rings during normal operation. By contrast, SONET maintains full bandwidth during
failures, but the price paid for this bandwidth guarantee is that only half of the total
bandwidth is used even when all fibers are intact and all nodes are operating normally.
The IEEE 802.17 standard scales to multi-gigabit speeds.
Despite the tight coupling that exists between IEEE 802.17 operations at Layer 2 of the OSI
reference model and optical operations at Layer 1, the 802.17 standard is independent of
the Layer 1 technology. The 802.17 standard currently provides physical reconciliation
sublayers for the SONET, GE, and 10-Gbps Ethernet (10GE) Physical Layer technologies.
This means the maximum distance per 802.17 span is determined by the underlying
technology and is not limited by Ethernet's carrier sense multiple access with collision
detection (CSMA/CD) algorithm. Of course, each span may be limited to shorter distances
imposed by upper-layer protocol timeout values. One of the design goals of the 802.17
specification is to support ring circumferences up to 2000 km.
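The use of both rings during normal operation lends itself to a small sketch. The steering rule below (send on whichever ringlet reaches the destination in fewer hops, which enables spatial reuse) is an illustrative simplification, and the node numbering is hypothetical rather than anything mandated by 802.17.

```python
# Choosing a ringlet on a dual counter-rotating ring (illustrative).
# Ringlet 0 carries traffic clockwise; ringlet 1 counter-clockwise.
# Picking the shorter direction leaves the rest of both rings free
# for other node pairs (spatial reuse).

def pick_ringlet(src, dst, nodes):
    cw = (dst - src) % nodes           # hops going clockwise
    ccw = (src - dst) % nodes          # hops going counter-clockwise
    return (0, cw) if cw <= ccw else (1, ccw)

ringlet, hops = pick_ringlet(1, 6, 8)  # 8-node ring, node 1 to node 6
print(ringlet, hops)
```

On a fiber cut, wrapping would instead loop traffic back onto the surviving ringlet at the nodes adjacent to the failure.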
by all who read this book, but it is useful as an introduction to the concept of storage
virtualization.
Storage can be virtualized in two ways: in-band or out-of-band. In-band techniques insert
a virtualization engine into the data path. Possible insertion points are an HBA, a switch, a
specialized server, a storage array port, or a storage array controller. All I/O traffic passes
through the in-band virtualization engine. Out-of-band techniques involve proprietary host
agents responsible for redirecting initial I/O requests to a metadata/mapping engine, which
is not in the data path. If an I/O request is granted, all subsequent I/O traffic associated with
that request goes directly from the host to the storage device. There is much debate about
which approach is best, but each has its pros and cons. Hybrid solutions are also available.
For more information about the various types of virtualization solutions, consult SNIA's
storage virtualization tutorials.
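The out-of-band flow described above can be sketched as follows. The class names and the single-mapping resolution are hypothetical simplifications, not any vendor's interface.

```python
# Out-of-band virtualization (illustrative): a host agent consults a
# metadata/mapping engine once per request (control path); granted I/O
# then goes directly to the physical device, bypassing the engine.

class MetadataEngine:
    """Holds logical-to-physical mappings; never sits in the data path."""
    def __init__(self, mappings):
        self.mappings = mappings              # logical volume -> (device, base LBA)

    def resolve(self, logical_volume, lba):
        device, base = self.mappings[logical_volume]
        return device, base + lba             # physical address for direct I/O

class HostAgent:
    def __init__(self, engine):
        self.engine = engine

    def read(self, logical_volume, lba):
        # Initial request is redirected to the engine...
        device, physical_lba = self.engine.resolve(logical_volume, lba)
        # ...but subsequent data transfer goes straight to the device.
        return device, physical_lba

engine = MetadataEngine({"vol0": ("array1", 1000)})
print(HostAgent(engine).read("vol0", 42))
```

An in-band engine, by contrast, would sit between the agent and the device and forward every I/O itself.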
You can use many techniques to virtualize storage. An example of a file-oriented technique is
to present a portion of a local volume to clients on a network. The clients see an independent
volume and do not know it is part of a larger volume. Another file-oriented technique grafts
multiple volumes together to present a single volume to local applications. All UNIX and
Linux file systems use this technique to create a unified file name space spanning all volumes.
Virtualization also applies to block-oriented environments. With striping techniques, the
individual blocks that compose a logical volume are striped across multiple physical disks.
These are sometimes referred to as stripe sets. Alternately, contiguous blocks can be grouped
into disjoint sets. The sets can reside on different physical disks and are called extents when
concatenated to create a single logical volume. Extents are written serially, so all blocks
in one set are filled before writing to the blocks of another set. Sometimes a combination of
striping and extent-based allocation is used to construct a logical volume. Figure 1-10
illustrates the difference between a stripe set and an extent.
Figure 1-10 Stripe Set Versus Extent
Stripe-Based Logical Volume (blocks distributed round-robin across three disks):
Disk 1: Block 1, Block 4, Block 7
Disk 2: Block 2, Block 5, Block 8
Disk 3: Block 3, Block 6, Block 9

Extent-Based Logical Volume (each extent filled completely before the next is used):
Disk 1: Block 1, Block 2, ... Block 35 (full)
Disk 2: Block 36, Block 37, ... Block 87 (full)
Disk 3: Block 88, Block 89, ...
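The two block layouts in Figure 1-10 can be expressed as simple mapping functions. The three-disk configuration and extent size below mirror the figure rather than any particular product, and logical blocks are numbered from 0 for simplicity.

```python
# Mapping a logical block number to (disk, offset) under the two
# allocation schemes shown in Figure 1-10 (sizes are illustrative).

def striped(logical_block, disks=3):
    # Round-robin: consecutive blocks land on consecutive disks.
    return logical_block % disks, logical_block // disks

def extent_based(logical_block, blocks_per_extent=35, disks=3):
    # Serial fill: an extent is filled completely before the next is used.
    extent = logical_block // blocks_per_extent
    return extent % disks, logical_block % blocks_per_extent

print(striped(4))        # block 5 in the figure's 1-based numbering
print(extent_based(36))  # falls in the second extent, on the second disk
```

Mirroring, mentioned later in this chapter, is simpler still: the same block is written to the same offset on two disks.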
The benefits of abstraction are many, but the most compelling include reduced host
downtime for operational changes, ease of capacity expansion, the ability to introduce new
physical device technologies without concern for operating system or application backward
compatibility, a broader choice of migration techniques, advanced backup procedures, and
new disaster recovery solutions. There are several ways to virtualize physical storage
resources, and three general categories of implementations:
Host-based
Storage subsystem-based
Network-based
Each has its pros and cons, though network-based implementations seem to have more pros
than cons for large-scale storage environments. Enterprise-class virtualization products have
appeared in the market recently after years of delays. However, one could argue that enterprise-class virtualization has been a reality for more than a decade if the definition includes
redundant array of inexpensive disks (RAID) technology, virtual arrays that incorporate
sophisticated virtualization functionality beyond RAID, or host-based volume management
techniques. One can think of the new generation of enterprise-class virtualization as the
culmination and integration of previously separate technologies such as hierarchical storage
management (HSM), volume management, disk striping, storage protocol conversion, and so
on. As the new generation of virtualization products matures, the ability to seamlessly and
transparently integrate Fibre Channel and iSCSI networks, SCSI and ATA storage subsystems,
disk media, tape media, and so on, will be realized. Effectiveness is another story. That might
rely heavily on policy-based storage management applications. Automated allocation and
recovery of switch ports, logical unit numbers (LUN), tape media, and the like will become
increasingly important as advanced virtualization techniques become possible.
Host Implementations
Host-based virtualization products have been available for a long time. RAID controllers
for internal DAS and just-a-bunch-of-disks (JBOD) external chassis are good examples of
hardware-based virtualization. RAID can be implemented without a special controller, but
the software that performs the striping calculations often places a noticeable burden on the
host CPU. Linux now natively supports advanced virtualization functionality such as
striping and extending via its logical volume manager (LVM) utility. Nearly every modern
operating system on the market natively supports mirroring, which is a very simplistic form
of virtualization. Mirroring involves block duplication on two physical disks (or two sets of
physical disks) that appear as one logical disk to applications. The software virtualization
market offers many add-on products from non-operating system vendors.
Host-based implementations really shine in homogeneous operating system environments
or companies with a small number of hosts. In large, heterogeneous environments, the
number of hosts and variety of operating systems increase the complexity and decrease the
efficiency of this model. The host-oriented nature of these solutions often requires different
software vendors to be used for different operating systems. Although this prevents storage
from being virtualized in a consistent manner across the enterprise, storage can be virtualized
in a consistent manner across storage vendor boundaries for each operating system.
Large-scale storage consolidation often is seen in enterprise environments, which results in
a one-to-many relationship between each storage subsystem and the hosts. Host-based
virtualization fails to exploit the centralized storage management opportunity and imposes
Network Implementations
The widespread adoption of switched FC-SANs has enabled a new virtualization model.
Implementing virtualization in the network offers some advantages over subsystem and
host implementations. Relative independence from proprietary subsystem-based solutions
and host operating system requirements enables the storage administrator to virtualize
storage consistently across the enterprise. A higher level of management centralization is
realized because there are fewer switches than storage subsystems or hosts. Logical disks
can span multiple physical disks in separate subsystems. Other benets derive from the
transparency of storage as viewed from the host, which drives heterogeneity in the storage
subsystem market and precipitates improved interoperability. However, some storage
vendors are reluctant to adapt their proprietary host-based failover mechanisms to these
new network implementations, which might impede enterprise adoption.
Summary
This chapter presents a high-level view of storage networking technologies and select
related technologies. We provide insight into the history of storage to elucidate the current
business drivers in storage networking. Several types of storage networking technologies
are discussed, including open systems file-level, block-level, and backup protocols, and
mainframe block-level protocols. We also provide a cursory introduction to the mainstream
optical technologies to familiarize you with the long-distance storage connectivity options.
We introduce storage virtualization and briefly compare the various models. Figure 1-11
depicts some of the technologies discussed in this chapter as they relate to each other.
Figure 1-11 High Level View of Storage Networking
The figure depicts several interconnected environments: iSCSI remote storage access;
highly scalable storage networks built around Cisco MDS 9506 FC switches; an
iSCSI-enabled storage network using a Cisco SN5428-2 storage router and Catalyst
switches; NAS clients and filers; a multiprotocol/multiservice SONET/SDH network built
on ONS 15454 and ONS 15327 platforms; a metropolitan DWDM ring (ONS 15540)
carrying synchronous replication of FC over DWDM between Cisco MDS 9216 and
MDS 9509 switches with IPS-8 modules; and intelligent workgroup storage networks of
FC devices.
You can find more information on the topics discussed in this chapter and other related
topics in standards documents such as IETF RFCs, ANSI T10 and T11 specifications, ITU
G series recommendations, and IEEE 802 series specifications. Appendix A lists the
specifications most relevant to the topics covered in this book. The websites of industry
consortiums and product vendors also contain a wealth of valuable information. Also, the
Cisco Systems website contains myriad business and technical documents covering storage
technologies, networking technologies, and Cisco's products in a variety of formats
including white papers, design guides, and case studies. Last, the Cisco Press storage
networking series of books provides comprehensive, in-depth coverage of storage
networking topics for readers of all backgrounds.
The subsequent chapters of this book explore select block-oriented, open-systems
technologies. The book uses a comparative approach to draw parallels between the
technologies, which allows each reader to leverage personal experience with any of the
technologies to more readily understand the others. The OSI reference model is used to
facilitate this comparative approach.
Review Questions
1 What company invented the modern hard disk drive?
2 Who published the SMI-S?
3 List two competitive advantages enabled by storage networks.
4 What is an IP-SAN?
5 What block-level storage protocol is commonly used in desktop and laptop PCs?
6 What is the latest version of the SCSI protocol?
7 What is the term for an adapter that processes TCP packets on behalf of the host CPU?
8 List the two types of IBM channel protocols.
9 An IBM ESCON director employs what type of switching mechanism?
10 A Fibre Channel switch employs what type of switching mechanism?
11 What company originally invented CIFS?
12 What standards body made NFS an industry standard?
13 What type of multiplexing is used by SONET/SDH?
14 What is the WDM window?
15 List one reason the sparse wavelength spacing of CWDM enables lower product costs.
16 In what class of networks is IEEE 802.17?
17 List the two categories of block-oriented virtualization techniques.
Describe the layers of the Open Systems Interconnection (OSI) reference model
NOTE
Notice the acronym ISO does not align with the full name of the organization. The ISO
wanted a common acronym to be used by all nations despite language differences.
International Organization for Standardization translates differently from one language to
the next, so the acronym also would vary from one language to the next. To avoid that
problem, they decided that the acronym ISO would be used universally regardless of
language. ISO derives from the Greek word isos, meaning equal. As a result of choosing
this acronym, the organization's name is often documented as International Standardization
Organization, which is not accurate.
The OSI reference model does not specify protocols, commands, signaling techniques, or
other such technical details. Those details are considered implementation-specific and are
left to the standards body that defines each specification. The OSI reference model provides
common definitions of terms, identifies required network functions, establishes relationships
between processes, and provides a framework for the development of open standards that
specify how to implement the functions or a subset of the functions defined within the
model. Note that the ISO provides separate protocol specifications based on the OSI
reference model. Those protocol specifications are not part of the OSI reference model.
Seven-Layer Model
The OSI reference model comprises seven layers. Each layer may be further divided into
sublayers to help implement the model. The common functionality that must be supported
for a network to provide end-to-end communication between application processes is
broken down into these seven layers. Within each layer, the services that should be
implemented to achieve a given level of functionality are defined. Each layer provides
services to the layer above and subscribes to the services of the layer below. Each layer
accepts data from its upper-layer neighbor and encapsulates it in a header and optional
trailer. The result is called a protocol data unit (PDU). The PDU is then passed to the
lower-layer neighbor. A PDU received from an upper layer is called a service data unit
(SDU) by the receiving layer. The SDU is treated as data and is encapsulated into a PDU.
This cycle continues downward through the layers until the physical medium is reached.
The logical interface between the layers is called a service access point (SAP). Multiple
SAPs can be defined between each pair of layers. Each SAP is unique and is used by a
single upper-layer protocol as its interface to the lower-layer process. This enables
multiplexing and demultiplexing of upper-layer protocols into lower-layer processes. If a
networking technology operates at a single layer, it depends upon all lower layers to
provide its services to the layer above. In this way, the layers build upon each other to
create a fully functional network model. An implementation of the model is often called a
protocol stack because of this vertical layering. Figure 2-1 shows the seven OSI layers in
relation to each other.
The functions defined within each layer follow:
Figure 2-1 (each layer acts as a provider to the subscriber layer above it)

Application Software
Layer 7: Application
Layer 6: Presentation
Layer 5: Session
Layer 4: Transport
Layer 3: Network
Layer 2: Data-Link
Layer 1: Physical
Media
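The encapsulation cycle described above (an upper layer's PDU arriving as an SDU, then being wrapped in a new header) can be sketched as follows. The layer names are from the model, but the header contents are hypothetical.

```python
# Each layer treats what it receives as an SDU (opaque data) and wraps
# it in its own header to form that layer's PDU, continuing downward
# until the physical medium is reached. Trailers are omitted here.

def encapsulate(data, layers):
    pdu = data
    for layer in layers:               # walk down the stack
        sdu = pdu                      # upper layer's PDU is this layer's SDU
        pdu = f"[{layer}-hdr]" + sdu   # header prepended to form the new PDU
    return pdu

stack = ["transport", "network", "data-link"]
print(encapsulate("app-data", stack))
```

The receiving node reverses the process, with each layer stripping its own header before handing the SDU upward through the appropriate SAP.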
Details such as network-level node addressing (also known as logical node addressing),
network addressing to identify each Layer 2 network, methods for summarization of
network addresses, methods for network address discovery, algorithms for path determination,
packet types, packet formats, packet sequences, maximum and minimum packet sizes,
network-level flow-control mechanisms, network-level quality of service mechanisms,
handshakes and negotiation methods between Layer 3 entities for support of
optional functions, packet and protocol error detection, recovery, and notification
procedures, logical topology, and network protocol timeout values are specified within
this layer.
Routers operate at this layer. Network layer protocols can operate in connectionless mode
or connection-oriented mode. Network layer connections determine the end-to-end path
and facilitate delivery guarantees within an internetwork. So the procedures for establishment,
maintenance, and teardown of connections between pairs of routers that implement
connection-oriented network layer protocols are specified within this layer.
TIP
The phrase packet switching is often used to generically describe Layer 3 and Layer 2
devices. This sometimes causes confusion because Layer 2 data units are called frames or
cells, not packets.
Implementation Considerations
It is possible for end-to-end communication to occur between two nodes that implement
only a subset of these layers. The network functionality provided would be less than that
achievable by implementing all layers, but might be sufficient for certain applications.
Each layer implemented in one network node must also be implemented in a peer node
for communication to function properly between that pair of nodes. This is because each
layer communicates with its peer layer implemented in another network node. The OSI layer
peer relationships are illustrated in Figure 2-2.
Figure 2-2 (OSI layer peer relationships)

Host A and Host B each implement application software plus all seven layers. The
intermediate devices implement only lower layers: the switches implement the physical
and data-link layers, and the router implements the physical, data-link, and network
layers. Each layer in Host A communicates with its peer layer in Host B; the network
layer also peers with the router, and the data-link and physical layers peer hop by hop
across the path Host A–Switch–Router–Switch–Host B.
Strict adherence to the OSI reference model is not required of any networking
technology. The model is provided only as a reference. Many networking technologies
implement only a subset of these layers and provide limited network functionality, thus
relying on the complement of other networking technologies to provide full functionality.
This compartmentalization of functionality enables a modular approach to the
development of networking technologies and facilitates interoperability between
disparate technologies. An example of the benets of this approach is the ability for a
connection-oriented physical layer technology to be used by a connectionless data-link
layer technology. Some storage networking technologies implement only a subset of
functions within one or more of the layers. Those implementation details are discussed
in the following sections.
SCSI Bus Interface and the ANSI T10 SCSI-3 Architecture Model
The SCSI-3 Architecture Model (SAM) groups all networking technologies into a layer called SCSI Interconnects. Likewise, all
protocol mappings that enable SCSI-3 to be transported via SCSI Interconnects are grouped
into a single layer called SCSI Transport Protocols. Consequently, the two SAM network
layers do not map neatly to the OSI layers. Depending on the actual Interconnect used,
the mapping to OSI layers can include one or more layers from the physical layer through the
transport layer. Likewise, depending on the actual SCSI Transport Protocol used, the mapping
to OSI layers can include one or more layers from the session layer through the application
layer. As mentioned in Chapter 1, Overview of Storage Networking, SCSI-3 is a command/
response protocol used to control devices. This can be contrasted with communication
protocols, which provide network functionality. Thus, the SCSI-3 command sets compose
the third layer of the SAM, which resides above the other two layers. The SCSI-3 command
sets are clients to the SCSI Transport Protocols. The SCSI Interconnect and SCSI transport
layers are collectively referred to as the SCSI service delivery subsystem. The SCSI-3
command sets are collectively referred to as the SCSI application layer (SAL).
The SCSI parallel bus interfaces are defined by the ANSI T10 subcommittee via the SCSI
parallel interface (SPI) series of specifications. The SPI specifications are classified as
Interconnects in the SAM. When using an SPI, SCSI commands are transmitted via the
Interconnect without the use of a separate SCSI Transport Protocol. The interface between
the SCSI command sets and the SPI is implemented as a procedure call similar to the
interface between the SCSI command sets and SCSI Transport Protocols.
Each variation of the SPI is a multidrop interface that uses multiple electrical signals carried
on physically separate wires to control communication. Such control is normally accomplished
via the header of a serialized frame or packet protocol. Some of the other SPI functions
include signal generation, data encoding, physical node addressing, medium arbitration,
frame generation, and frame error checking. Frame formats are defined for signals on the
data wires, one of which contains a data field followed by an optional pad field followed by
a cyclic redundancy check (CRC) field. So, the SPI essentially provides OSI physical layer
functionality plus a subset of OSI data-link layer functionality. All devices connect directly
to a single shared physical medium to create the multidrop bus, and interconnection of
buses is not supported. So, there is no need for OSI data-link layer bridging functionality.
The concept of OSI data-link layer connection-oriented communication is not applicable to
non-bridged environments. Moreover, upper-layer protocol multiplexing is not required
because only the SCSI command sets use the SPI. So, many of the services defined within
the OSI data-link layer are not required by the SPI. Figure 2-3 compares the SAM to the
OSI reference model, lists some specifications, and shows how the SPI specifications map
to the OSI layers.
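The data/pad/CRC frame layout mentioned above can be illustrated generically. The 4-byte pad alignment and the use of CRC-32 below are illustrative choices, not details taken from the SPI specifications, which define their own formats and polynomial.

```python
import zlib

# Generic frame with a data field, optional pad field, and CRC field.

def build_frame(data: bytes, align: int = 4) -> bytes:
    pad = b"\x00" * (-len(data) % align)            # pad to a 4-byte boundary
    crc = zlib.crc32(data + pad).to_bytes(4, "big") # CRC covers data + pad
    return data + pad + crc

def check_frame(frame: bytes) -> bool:
    body, crc = frame[:-4], frame[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == crc

frame = build_frame(b"scsi-data")
print(check_frame(frame))                  # True
print(check_frame(b"\xff" + frame[1:]))    # corrupted frame: False
```

This is the error-checking half of the story; the SPI carries the equivalent control information on dedicated wires rather than in serialized headers.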
Figure 2-3
In the figure, the OSI reference model (application software above the application,
presentation, session, transport, network, data-link, and physical layers) is shown beside
the SCSI-3 Architecture Model: the SCSI Application Layer (the SCSI-3 implementations)
corresponds to the upper OSI layers, the SCSI Transport Protocols correspond to the OSI
session through application layers, and the SCSI Interconnects correspond to the OSI
physical through transport layers.
is omitted from the PDU. The 802.3 service then provides full OSI physical layer functionality
plus limited OSI data-link layer functionality. The type field enables identification of the
intended upper-layer protocol at the destination host (also known as the destination EtherType).
This is important because it enables demultiplexing of OSI network layer protocols, which
is a subset of the functionality provided by the 802.2 header. Figure 2-4 compares the
IEEE 802 reference model to the OSI reference model and lists the relevant Ethernet
specifications.
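The EtherType demultiplexing described above can be sketched as a simple dispatch table. The handler functions are hypothetical, but 0x0800 and 0x0806 are the registered EtherTypes for IPv4 and ARP.

```python
# Demultiplexing received frames to OSI network layer protocols
# by the EtherType value carried in the frame header.

def handle_ipv4(payload): return "ipv4:" + payload.hex()
def handle_arp(payload):  return "arp:" + payload.hex()

DISPATCH = {
    0x0800: handle_ipv4,   # IPv4
    0x0806: handle_arp,    # ARP
}

def demux(ethertype: int, payload: bytes) -> str:
    handler = DISPATCH.get(ethertype)
    return handler(payload) if handler else "dropped: unknown EtherType"

print(demux(0x0800, b"\x45\x00"))
print(demux(0x9999, b""))
```

The 802.2 LLC header provides the same demultiplexing service (and more) via SAP values when no EtherType is present.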
Figure 2-4

In the IEEE 802 reference model, the OSI data-link layer is divided into two sublayers:
LLC (802.2) above MAC (802.3 and 802.3ae Ethernet), with bridging specified in 802.1D,
802.1G, 802.1H, and 802.1Q. The OSI physical layer corresponds to the IEEE 802 physical
layer. The OSI network layer and above are outside the scope of the IEEE 802 reference
model.
NOTE
IEEE specification names are case sensitive. For example, 802.1q is not the same as
802.1Q. Lower-case letters indicate an amendment to an existing standard, whereas
upper-case letters indicate a full standard that might or might not incorporate various
amendments.
ICMP, ARP, and IP are specified in IETF RFCs 792, 826, and 791 respectively, and all three
operate at the OSI network layer. TCP and UDP are specified in IETF RFCs 793 and 768
respectively, and both operate at the OSI transport layer. Of course, there are many routing
protocols in the TCP/IP suite, such as Enhanced Interior Gateway Routing Protocol (EIGRP),
Open Shortest Path First (OSPF), and Border Gateway Protocol (BGP). All IP routing
protocols operate at the OSI network layer.
The principal block storage protocols of the TCP/IP suite are routed protocols that operate
on TCP. These include Internet SCSI (iSCSI), Fibre Channel over TCP/IP (FCIP), and
Internet Fibre Channel Protocol (iFCP), defined in IETF RFCs 3720, 3821, and 4172
respectively. The iSCSI protocol enables delivery of SCSI-3 traffic by providing functionality
equivalent to the OSI session, presentation, and application layers. Though iSCSI spans the
top three OSI layers, many of the upper-layer protocols in the TCP/IP suite map closely to
just one or two of the upper OSI layers. Examples include FCIP and iFCP, each of which
maps to the OSI session layer. The Common Internet File System (CIFS) and Network File
System (NFS) file-level protocols discussed in Chapter 1 also run on TCP. CIFS maps to
the OSI application and presentation layers and makes use of NetBIOS at the OSI session
layer. NFS spans the top three OSI layers. Figure 2-5 compares the ARPANET model to
the OSI reference model and lists the principal protocols of the TCP/IP suite.
Figure 2-5
The ARPANET model's application layer corresponds to the OSI application, presentation,
and session layers; its transport layer corresponds to the OSI transport layer; its internet
layer corresponds to the OSI network layer; and its network interface layer corresponds to
the OSI data-link and physical layers.
The five layers of the Fibre Channel model are FC-0, FC-1, FC-2, FC-3, and FC-4. Whereas the
Fibre Channel model has not changed, the Fibre Channel architecture has undergone
numerous changes. The Fibre Channel architecture is currently defined by a large series of
specifications published primarily by the ANSI T11 subcommittee.
The Fibre Channel specifications do not map neatly to the Fibre Channel model. The
details of how to implement the functionality defined within each layer of the Fibre
Channel model are spread across multiple specification documents. That said, a close
look at the Fibre Channel specifications reveals that considerable end-to-end functionality
is supported. So, an accurate description is that the Fibre Channel architecture operates
at the OSI physical, data-link, transport, session, presentation, and application layers.
Extrapolating this information to better understand the Fibre Channel model indicates the
following:
Originally, the FC-PH series of documents specified much of Fibre Channel's functionality
and spanned multiple OSI layers. Later specifications separated portions of the OSI
physical layer functionality into a separate series of documents. The OSI physical layer
functionality of Fibre Channel is now principally specified in the FC-PI and 10GFC series.
The OSI data-link layer functionality is now principally specified in the FC-FS, FC-DA,
FC-SW, FC-AL, and FC-MI series. There is no OSI network layer functionality inherent to
Fibre Channel, but the FC-BB series provides various methods to leverage external
networking technologies, some of which operate at the OSI network layer.
From the Fibre Channel perspective, external networks used to interconnect disparate Fibre
Channel storage area networks (FC-SAN) are transparent extensions of the OSI data-link
layer service. A subset of the OSI transport layer functionality is provided via the N_port
login (PLOGI) and N_port logout (LOGO) mechanisms as specified in the FC-FS, FC-DA,
and FC-LS series. A subset of the OSI session layer functionality is provided via the process
login (PRLI) and process logout (PRLO) mechanisms as specified in the FC-FS, FC-DA,
and FC-LS series. Protocol mappings enable Fibre Channel to transport many types of
traffic, including OSI network layer protocols (for example, IP), application-level protocols
(for example, SCSI-3), and data associated with application-level services (for example,
Fibre Channel Name Service). These mappings provide a subset of the OSI presentation
layer and application layer functionality. Most of the mappings are defined by the ANSI
T11 subcommittee, but some are defined by other organizations. For example, the
Single-Byte Command Code Set (SBCCS) mappings are defined in the ANSI T11
subcommittee's FC-SB series, but the SCSI-3 mappings are defined in the ANSI T10
subcommittee's FCP series. Figure 2-6 compares the Fibre Channel model to the OSI
reference model and lists the Fibre Channel specifications most relevant to storage
networking.
Figure 2-6
In the figure, FC-4 corresponds to the OSI application and presentation layers. FC-2 and
FC-3 (specified in FC-FS and FC-DA) span the OSI session, transport, and data-link
layers; Fibre Channel has no inherent OSI network layer. FC-1 (FC-FS, 10GFC) spans the
boundary of the OSI data-link and physical layers, and FC-0 (FC-PI, 10GFC) corresponds
to the OSI physical layer.
Summary
The OSI reference model facilitates modular development of communication products,
thereby reducing time to market for new products and facilitating faster evolution of
existing products. Standards organizations often leverage the OSI reference model in their
efforts to specify implementation requirements for new technologies. The ISO does not
guarantee interoperability for OSI-compliant implementations, but leveraging the OSI
reference model does make interoperability easier and more economical to achieve. This is
increasingly important in the modern world of networking.
The protocols discussed in this chapter do not all map to the OSI reference model in a neat
and clear manner. However, the functionality provided by each protocol is within the scope
of the OSI reference model. The most important things to understand are the concepts
presented in the OSI reference model. That understanding will serve you throughout your
career as new networking technologies are created. The OSI reference model provides
network engineers with the proverbial yardstick required for understanding all the various networking technologies in a relative context. The OSI reference model also brings clarity to
the complex interrelation of functionality inherent to every modern networking technology.
The OSI reference model is used throughout this book to maintain that clarity.
Review Questions
1 How many layers are specied in the OSI reference model?
2 The data-link layer communicates with which OSI layer in a peer node?
3 What is the OSI term for the interface between vertically stacked OSI layers?
4 How many OSI layers operate end-to-end? List them.
5 Which OSI layer is responsible for bit-level data encoding?
6 Which is the only OSI layer not inherent to the Fibre Channel architecture?
7 Create a mnemonic device for the names and order of the OSI layers.
Discuss the history of the Small Computer System Interface (SCSI) Parallel Interface
(SPI), Ethernet, TCP/IP, and Fibre Channel (FC)
Explain the difference between baud, raw bit, and data bit rates
Quantify the actual throughput available to SCSI via the SPI, Internet SCSI (iSCSI),
and Fibre Channel Protocol (FCP)
State which logical topologies are supported by the SPI, Ethernet, IP, and FC
Define the basic techniques for service and device discovery
Describe the discovery mechanisms used in SPI, Ethernet, TCP/IP, and FC
environments
CHAPTER 3
Conceptual Underpinnings
This section provides the foundational knowledge required to understand throughput,
topologies, and discovery techniques.
Throughput
When discussing the throughput of any frame-oriented protocol that operates at OSI
Layers 1 and 2, it is important to understand the difference between baud rate, raw bit
rate, data bit rate, and upper layer protocol (ULP) throughput rate. The carrier signal is
the natural (unmodified) electrical or optical signal. The baud rate is the rate at which the
carrier signal is artificially modulated (transformed from one state to another state).
Multiple carrier signal transitions can occur between each pair of consecutive artificial
state transitions. Baud rate is inherently expressed per second. Bits received from OSI
Layer 2 must be encoded into the signal at OSI Layer 1. The number of bits that can be
encoded per second is the raw bit rate. The encoding scheme might be able to represent
more than one bit per baud, but most schemes represent only one bit per baud. So, the
baud rate and the raw bit rate are usually equal. However, the bits encoded often include
control bits that were not received from OSI Layer 2. Control bits typically are inserted
by the encoding scheme and are used to provide clocking, maintain direct current (DC)
balance, facilitate bit error detection, and allow the receiver to achieve byte or word
alignment. The number of raw bits per second minus the number of control bits per
second yields the data bit rate. This is the number of bits generated at OSI Layer 2 that
can be transmitted per second. The data bit rate includes OSI Layer 2 framing bits
and payload bits. The ULP throughput rate is the number of payload bits that can be
transmitted per second. So, the number of data bits per second minus the number of
framing bits per second yields the ULP throughput rate.
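The rate chain just described can be made concrete with a small sketch. The example numbers below (1.25 GBaud, one bit per baud, 2 control bits per 10 raw bits, 2.5 percent framing overhead) are illustrative assumptions, not values taken from any particular specification:

```python
def ulp_throughput(baud_rate, bits_per_baud, control_fraction, framing_fraction):
    """Walk the rate chain: baud -> raw bit -> data bit -> ULP throughput.

    control_fraction: share of raw bits consumed as control bits
                      (e.g., 2/10 for an 8-bits-in-10 encoding)
    framing_fraction: share of data bits consumed as Layer 2 framing
    """
    raw_bit_rate = baud_rate * bits_per_baud               # bits encoded per second
    data_bit_rate = raw_bit_rate * (1 - control_fraction)  # subtract control bits
    return data_bit_rate * (1 - framing_fraction)          # subtract framing bits

# Hypothetical link: 1.25 GBaud, 1 bit per baud, 20% control overhead,
# 2.5% framing overhead -> roughly 975 Mbps available to ULPs.
print(ulp_throughput(1.25e9, 1, 2 / 10, 0.025))
```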
Topologies
There are too many design implications associated with each physical topology for this
chapter to cover the subject exhaustively. So, this section merely introduces the physical
topologies by providing a brief discussion of each. The general points discussed in this
section are equally applicable to all networking technologies including the SPI, Ethernet,
IP, and Fibre Channel. Note that for any given topology, there might be many names.
Each community of network technologists seems to prefer a different name for each
topology.
Context should be considered when discussing topology. Discussions of very small-scale networks often include the end nodes in the topological context (see the star and
linear paragraphs). When discussing medium- and large-scale networks, communication
between network devices is the primary topological concern. So, end nodes usually
are excluded from that topological context. This section discusses topologies in both
contexts.
Another important point regarding topology is perspective. Link-state protocols based on
the Dijkstra algorithm create a logical tree topology in which each network device sees
itself as the root of the tree. By contrast, Ethernet's Spanning Tree algorithm creates a tree
topology in which every switch recognizes the same switch as the root of the tree.
Perspective partially determines the behavior of a network device in a given topology.
Additionally, the physical topology (cabling) and logical topology (communication model)
often are not the same. This can cause confusion when discussing certain physical
topologies in the context of certain protocols.
There are six types of physical topologies:
Star/hub-and-spoke
Linear/bus/cascade
Circular/ring/loop
Tree
Partial mesh
Full mesh
Any of these six physical topologies can be combined to create a hybrid topology.
Figures 3-1 through 3-6 illustrate an example of each physical topology. Figure 3-7
illustrates the star topology versus the collapsed ring/loop topology.
Figure 3-1  Star/Hub-and-Spoke Topology
Figure 3-2  Linear/Bus/Cascade Topology
Figure 3-3  Circular/Ring/Loop Topology
Figure 3-4  Tree Topology
Figure 3-5  Partial Mesh Topology
Figure 3-6  Full Mesh Topology
Figure 3-7  Star Topology Versus Collapsed Ring/Loop Topology
[Figure 3-7: an Ethernet hub (star topology) whose backplane repeats each port's transmit
signal to every other port's receive path, shown beside a collapsed ring/loop device whose
port-bypass circuitry passively connects the transmit wire of each port to the receive wire
of the next port.]
In the star topology, all devices are connected through a single, central network element
such as a switch or hub. Without consideration for the end nodes, the star topology is just a
single network device. Upon considering the end nodes, the star shape becomes apparent.
On the surface, the collapsed ring topology in Figure 3-7 also appears to have a star shape,
but it is a circular topology. A collapsed ring/collapsed loop topology is merely a circular
topology cabled into a centralized device.
The star topology differs from the collapsed ring/loop topology in the way signals are
propagated. The hub at the center of a star topology propagates an incoming signal to every
outbound port simultaneously. Such a hub also boosts the transmit signal. For this reason,
such hubs are most accurately described as multiport repeaters. By contrast, the multi-access unit (MAU) at the center of the collapsed ring in Figure 3-7 passively connects the
transmit wire of one port to the receive wire of the next port. The signal must propagate
sequentially from node to node in a circular manner. So, the unqualied terms hub and
concentrator are ambiguous. Ethernet hubs are multiport repeaters and support the star
topology. Token ring MAUs, FDDI concentrators, and Fibre Channel arbitrated loop
(FCAL) hubs are all examples of collapsed ring/loop devices. By collapsing the ring/loop
topology, the geometric shape becomes the same as the star topology (even though signal
propagation is unchanged). The geometric shape of the star topology (and collapsed ring/
loop topology) simplifies cable plant installation and troubleshooting as compared to the
geometric shape of a conventional (distributed) ring/loop topology (shown in Figure 3-3).
Cable plant simplification is the primary benefit of a collapsed ring/loop topology versus a
conventional (distributed) ring/loop topology. However, the star topology provides other
benefits not achievable with the collapsed ring/loop topology (because of signal
propagation differences). Figure 3-7 illustrates the geometrical similarities and topological
differences of the star and collapsed ring/loop topologies.
Network devices connected in a linear manner form a cascade topology. A cascade
topology is geometrically the same as a bus topology. However, the term cascade usually
applies to a topology that connects network devices (such as hubs or switches), whereas the
term bus usually applies to a topology that connects end nodes. The name indicates the
topological context. Note that the bus topology enables direct communication between each
pair of attached devices. This direct communication model is an important feature of the
bus topology. A cascade topology may require additional protocols for inter-switch or inter-router communication. The connection between each pair of network devices (switch or
router) is sometimes treated as a point-to-point (PTP) topology. A PTP topology is a linear
topology with only two attached devices (similar to a very small bus topology). Protocols
that support multiple topologies often have special procedures or control signals for PTP
connections. In other words, the PTP communication model is not always the same as the
bus communication model. In fact, many protocols have evolved specifically for PTP
connections (for example, high-level data-link control [HDLC] and Point-to-Point Protocol
[PPP]). PTP protocols behave quite differently than protocols designed for the bus
topology. Some protocols developed for the bus topology also support the PTP topology.
However, protocols designed specifically for PTP connections typically do not operate
properly in a bus topology.
Devices connected in a circular manner form a ring/loop topology. From a geometrical
viewpoint, a ring/loop is essentially a cascade/bus with its two ends connected. However,
most protocols behave quite differently in a ring/loop topology than in a cascade/bus
topology. In fact, many protocols that support the ring/loop topology are specically
designed for that topology and have special procedures and control signals. Link
initialization, arbitration, and frame forwarding are just some of the procedures that require
special handling in a ring/loop topology.
Protocols not specifically designed for the ring/loop topology often have special procedures
or protocols for using a ring/loop topology. For example, Ethernet's Spanning Tree Protocol
(STP) is a special protocol that allows Ethernet to use a circular topology in a non-circular
manner (a process commonly known as loop suppression). By contrast, Fibre Channel's
loop initialization primitive (LIP) sequences allow Fibre Channel devices to use a circular
topology (FCAL) in a circular manner. Conventional logic suggests that a ring/loop topology
must contain at least three devices. However, a ring/loop can be formed with only two
devices. When only two devices are connected in a ring/loop topology, the geometry is the
same as a PTP topology, but communication occurs in accordance with the rules of the
particular ring/loop technology. For example, if an FCAL has many devices attached, and
all but two are removed, communication between the remaining two devices still must
follow the rules of FCAL communication.
Devices connected in a perpetually branching manner form a tree topology. Alternately, a
tree topology can have a root with multiple branches that do not subsequently branch. This
appears as multiple, separate cascades terminating into a common root. The tree topology
generally is considered the most scalable topology. Partial and full mesh topologies can be
highly complex. Many people believe that simpler is better in the context of network
design, and this belief has been vindicated time and time again throughout the history of
the networking industry.
Occam's razor underlies the most scalable, reliable network designs. For this reason, the
tree topology, which is hierarchical in nature, has proven to be one of the most effective
topologies for large-scale environments. However, complex partial-mesh topologies can
also scale quite large if the address allocation scheme is well considered, and the routing
protocols are congured with sufcient compartmentalization. The Internet offers ample
proof of that.
The first approach is service oriented. A service instance is located, followed by device
name resolution, and nally by optional address resolution. For example, a query for a
service instance might be sent via some service-locating protocol to a well-known unicast
or multicast address, an anycast address, or a broadcast address. A device replying to such
queries typically returns a list of device names known to support the service sought by the
querying device. The querying device then resolves one of the names to an address and
optionally resolves that address to another address. The querying device can then initiate
service requests.
The second approach is device oriented. Devices are discovered first and then queried for
a list of supported services. One technique involves the use of a device discovery protocol
to query a network by transmitting to a well-known multicast address or even the broadcast
address. Devices responding to such a protocol often provide their name or address in the
payload of the reply. Another technique is to directly query each unicast address in the
address space. A timeout value must be established to avoid a hang condition (waiting
forever) when a query is sent to an address at which no active device resides. Address
probing typically works well only for technologies that support a small address space.
Another technique is to send a query to a central registration authority to discover all
registered devices. This is more practical for technologies that support a large address
space. In such a case, devices are expected to register upon booting by sending a
registration request to a well-known unicast or multicast address or sometimes to the
broadcast address. Following a query, name or address resolution might be required
depending on the content of the reply sent by the central registration authority. After devices
have been discovered and names and addresses have been resolved, some or all of the
devices are directly probed for a list of supported services. The querying device may then
initiate service requests.
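The address-probing technique with a timeout can be sketched as follows. Using a TCP connection attempt as the probe is an illustrative assumption (real discovery protocols use their own query messages), but the timeout logic it demonstrates is exactly the hang-avoidance described above:

```python
import socket

def probe_address_space(addresses, port, timeout=0.5):
    """Hypothetical sketch of direct unicast address probing: query each
    address and give up after a timeout so a silent (unoccupied) address
    cannot hang the scan forever."""
    discovered = []
    for addr in addresses:
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                discovered.append(addr)  # a device answered at this address
        except OSError:
            pass                         # timeout or refusal; no device, move on
    return discovered
```

As the text notes, this approach only scales to small address spaces; each unoccupied address costs a full timeout period.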
High bandwidth
Low latency
No jitter
Inherent in-order delivery of frames and I/O transactions
Low probability of dropped frames
These criteria are easy to meet when designing a new network technology for small-scale
deployments with only one application to transport. However, the modern business climate
demands efficiency in all areas, which inevitably leads to infrastructure consolidation.
Meeting all of the SPI's legacy requirements in a converged network environment can be
quite challenging and complex. So we include a brief discussion of the SPI herein to ensure
that readers understand the legacy requirements.
SPI Throughput
As is true of most network technologies, the throughput of the SPI has improved
significantly since its inception. Four factors determine the maximum supported data
transfer rate of each SPI specication. These are data bus width, data transfer mode, signal
transition mode, and transceiver type. There are two data bus widths known as narrow
(eight bits wide) and wide (16 bits wide). There are three data transfer modes known as
asynchronous, synchronous, and paced. There are two signal transition modes known as single
transition (ST) and double transition (DT). There are four transceiver types known as
single-ended (SE), multimode single-ended (MSE), low voltage differential (LVD), and
high voltage differential (HVD). All these variables create a dizzying array of potential
combinations. SCSI devices attached to an SPI via any transceiver type always default to
the lowest throughput combination for the other three variables: narrow, asynchronous, and
ST. Initiators must negotiate with targets to use a higher throughput combination for data
transfers. Table 3-1 summarizes the maximum data transfer rate of each SPI specication
and the combination used to achieve that rate. Because the SPI is a parallel technology, data
transfer rates customarily are expressed in megabytes rather than megabits. Though the
SPI-5 specification is complete, very few products have been produced based on that
standard. Serial attached SCSI (SAS) was well under way by the time SPI-5 was finished,
so most vendors opted to focus their efforts on SAS going forward.
NOTE
A 32-bit data bus was introduced in SPI-2, but it was made obsolete in SPI-3. For this
reason, discussion of the 32-bit data bus is omitted.
Table 3-1  SPI Maximum Data Transfer Rates

Specification  Common Name         Transfer Rate  Negotiated Combination
SPI-2          Ultra2              80 MBps        wide, synchronous, ST, LVD
SPI-3          Ultra3 or Ultra160  160 MBps       wide, synchronous, DT, LVD
SPI-4          Ultra320            320 MBps       wide, paced, DT, LVD
SPI-5          Ultra640            640 MBps       wide, paced, DT, LVD
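The arithmetic behind these rates can be sketched briefly. The 40-MHz bus clock assumed below is not stated in the text; it is the commonly cited clock for Ultra2 and later generations, so treat it as an assumption:

```python
def spi_transfer_rate_mbps(clock_mhz, bus_width_bytes, double_transition):
    """Max data transfer rate = bytes per transfer x transfers per second.
    Double transition (DT) clocking latches data on both clock edges,
    doubling transfers per second at the same clock frequency."""
    transfers_per_microsecond = clock_mhz * (2 if double_transition else 1)
    return transfers_per_microsecond * bus_width_bytes

# Assuming a 40-MHz clock and a wide (16-bit, 2-byte) data bus:
print(spi_transfer_rate_mbps(40, 2, False))  # Ultra2: 80 MBps (ST)
print(spi_transfer_rate_mbps(40, 2, True))   # Ultra160: 160 MBps (DT)
```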
The SPI consists of multiple wires that are used for simultaneous parallel transmission of
signals. Some of these signals are used for control and operation of the SPI, and others are
used for data transfer. The set of wires used for data transfer is collectively called the data
bus. References to the data bus can be confusing because the data bus is not a separate bus,
but is merely a subset of wires within the SPI. The data bus also can be used to transfer
certain control information. Use of the data bus is regulated via a state machine. An SPI is
always in one of the following eight states (called bus phases):
BUS FREE
ARBITRATION
SELECTION
RESELECTION
COMMAND
DATA
STATUS
MESSAGE
The COMMAND, DATA, STATUS, and MESSAGE phases use the data bus, whereas the
other four phases do not. The COMMAND, STATUS, and MESSAGE phases always
operate in narrow, asynchronous, ST mode. The DATA phase may operate in several modes
subject to initiator/target negotiation. The transfer rates in Table 3-1 apply only to data
transfer across the data bus during the DATA phase. Data can be transferred across the data
bus only during the DATA phase. Control information can be transferred across the data bus
only during the COMMAND, STATUS, and MESSAGE phases. Each DATA phase is
preceded and followed by other phases. This is the reason some references to the SPI's
throughput describe the maximum transfer rate as a burst rate rather than a sustained rate.
However, the intervening phases are the logical equivalent of frame headers and inter-frame
spacing in serial networking technologies.
SPI Topologies
As previously stated, the SPI operates as a multidrop bus. Each attached device is connected
directly to a shared medium, and signals propagate the entire length of the medium
without being regenerated by attached devices. The device at each end of the bus must have
a terminator installed to prevent signal reection. All intermediate devices must have
their terminator removed or disabled to allow end-to-end signal propagation. Each device
uses the data bus for both transmission and reception of data. Moreover, there are no dedicated
transmit wires or dedicated receive wires in the data bus. There are only data wires. Each
device implements a dual function driver per data bus wire. Each driver is capable of
both transmission and reception. This contrasts with other technologies in which separate
transmit and receive wires require separate transmit and receive drivers in the attached
devices. In such technologies, the transmit driver of each device is connected to the receive
driver of the next device via a single wire. The signal on that wire is perceived as transmit
by one device and as receive by the other device. A second wire connects the reciprocal
receive driver to the reciprocal transmit driver. Figure 3-8 illustrates the difference.
Figure 3-8  SPI Common Drivers Versus Dedicated Drivers
[Figure: a terminated SPI multidrop bus, in which every device transmits and receives on
the shared data wires through common drivers, contrasted with dedicated-driver
technologies, in which each NIC's Tx wire connects to the peer NIC's Rx wire and vice
versa.]
The data bus implementation of the SPI makes the SPI inherently half-duplex. Though it is
possible to build dual function drivers capable of simultaneous transmission and reception,
the SPI does not implement such drivers. The half-duplex nature of the SPI reflects the
half-duplex nature of the SCSI protocol. The SPI was designed specifically for the SCSI
protocol, and the multidrop bus is the SPI's only supported topology.
NOTE
The SPI supports multiple initiators, but such a configuration is very rare in the real world.
Thus, a single initiator is assumed in all SPI discussions in this book unless otherwise
stated.
Ethernet
Many people misunderstand the current capabilities of Ethernet because of lingering
preconceptions formed during the early days of Ethernet. In its pre-switching era, Ethernet
had some severe limitations. However, most of Ethernet's major drawbacks have been
eliminated by the widespread adoption of switching and other technological advances. This
section explains how Ethernet has evolved to become the most broadly applicable LAN
technology in history, and how Ethernet provides the foundation for new services (like
storage networking) that might not be considered LAN-friendly.
to the medium. The method of arbitration is called carrier sense multiple access with
collision detection (CSMA/CD). ARCNET and Token Ring both employed half-duplex
communication at that time, because they too were shared media implementations.
ARCNET and Token Ring both employed arbitration based on token passing schemes in
which some nodes could have higher priority than others. The CSMA/CD mechanism has
very low overhead compared to token passing schemes. The tradeoffs for this low overhead
are indeterminate throughput because of unpredictable collision rates and (as mentioned
previously) the inability to achieve the maximum theoretical throughput.
As time passed, ARCNET and Token Ring lost market share to Ethernet. That shift in
market demand was motivated primarily by the desire to avoid unnecessary complexity, to
achieve higher throughput, and (in the case of Token Ring) to reduce costs. As Ethernet
displaced ARCNET and Token Ring, it expanded into new deployment scenarios. As more
companies became dependent upon Ethernet, its weaknesses became more apparent. The
unreliability of Ethernet's coax cabling became intolerable, so Thinnet and Thicknet gave
way to 10BASE-T. The indeterminate throughput of CSMA/CD also came into sharper
focus. This and other factors created demand for deterministic throughput and line-rate
performance. Naturally, the cost savings of Ethernet became less important to consumers
as their demand for increased functionality rose in importance. So, 10BASE-T switches
were introduced, and Ethernet's 10 Mbps line rate was increased to 100 Mbps (officially
called 100BASE-T but commonly called Fast Ethernet and abbreviated as FE). FE hubs
were less expensive than 10BASE-T switches and somewhat masked the drawbacks of
CSMA/CD. 10BASE-T switches and FE hubs temporarily satiated 10BASE-T users'
demand for improved performance.
FE hubs made it possible for Fiber Distributed Data Interface (FDDI) users to begin
migrating to Ethernet. Compared to FDDI, FE was inexpensive for two reasons. First, the
rapidly expanding Ethernet market kept prices in check as enhancements were introduced.
Second, FE used copper cabling that was (at the time) significantly less expensive than
FDDI's fiber optic cabling. FE hubs proliferated quickly as 10BASE-T users upgraded and
FDDI users migrated. Newly converted FDDI users renewed the demand for deterministic
throughput and line rate performance. They were accustomed to high performance because
FDDI used a token passing scheme to ensure deterministic throughput. Around the same
time, businesses of all sizes in all industries began to see LANs as mandatory rather than
optional. Centralized file servers and e-mail servers were proliferating and changing the
way businesses operated. So, FE hubs eventually gave way to switches capable of
supporting both 10 Mbps and 100 Mbps nodes (commonly called 10/100 auto-sensing
switches).
TIP
Auto-sensing is a common term that refers to the ability of Ethernet peers to exchange link
level capabilities via a process officially named auto-negotiation. Note that Fibre Channel
also supports auto-sensing.
10BASE-T switches had been adopted by some companies, but uncertainty about switching
technology and the comparatively high cost of 10BASE-T switches prevented their
widespread adoption. Ethernet switches cost more than their hub counterparts because they
enable full-duplex communication. Full-duplex communication enables each node to
transmit and receive simultaneously, which eliminates the need to arbitrate, which
eliminates Ethernet collisions, which enables deterministic throughput and full line-rate
performance. Another benet of switching is the decoupling of transmission rate from
aggregate throughput. In switched technologies, the aggregate throughput per port is twice
the transmission rate. The aggregate throughput per switch is limited only by the internal
switch design (the crossbar implementation, queuing mechanisms, forwarding decision
capabilities, and so on) and the number of ports. So, as people began to fully appreciate the
benets of switching, the adoption rate for 10/100 auto-sensing switches began to rise. As
switching became the norm, Ethernet left many of its limitations behind. Today, the vast
majority of Ethernet deployments are switch-based.
Ethernet's basic media access control (MAC) frame format contains little more than the
bare essentials. Ethernet assumes that most protocol functions are handled by upper layer
protocols. This too reflects Ethernet's low-overhead, high-efficiency philosophy. As
switching became the norm, some protocol enhancements became necessary and took the
form of MAC frame format changes, new frame types, and new control signals.
One of the more notable enhancements is support for link-level flow control. Many people
are unaware that Ethernet supports link-level flow control. Again reflecting its low-overhead
philosophy, Ethernet supports a simple back-pressure flow-control mechanism rather than
a credit-based mechanism. Some perceive this as a drawback because back-pressure
mechanisms can result in frame loss during periods of congestion. This is true of some
back-pressure mechanisms, but does not apply to Ethernet's current mechanism if it is
implemented properly. Some people remember the early days of Ethernet when some
Ethernet switch vendors used a different back-pressure mechanism (but not Cisco). Many
switches initially were deployed as high-speed interconnects between hub segments. If a
switch began to experience transmit congestion on a port, it would create back pressure on
the port to which the source node was connected. The switch did so by intentionally
generating a collision on the source node's segment. That action resulted in dropped frames
and lowered the effective throughput of all nodes on the source segment. The modern
Ethernet back-pressure mechanism is implemented only on full-duplex links, and uses
explicit signaling to control the transmission of frames intelligently on a per-link basis. It
is still possible to drop Ethernet frames during periods of congestion, but it is far less likely
in modern Ethernet implementations. We can further reduce the likelihood of dropped
frames through proper design of the MAC sublayer components. In fact, the IEEE 802.3-2002
specification explicitly advises system designers to account for processing and link-propagation
delays when implementing flow control. In other words, system designers can
and should proactively invoke the flow-control mechanism rather than waiting for all
receive buffers to fill before transmitting a flow-control frame. A receive buffer high-water
mark can be established to trigger flow-control invocation. Because there is no mechanism
for determining the round-trip time (RTT) of a link, system designers should take a
conservative approach to determining the high-water mark. The unknown RTT represents
the possibility of dropped frames. When frames are dropped, Ethernet maintains its
low-overhead philosophy by assuming that an upper layer protocol will handle detection
and retransmission of the dropped frames.
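The high-water-mark trigger described above can be sketched as follows. The buffer capacity and 75 percent threshold are hypothetical values chosen only to illustrate invoking flow control before the receive buffers fill:

```python
class ReceiveBuffer:
    """Minimal sketch (not a real MAC implementation) of proactive
    PAUSE-style flow control keyed to a receive-buffer high-water mark."""

    def __init__(self, capacity_frames, high_water_fraction=0.75):
        self.capacity = capacity_frames
        # Trigger early: frames already in flight during the (unknown)
        # round-trip time of the link must still be absorbed.
        self.high_water = int(capacity_frames * high_water_fraction)
        self.used = 0
        self.pause_sent = False

    def enqueue(self, frames=1):
        """Account for received frames; return True once flow control
        has been invoked."""
        self.used = min(self.capacity, self.used + frames)
        if self.used >= self.high_water and not self.pause_sent:
            self.pause_sent = True  # stand-in for transmitting a PAUSE frame
        return self.pause_sent
```

A conservative (lower) high-water mark trades buffer utilization for a lower probability of drops, which mirrors the design guidance in the text.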
Ethernet Throughput
As mentioned previously, Ethernet initially operated at only 10 Mbps. Over time, faster
transmission rates came to market beginning with 100 Mbps (also called Fast Ethernet
[FE]) followed by 1000 Mbps (GE). Each time the transmission rate increased, the auto-sensing capabilities of NICs and switch ports adapted. Today, 10/100/1000 auto-sensing
NICs and switches are common. Ethernet achieved a transmission rate of 10 Gbps (called
10Gig-E and abbreviated as 10GE) in early 2003. Currently, 10GE interfaces do not
interoperate with 10/100/1000 interfaces because 10GE does not support auto-negotiation.
10GE supports only full-duplex mode and does not implement CSMA/CD.
As previously mentioned, a detailed analysis of Ethernet data transfer rates is germane to
modern storage networking. This is because the iSCSI, Fibre Channel over TCP/IP (FCIP),
and Internet Fibre Channel Protocol (iFCP) protocols (collectively referred to as IP storage
(IPS) protocols) are being deployed on FE and GE today, and the vast majority of IPS
deployments will likely run on GE and 10GE as IPS protocols proliferate. So, it is useful
to understand the throughput of GE and 10GE when calculating throughput for IPS
protocols. The names GE and 10GE refer to their respective data bit rates, not to their
respective raw bit rates. This contrasts with some other technologies, whose common
names refer to their raw bit rates. The fiber optic variants of GE operate at 1.25 GBaud and
encode 1 bit per baud to provide a raw bit rate of 1.25 Gbps. The control bits reduce the
data bit rate to 1 Gbps. To derive ULP throughput, we must make some assumptions
regarding framing options. Using a standard frame size (no jumbo frames), the maximum
payload (1500 bytes), the 802.3 basic MAC frame format, no 802.2 header, and minimum
inter-frame spacing (96 bit times), a total of 38 bytes of framing overhead is incurred. The
ULP throughput rate is 975.293 Mbps.
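The 975.293 Mbps figure follows directly from the framing assumptions just stated. The per-field breakdown of the 38 bytes in the comment is a common accounting, inferred here rather than quoted from the text:

```python
# Per-frame overhead with standard framing: 7 (preamble) + 1 (SFD)
# + 14 (MAC header) + 4 (FCS) + 12 (minimum inter-frame gap) = 38 bytes.
DATA_BIT_RATE = 1_000_000_000   # GE data bit rate, bits per second
PAYLOAD = 1500                  # maximum standard payload, bytes
OVERHEAD = 38                   # framing overhead per frame, bytes

ulp_mbps = DATA_BIT_RATE * PAYLOAD / (PAYLOAD + OVERHEAD) / 1e6
print(round(ulp_mbps, 3))       # 975.293 Mbps
```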
The copper variant of GE is somewhat more complex. It simultaneously uses all four pairs
of wires in a Category 5 (Cat5) cable. Signals are transmitted on all four pairs in a striped
manner. Signals are also received on all four pairs in a striped manner. Implementing dual
function drivers makes full-duplex communication possible. Each signal operates at 125
MBaud. Two bits are encoded per baud to provide a raw bit rate of 1 Gbps. There are no
dedicated control bits, so the data bit rate is also 1 Gbps. This yields the same ULP
throughput rate as the ber optic variants.
The numerous variants of 10GE each fall into one of three categories:
10GBASE-X
10GBASE-R
10GBASE-W
GE and 10GE Throughput Rates

                 Baud Rate        Raw Bit Rate  Data Bit Rate  ULP Throughput
GE Fiber Optic   1.25 GBaud       1.25 Gbps     1 Gbps         975.293 Mbps
GE Copper        125 MBaud x 4    1 Gbps        1 Gbps         975.293 Mbps
10GBASE-X        3.125 GBaud x 4  12.5 Gbps     10 Gbps        9.75293 Gbps
10GBASE-R        10.3125 GBaud    10.3125 Gbps  10 Gbps        9.75293 Gbps
10GBASE-W        9.95328 GBaud    9.5846 Gbps   9.2942 Gbps    9.06456 Gbps

NOTE
The IEEE 802.3-2002 specification includes a 1 Mbps variant called 1Base5 that was
derived from an obsolete network technology called StarLAN. Discussion of 1Base5 is
omitted herein. StarLAN evolved to support 10 Mbps operation and was called StarLAN10.
The IEEE's 10BaseT specification was partially derived from StarLAN10.
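Each ULP throughput value above follows from the corresponding data bit rate and the framing assumptions the chapter states for GE (1500-byte payload, 38 bytes of per-frame overhead):

```python
def ulp_from_data_rate(data_bit_rate_gbps, payload=1500, overhead=38):
    """Apply the standard-frame payload ratio to a data bit rate."""
    return data_bit_rate_gbps * payload / (payload + overhead)

print(round(ulp_from_data_rate(10), 5))      # 10GBASE-X/R: 9.75293 Gbps
print(round(ulp_from_data_rate(9.2942), 5))  # 10GBASE-W: 9.06456 Gbps
```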
To transport IPS protocols, additional framing overhead must be incurred. Taking iSCSI as
an example, three additional headers are required: IP, TCP, and iSCSI. The TCP/IP section
of this chapter discusses IPS protocol overhead.
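As a rough sketch of that extra overhead, the calculation below assumes the base header sizes of each protocol (20-byte IP, 20-byte TCP, 48-byte iSCSI basic header segment) and ignores TCP options, digests, and the fact that one iSCSI PDU can span multiple frames; the detailed treatment belongs to the TCP/IP section:

```python
# Assumed header sizes in bytes (base formats, no options or digests).
IP_HDR, TCP_HDR, ISCSI_BHS = 20, 20, 48
ETH_OVERHEAD, MTU = 38, 1500    # standard Ethernet framing and payload

scsi_payload = MTU - IP_HDR - TCP_HDR - ISCSI_BHS   # 1412 bytes per frame
ulp_mbps = 1_000_000_000 * scsi_payload / (MTU + ETH_OVERHEAD) / 1e6
print(round(ulp_mbps, 1))       # 918.1: roughly 918 Mbps of SCSI data on GE
```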
Ethernet Topologies
Today, Ethernet supports all physical topologies. The original Ethernet I and Ethernet II
specifications supported only the bus topology. When the IEEE began development of the
first 802.3 specification, they decided to include support for the star topology. However,
the communication model remained bus-oriented. Although the 802.3-2002 specication
still includes the Thinnet and Thicknet bus topologies (officially called 10BASE2 and
10BASE5 respectively), they are obsolete in the real world. No other bus topologies are
specified in 802.3-2002. The star topology is considered superior to the bus topology for
various reasons. The primary reasons are easier cable plant installation, improved fault
isolation, and (when using a switch) the ability to support full-duplex communication. The
star topology can be extended by cascading Ethernet hubs (resulting in a hybrid topology).
There are two classes of FE hub. Class I FE hubs cannot be cascaded because of CSMA/
CD timing restrictions. Class II FE hubs can be cascaded, but no more than two are allowed
per collision domain. Cascading is the only means of connecting Ethernet hubs. FE hubs
were largely deprecated in favor of 10/100 switches. GE also supports half-duplex
operation, but consumers have not embraced GE hubs. Instead, 10/100/1000 switches have
become the preferred upgrade path. While it is technically possible to operate IPS protocols
on half-duplex Ethernet segments, it is not feasible because collisions can (and do) occur.
So, the remaining chapters of this book focus on switch-based Ethernet deployments.
Unlike Ethernet hubs, which merely repeat signals between ports, Ethernet switches bridge
signals between ports. For this reason, Ethernet switches are sometimes called multiport
bridges. In Ethernet switches, each port is a collision domain unto itself. Also, collisions
can occur only on ports operating in half-duplex mode. Because most Ethernet devices
(NICs and switches alike) support full-duplex mode and auto-negotiation, most switch
ports operate in full-duplex mode today. Without the restrictions imposed by CSMA/CD,
all topologies become possible. There is no restriction on the manner in which Ethernet
switches may be interconnected. Likewise, there is no restriction on the number of Ethernet
switches that may be interconnected. Ethernet is a broadcast-capable technology; therefore,
loops must be suppressed to avoid broadcast storms. As mentioned previously, this is
accomplished via STP. The physical inter-switch topology will always be reduced to a
logical cascade or tree topology if STP is enabled. Most switch-based Ethernet
deployments have STP enabled by default. The remainder of this book assumes that STP is
enabled unless stated otherwise.
A pair of modern Ethernet nodes can be directly connected using a twisted pair or fiber-optic
cable (crossover cable). The result is a PTP topology in which auto-negotiation occurs
directly between the nodes. The PTP topology is obviously not useful for mainstream
storage networking, but is useful for various niche situations. For example, dedicated
heartbeat connections between clustered devices are commonly implemented via Ethernet.
If the cluster contains only two devices, a crossover cable is the simplest and most reliable
solution.
NOTE
Each ULP has a reserved protocol identifier called an Ethertype. Originally, the Ethertype
was not included in the 802.3 header, but it is now used in the 802.3 header to identify the
intended ULP within the destination node. The Ethertype field enables multiplexing of
ULPs on Ethernet. Ethertypes could be used as well-known ports (WKPs) for the purpose
of ULP discovery, but each ULP would need to define its own probe/reply mechanism
to exploit the Ethertype field. This is not the intended purpose of the Ethertype field.
Ethertypes are assigned and administered by the IEEE to ensure global uniqueness.
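As an illustration of the Ethertype's position in the frame (a sketch added here, not part of the original text), the field occupies the two bytes immediately following the 6-byte destination and source MAC addresses. The 0x0800 value used below is the well-known Ethertype for IPv4; values below 0x0600 are 802.3 length fields rather than Ethertypes.

```python
import struct

def ethertype(frame: bytes) -> int:
    """Return the 16-bit Ethertype of an Ethernet II frame.

    The Ethertype occupies bytes 12-13, immediately after the 6-byte
    destination and 6-byte source MAC addresses. Values below 0x0600
    are 802.3 length fields rather than true Ethertypes.
    """
    (etype,) = struct.unpack_from("!H", frame, 12)
    return etype

# A minimal frame carrying the IPv4 Ethertype (0x0800)
frame = bytes(12) + b"\x08\x00" + bytes(46)
assert ethertype(frame) == 0x0800
```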
TCP/IP Suite
Many years ago, corporate LANs were characterized by multiple protocols. Each network
operating system (NOS) implemented its own protocol at OSI Layer 3. That made life very
difficult for systems administrators, network administrators, and users alike. Over time,
each NOS vendor began to support multiple protocols. That enabled system and network
administrators to converge on one protocol. For better or worse, IP eventually emerged as
the predominant OSI Layer 3 protocol for LANs, MANs, and WANs. For many companies,
the choice was not based on technical superiority, but rather on the need for Internet
connectivity. If any other protocol ever had a chance against IP, the Internet boom sealed its
fate. Today, IP is by far the most ubiquitous OSI Layer 3 protocol on earth.
interested parties comprising vendors and users alike. Maintenance of the global address
pool, root domain list, protocol documentation, port registry, and other operational
constructs is overseen by nonprofit organizations. While no protocol suite is perfect, TCP/IP
seems to get closer every day as the result of unprecedented development efforts aimed
at meeting the ever-increasing range of demands placed upon TCP/IP. TCP/IP's application
support is unrivaled by any other protocol suite, and virtually every operating system
supports TCP/IP. Additionally, IP supports virtually every OSI Layer 2 protocol. That
powerful combination of attributes forms the basis of TCP/IP's truly ubiquitous
connectivity. Ubiquitous connectivity is hugely advantageous for the users of TCP/IP.
Ubiquitous connectivity fosters new relationships among computing resources and human
communities. In an interesting quote about the thought process that led to the creation of
the ARPANET protocols and eventually the modern TCP/IP suite, Stephen Crocker states,
"We looked for existing abstractions to use. It would have been convenient if we could have
made the network simply look like a tape drive to each host, but we knew that wouldn't do."
The recent ratification of the iSCSI protocol, which adapts tape drives and other storage
devices to TCP/IP, is testimony of how far TCP/IP has come.
TCP/IP Throughput
IP runs on lower-layer protocols; therefore the achievable throughput depends on the
underlying technology. At the low end, IP can run on analog phone lines via PPP and a
modem at speeds from 300 bps to 56 Kbps. At the high end, IP can run on SONET
OC-192c or 10GE at approximately 10 Gbps. Many factors can affect throughput in IP
networks. The maximum transmission unit (MTU) of the underlying technology
determines the TCP maximum segment size (MSS) for the source node. The larger the
MSS, the more efficiently TCP can use the underlying technology.
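The MTU-to-MSS relationship can be stated as simple arithmetic; the following sketch (added for illustration) assumes base IP and TCP headers with no options:

```python
def tcp_mss(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """Derive the TCP maximum segment size from the link MTU by
    subtracting the base IP and TCP header lengths (no options)."""
    return mtu - ip_header - tcp_header

# Standard 1500-byte Ethernet MTU yields the familiar 1460-byte MSS.
assert tcp_mss(1500) == 1460
```

Jumbo frames (a 9000-byte MTU, where supported) raise the MSS proportionally, which is why they improve TCP efficiency.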
Fragmentation of IP packets at intermediate routers also can affect throughput. Fragmentation
occurs when a router forwards an IP packet onto an interface with a smaller MTU than
the source interface. Once a packet is fragmented, it must remain fragmented across the
remainder of the path to the destination node. That decreases the utilization efficiency of
the path from the point of fragmentation to the destination node. IP fragmentation also
increases the processing burden on the intermediate router that performs the fragmentation,
and on the destination node that reassembles the fragments. This can lead to degradation of
performance by increasing CPU and memory consumption. Path MTU (PMTU) discovery
offers a means of avoiding IP fragmentation by using ICMP immediately before TCP
session establishment. (See Chapter 6, "OSI Network Layer," for more information about
ICMP and PMTU discovery in IP networks.)
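The cost of fragmentation can be quantified. Under IPv4, each fragment carries its own 20-byte header, and every fragment's payload except the last must be a multiple of 8 bytes. The sketch below (an illustration, not from the original text) counts the fragments produced when a full-size packet crosses a smaller-MTU link:

```python
import math

def fragment_count(packet_len: int, out_mtu: int, ip_header: int = 20) -> int:
    """Number of fragments created when an IPv4 packet is forwarded onto
    a link with a smaller MTU. Each fragment carries its own IP header,
    and all fragment payloads except the last are multiples of 8 bytes."""
    payload = packet_len - ip_header
    per_fragment = (out_mtu - ip_header) // 8 * 8   # round down to 8-byte multiple
    return math.ceil(payload / per_fragment)

# A full 1500-byte Ethernet packet crossing a 576-byte MTU link becomes
# three fragments (552 + 552 + 376 bytes of payload, each with a new header).
assert fragment_count(1500, 576) == 3
```

The extra headers and the undersized final fragment are exactly the efficiency loss described above, which PMTU discovery avoids.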
TCP optimization is another critical factor. There are many parameters to consider when
optimizing TCP performance. RFC 1323 describes many such parameters. Windowing is
central to all TCP operations. The TCP slow start algorithm describes a process by which
end nodes begin communicating slowly in an effort to avoid overwhelming intermediate
network links. This is necessary because end nodes generally do not know how much
bandwidth is available in the network or how many other end nodes are vying for that
bandwidth.
Ethernet Variant    Ethernet Throughput    iSCSI ULP Throughput
GE Fiber Optic      975.293 Mbps           918.075 Mbps
GE Copper           975.293 Mbps           918.075 Mbps
10GBASE-X           9.75293 Gbps           9.18075 Gbps
10GBASE-R           9.75293 Gbps           9.18075 Gbps
10GBASE-W           9.06456 Gbps           8.53277 Gbps
The ULP throughput calculation for iSCSI is different from that for FCIP and iFCP. Both
FCIP and iFCP use the common FC frame encapsulation (FC-FE) format defined in RFC
3643. The FC-FE header consists of 7 words, which equates to 28 bytes. This header is
encapsulated within the TCP payload. Using the same assumptions as in Table 3-3, the ULP
throughput can be calculated by adding 20 bytes for the IP header, 20 bytes for the TCP
header, and 28 bytes for the FC-FE header. Table 3-4 summarizes the throughput rates
available to FC via
FCIP and iFCP on an Ethernet network.
Table 3-4 FC Throughput via FCIP and iFCP on Ethernet

Ethernet Variant    Ethernet Throughput    FC ULP Throughput
GE Fiber Optic      975.293 Mbps           931.079 Mbps
GE Copper           975.293 Mbps           931.079 Mbps
10GBASE-X           9.75293 Gbps           9.31079 Gbps
10GBASE-R           9.75293 Gbps           9.31079 Gbps
10GBASE-W           9.06456 Gbps           8.65364 Gbps
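The per-frame arithmetic behind these tabulated values can be sketched as follows. The 1500-byte MTU and the 48-byte iSCSI basic header segment are assumptions consistent with the tabulated values, not figures stated on this page:

```python
def ulp_throughput(ethernet_mbps: float, ulp_header: int, mtu: int = 1500,
                   ip_header: int = 20, tcp_header: int = 20) -> float:
    """Scale the Ethernet throughput by the fraction of each maximum-size
    IP packet left for the ULP after IP, TCP, and ULP headers."""
    payload = mtu - ip_header - tcp_header - ulp_header
    return ethernet_mbps * payload / mtu

# iSCSI adds a 48-byte basic header segment; FCIP/iFCP add a 28-byte FC-FE header.
iscsi = ulp_throughput(975.293, ulp_header=48)   # ~918.08 Mbps on GE
fc_fe = ulp_throughput(975.293, ulp_header=28)   # ~931.08 Mbps on GE
```

The smaller FC-FE header is why FCIP and iFCP deliver slightly more ULP throughput than iSCSI on the same Ethernet variant.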
Note that the ULP in Table 3-4 is FC, not SCSI. To determine the throughput available to
SCSI, the FC framing overhead must also be included. If the IP Security (IPsec) protocol
suite is used with any of the IPS protocols, an additional header must be added. This further
decreases the throughput available to the ULP. Figure 3-9 illustrates the protocol stacks for
iSCSI, FCIP, and iFCP. Note that neither FCIP nor iFCP adds additional header bits. Certain
bits in the FC-FE header are available for ULP-specific usage.
Figure 3-9 Protocol stacks for iSCSI, FCIP, and iFCP

iSCSI:  SCSI | iSCSI | TCP | IP | Ethernet
FCIP:   SCSI | FCP | FC | FCIP | FC-FE | TCP | IP | Ethernet
iFCP:   SCSI | FCP | FC | iFCP | FC-FE | TCP | IP | Ethernet
TCP/IP Topologies
IP supports all physical topologies but is subject to the limitations of the underlying
protocol. OSI Layer 2 technologies often reduce complex physical topologies into simpler
logical topologies. IP sees only the logical topology created by the underlying technology.
For example, the logical tree topology created by Ethernet's STP appears to be the physical
topology as seen by IP. IP routing protocols can organize the OSI Layer 2 end-to-end
logical topology (that is, the concatenation of all interconnected OSI Layer 2 logical
topologies) into a wide variety of OSI Layer 3 logical topologies.
Large-scale IP networks invariably incorporate multiple OSI Layer 2 technologies, each
with its own limitations. The resulting logical topology at OSI Layer 3 is usually a partial
mesh or a hybrid of simpler topologies. Figure 3-10 illustrates a hybrid topology in which
multiple sites, each containing a tree topology, are connected via a ring topology.
IP routing is a very complex topic that exceeds the scope of this book. Chapter 10, "Routing
and Switching Protocols," introduces some basic routing concepts, but does not cover the
topic in depth. One point is worth mentioning: Some IP routing protocols divide a large
topology into multiple smaller topologies called areas or autonomous regions. The
boundary between the logical topologies is usually called a border. The logical topology on
each side of a border is derived independently by the instance of the routing protocol
running within each area or autonomous region.
Figure 3-10 Multiple tree-topology sites connected via a WAN ring
Discovery Contexts
In the context of humans, a user often learns the location of a desired service via noncomputerized means. For example, a user who needs access to a corporate e-mail server is
told the name of the e-mail server by the e-mail administrator. The World Wide Web (more
commonly referred to as the Web) provides another example. Users often learn the name of
websites via word of mouth, e-mail, or TV advertisements. For example, a TV commercial
that advertises for Cisco Systems would supply the company's URL, http://www.cisco.com/.
When the user decides to visit the URL, the service (HTTP) and the host
providing the service (www.cisco.com) are already known to the user. So, service and
device discovery mechanisms are not required. Name and address resolution are the only
required mechanisms. The user simply opens the appropriate application and supplies the
name of the destination host. The application transparently resolves the host name to an IP
address via the Domain Name System (DNS).
Another broadly deployed IP-based name resolution mechanism is the NetBIOS Name
Service (NBNS). NBNS enables Microsoft Windows clients to resolve NetBIOS names to
IP addresses. The Windows Internet Name Service (WINS) is Microsoft's implementation
of NBNS. DNS and NBNS are both standards specified by IETF RFCs, whereas WINS is
proprietary to Microsoft. Once the host name has been resolved to an IP address, the TCP/IP
stack may invoke ARP to resolve the Ethernet address associated with the destination IP
address. An attempt to resolve the destination host's Ethernet address occurs only if the
IP address of the destination host is within the same IP subnet as the source host. Otherwise,
the Ethernet address of the default gateway is resolved. Sometimes a user does not know of
instances of the required service and needs assistance locating such instances. The Service
Location Protocol (SLP) can be used in those scenarios. SLP is discussed in the context of
storage in the following paragraphs.
In the context of storage, service and device discovery mechanisms are required.
Depending on the mechanisms used, the approach may be service-oriented or
device-oriented. This section uses iSCSI as a representative IPS protocol. An iSCSI target
node represents a service (SCSI target). An iSCSI target node is, among other things, a
process that acts upon SCSI commands and returns data and status to iSCSI initiators.
There are three ways to inform an iSCSI initiator of iSCSI target nodes: manual
configuration, semi-manual configuration, and automated configuration.
Manual Configuration
Manual configuration works well in small-scale environments where the incremental cost
and complexity of dynamic discovery is difficult to justify. Manual configuration can also
be used in medium-scale environments that are mostly static, but this is not recommended
because the initial configuration can be onerous. The administrator must supply each
initiator with a list containing the IP address, port number, and iSCSI target name
associated with each iSCSI target node that the initiator will access. TCP port number 3260
is registered for use by iSCSI target devices, but iSCSI target devices may listen on other
port numbers. The target name can be specified in extended unique identifier (EUI) format,
iSCSI qualified name (IQN) format, or network address authority (NAA) format (see
Chapter 8, "OSI Session, Presentation, and Application Layers"). After an initiator
establishes an iSCSI session with a target node, it can issue a SCSI REPORT LUNS
command to discover the LUNs defined on that target node.
Semi-Manual Configuration
Semi-manual configuration works well for small- to medium-scale environments. It involves
the use of the iSCSI SendTargets command. The SendTargets command employs a
device-oriented approach. To understand the operation of the SendTargets command, some
background information is needed. There are two types of iSCSI session: discovery and
normal. All iSCSI sessions proceed in two phases: login and full feature. The login phase
always occurs rst. For discovery sessions, the purpose of login is to identify the initiator
node to the target entity, so that security filters can be applied to responses. Thus, the initiator
node name must be included in the login request. Because target node names are not yet
known to the initiator, the initiator is not required to specify a target node name in the login
request. So, the initiator does not log in to any particular target node. Instead, the initiator
logs into the unidentified iSCSI entity listening at the specified IP address and TCP port. This
special login procedure is unique to discovery sessions. Normal iSCSI sessions require the
initiator to specify the target node name in the login request. Upon completion of login, a
discovery session changes to the full-feature phase. iSCSI commands can be issued only
during the full-feature phase. During a discovery session, only the SendTargets command
may be issued; no other operations are supported. The sole purpose of a discovery session is
to discover the names of and paths to target nodes. Upon receiving a SendTargets command,
the target entity issues a SendTargets response containing the iSCSI node names of targets
accessible via the IP address and TCP port at which the SendTargets command was received.
The response may also contain additional IP addresses and TCP ports at which the specified
target nodes can be reached. After discovery of target nodes, a SCSI REPORT LUNS
command must be issued to each target node to discover LUNs (via a normal iSCSI session).
To establish a discovery session, the initiator must have some knowledge of the target
entities. Thus, the initiators must be manually configured with each target entity's IP
address and TCP port number. The SendTargets command also may be used during normal
iSCSI sessions for additional path discovery to known target nodes.
The SendTargets command contains a parameter that must be set to one of three possible
values: ALL, the name of an iSCSI target node, or null. The parameter value ALL can
be used only during an iSCSI discovery session. The administrator must configure each
initiator with at least one IP address and port number for each target device. Upon boot
or reset, the initiator establishes a TCP session and an iSCSI discovery session to each
configured target device. The initiator then issues a SendTargets command with a value of
ALL to each target device. Each target device returns a list containing all iSCSI target
names (representing iSCSI target nodes) to which the initiator has been granted access. The
IP address(es), port number(s), and target portal group tag(s) (TPGT) at which each target
node can be reached are also returned. (The TPGT is discussed in Chapter 8, "OSI Session,
Presentation, and Application Layers.")
After initial discovery of target nodes, normal iSCSI sessions can be established. The
discovery session may be maintained or closed. Subsequent discovery sessions may be
established. If an initiator issues the SendTargets command during a normal iSCSI session,
it must specify the name of a target node or use a parameter value of null. When the
parameter value is set to the name of an iSCSI target node, the target device returns the IP
address(es), port number(s), and TPGT(s) at which the specied target node can be reached.
This is useful for discovering new paths to the specied target node or rediscovering paths
after an unexpected session disconnect. This parameter value is allowed during discovery
and normal iSCSI sessions. The third parameter value of null is similar to the previous
example, except that it can be used only during normal iSCSI sessions. The target device
returns a list containing all IP address(es), port number(s), and TPGT(s) at which the target
node of the current session can be reached. This is useful for discovering path changes
during a normal iSCSI session.
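The SendTargets response itself is delivered as iSCSI text key-value pairs: a TargetName= key for each target node, followed by zero or more TargetAddress= keys of the form ip:port,tpgt (per RFC 3720). The following sketch parses such a response; it is an illustration added here (the target name is hypothetical, and IPv4 addresses are assumed since RFC 3720 brackets IPv6 addresses):

```python
def parse_send_targets(response: str) -> dict:
    """Parse iSCSI SendTargets text keys into a mapping of
    target name -> list of (address, tcp_port, tpgt) tuples.

    Assumes IPv4 addresses; IPv6 addresses are bracketed in RFC 3720."""
    targets, current = {}, None
    for line in response.splitlines():
        key, _, value = line.partition("=")
        if key == "TargetName":
            current = value
            targets.setdefault(current, [])
        elif key == "TargetAddress" and current is not None:
            addr_port, _, tpgt = value.rpartition(",")
            addr, _, port = addr_port.rpartition(":")
            targets[current].append((addr, int(port), int(tpgt)))
    return targets

# Hypothetical response advertising two portals for one target node
reply = ("TargetName=iqn.2000-01.com.example:array1\n"
         "TargetAddress=192.0.2.10:3260,1\n"
         "TargetAddress=192.0.2.11:3260,1\n")
paths = parse_send_targets(reply)
```

Multiple TargetAddress entries for one TargetName are exactly the "additional path discovery" described above.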
The iSCSI RFCs specify no method for automating the discovery of target devices.
However, it is technically possible for initiators to probe with echo-request ICMP packets
to discover the existence of other IP devices. Given the range of possible IP addresses, it is
not practical to probe every IP address. So, initiators would need to limit the scope of their
ICMP probes (perhaps to their local IP subnet). Initiators could then attempt to establish a
TCP session to port 3260 at each IP address that replied to the echo-request probe. Upon
connection establishment, an iSCSI discovery session could be established followed by an
iSCSI SendTargets command. Target devices that are not listening on the reserved port
number would not be discovered by this method. Likewise, target devices on unprobed IP
subnets would not be discovered. This probe method is not recommended because it is not
defined in any iSCSI-related RFC, because it has the potential to generate considerable
overhead traffic, and because it suffers from functional limitations.
Automated Configuration
Automated configuration is possible via SLP, which is defined in RFC 2165 and updated in
RFC 2608. SLP employs a service-oriented approach that works well for medium- to
large-scale environments. SLP defines three entities known as the user agent (UA), service
agent (SA), and directory agent (DA). The UA is a process that runs on a client device. It issues
service-request messages via multicast or broadcast on behalf of client applications. (Other
SLP message types are defined but are not discussed herein.) The SA is a process that runs
on a server device and replies to service-request messages via unicast if the server device is
running a service that matches the request. The SLP service type templates that describe
iSCSI services are defined in RFC 4018. UAs may also send service request messages via
unicast if the server location (name or address) is known. An SA must reply to all unicast
service requests, even if the requested service is not supported. A UA may include its iSCSI
initiator name in the service request message. This allows SAs to filter requests and reply
only to authorized initiators. Such a filter is generically called an access control list (ACL).
For scalability, the use of one or more SLP DAs can be enlisted. The DA is a process that
runs on a server device and provides a central registration facility for SAs. If a DA is
present, each SA registers its services with the DA, and the DA replies to UAs on behalf of
SAs. SA service information is cached in the DA store.
There are four ways that SAs and UAs can discover DAs: multicast/broadcast request,
multicast/broadcast advertisement, manual configuration, and Dynamic Host
Configuration Protocol (DHCP).
The first way involves issuing a service-request message via multicast or broadcast seeking
the DA service. The DA replies via unicast. The reply consists of a DAAdvert message
containing the name or address of the DA and the port number of the DA service if a
non-standard port is in use.
The second way involves listening for unsolicited DAAdvert messages, which are
transmitted periodically via multicast or broadcast by each DA.
The third way is to manually configure the addresses of DAs on each device containing a
UA or SA. This approach defeats the spirit of SLP and is not recommended. The fourth way
is to use DHCP to advertise the addresses of DAs. DHCP code 78 is defined as the SLP
Directory Agent option.
When a DA responds to a UA service request message seeking services other than the DA
service, the reply contains the location (host name/IP address and TCP port number if a
non-standard port is in use) of all hosts that have registered the requested service. In the
case of iSCSI, the reply also contains the iSCSI node name of the target(s) accessible at
each IP address. Normal name and address resolution then occurs as needed.
SLP supports a scope feature that increases scalability. SLP scopes enable efficient use of
DAs in multi-DA environments by confining UA discovery within administratively defined
boundaries. The SLP scope feature offers some security benets, but it is considered
primarily a provisioning tool. Every SA and DA must belong to one or more scopes, but
scope membership is optional for UAs. A UA that belongs to a scope can discover services
only within that scope. UAs can belong to more than one scope at a time, and scope
membership is additive. UAs that do not belong to a scope can discover services in all
scopes. Scope membership can be manually configured on each UA and SA, or DHCP can
be used. DHCP code 79 is defined as the SLP Service Scope option.
Automated configuration is also possible via the Internet Storage Name Service (iSNS), which
is defined in RFC 4171. iSNS employs a service-oriented approach that is well suited to
large-scale environments. Like SLP, iSNS provides registration and discovery services. Unlike SLP,
these services are provided via a name server modeled from the Fibre Channel Name Server
(FCNS). Multiple name servers may be present, but only one may be active. The others act as
backup name servers in case the primary name server fails. All discovery requests are
processed by the primary iSNS name server; target devices do not receive or reply to discovery
requests. This contrasts with the SLP model, in which direct communication between initiators
and target devices can occur during discovery. Because of this, the iSNS client is equivalent to
both the SLP SA and UA. The iSNS server is equivalent to the SLP DA, and the iSNS database
is equivalent to the SLP DA store. Also like SLP, iSNS provides login and discovery control.
The iSNS Login Control feature is equivalent to the initiator name filter implemented by SLP
targets, but iSNS Login Control is more robust. The iSNS discovery domain (DD) is equivalent
to the SLP scope, but iSNS DDs are more robust. Unlike SLP, iSNS supports centralized
configuration, state change notification (SCN), and device mapping.
iSNS clients locate iSNS servers via the same four methods that SLP UAs and SAs use to
locate SLP DAs. The first method is a client-initiated multicast/broadcast request. Rather
than define a new procedure for this, iSNS clients use the SLP multicast/broadcast
procedure. This method requires each iSNS client to implement an SLP UA and each iSNS
server to implement an SLP SA. If an SLP DA is present, the iSNS server's SA registers
with the DA, and the DA responds to iSNS clients' UA requests. Otherwise, the iSNS
server's SA responds directly to iSNS clients' UA requests. The service request message
contains a request for the iSNS service.
The second method is a server-initiated multicast/broadcast advertisement. The iSNS server
advertisement is called the Name Service Heartbeat. In addition to client discovery, the
heartbeat facilitates iSNS primary server health monitoring by iSNS backup servers.
The third method is manual configuration. Though this method is not explicitly permitted
in the iSNS RFC, support for manual configuration is common in vendor implementations
of practically every protocol. As with SLP, this approach is not recommended because it
undermines the spirit of iSNS. The fourth method is DHCP. DHCP code 83 is defined as
the iSNS option.
All clients (initiators and targets) can register their name, addresses, and services with the
name server on the iSNS server upon boot or reset, but registration is not required. Any
registered client (including target nodes) can query the name server to discover other
registered clients. When a registered iSCSI initiator queries the iSNS, the reply contains the
IP address(es), TCP port(s), and iSCSI node name of each target node accessible by the
initiator. Unregistered clients are denied access to the name server. In addition, clients can
optionally register for state change notification. An SCN message updates registered clients
whenever a change occurs in the iSNS database (such as a new client registration). SCN
messages are limited by DD membership, so messages are sent only to the affected clients.
This is known as regular SCN registration. Management stations can also register for SCN.
Management registration allows all SCN messages to be sent to the management node
regardless of DD boundaries. Target devices may also register for entity status inquiry (ESI)
messages. ESI messages allow the iSNS server to monitor the reachability of target devices.
An SCN message is generated when a target device is determined to be unreachable.
DD membership works much like SLP scope membership. Clients can belong to one or more
DDs simultaneously, and DD membership is additive. A default DD may be defined into
which all clients not explicitly assigned to at least one named DD are placed. Clients in the
default DD may be permitted access to all clients in all DDs or may be denied access to all
DDs other than the default DD. The choice is implementation specific. Clients belonging to
one or more named DDs are allowed to discover only those clients that are in at least one
common DD. This limits the probe activity that typically follows target node discovery. As
mentioned previously, a SCSI REPORT LUNS command must be issued to each target node
to discover LUNs (via a normal iSCSI session). After discovery of target nodes, LUN discovery
is usually initiated to every discovered target node. By limiting discovery to only those target
nodes that the initiator will use, unnecessary probe activity is curtailed. Management nodes are
allowed to query the entire iSNS database without consideration for DD membership.
Management nodes also can update the iSNS database with DD and Login Control
configuration information that is downloadable by clients, thus centralizing configuration
management. The notion of a DD set (DDS) is supported by iSNS to improve manageability.
Many DDs can be defined, but only those DDs that belong to the currently active DDS are
considered active.
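The DD/DDS visibility rule described above reduces to a set intersection. The following is a simplified illustration of that rule (added here; it models only the membership logic, not the iSNS wire protocol, and the DD names are hypothetical):

```python
def can_discover(initiator_dds: set, target_dds: set, active_dds: set) -> bool:
    """An initiator may discover a target only when both belong to at least
    one common discovery domain (DD) that is in the active DD set (DDS)."""
    return bool(initiator_dds & target_dds & active_dds)

active = {"dd-prod"}   # only DDs in the currently active DDS count
assert can_discover({"dd-prod"}, {"dd-prod", "dd-test"}, active)
assert not can_discover({"dd-test"}, {"dd-test"}, active)   # dd-test inactive
```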
Microsoft has endorsed iSNS as its preferred iSCSI service location mechanism. Additionally,
iSNS is required for iFCP operation. However, SLP was recently augmented by the IETF
to better accommodate the requirements of FCIP and iSCSI. Thus, it is reasonable to expect
that both iSNS and SLP will proliferate in IPS environments.
An iSNS database can store information about iSCSI and Fibre Channel devices. This enables
mapping of iSCSI devices to Fibre Channel devices, and vice versa. The common iSNS
database also facilitates transparent management across both environments (assuming that the
management application supports this). Note that there are other ways to accomplish this. For
example, the Cisco MDS9000 cross-registers iSNS devices in the FCNS and vice versa. This
enables the Cisco Fabric Manager to manage iSCSI devices via the FCNS.
Fibre Channel
FC is currently the network technology of choice for storage networks. FC is designed to
offer high throughput, low latency, high reliability, and moderate scalability. Consequently,
FC can be used for a broad range of ULPs. However, market adoption of FC for general
purpose IP connectivity is not likely, given Ethernet's immense installed base, lower cost,
and comparative simplicity. Also, FC provides the combined functionality of Ethernet and
TCP/IP. So, running TCP/IP on FC represents a duplication of network services that
unnecessarily increases cost and complexity. That said, IP over FC (IPFC) is used to solve
certain niche requirements.
NOTE
Some people think FC has higher throughput than Ethernet and, on that basis, would be a
good fit for IP networks. However, Ethernet supports link aggregation in increments of one
Gbps. So, a host that needs more than one Gbps can achieve higher throughput simply by
aggregating multiple GE NICs. This is sometimes called NIC teaming, and it is very common.
FC Throughput
FCP is the FC-4 mapping of SCSI-3 onto FC. Understanding FCP throughput in the same
terms as iSCSI throughput is useful because FCP and iSCSI can be considered direct
competitors. (Note that most vendors position these technologies as complementary today,
but both of these technologies solve the same business problems.) Fibre Channel
throughput is commonly expressed in bytes per second rather than bits per second. This is
similar to SPI throughput terminology, which FC aspires to replace. In the initial FC
specification, rates of 12.5 MBps, 25 MBps, 50 MBps, and 100 MBps were introduced on
several different copper and ber media. Additional rates were subsequently introduced,
including 200 MBps and 400 MBps. These colloquial byte rates, when converted to bits per
second, approximate the data bit rate (like Ethernet). 100 MBps FC is also known as one
Gbps FC, 200 MBps as 2 Gbps, and 400 MBps as 4 Gbps. These colloquial bit rates
approximate the raw bit rate (unlike Ethernet). This book uses bit per second terminology
for FC to maintain consistency with other serial networking technologies. Today, 1 Gbps
and 2 Gbps are the most common rates, and fiber-optic cabling is the most common
medium. That said, 4 Gbps is being rapidly and broadly adopted. Additionally, ANSI
recently defined a new rate of 10 Gbps (10GFC), which is likely to be used solely for
inter-switch links (ISLs) for the next few years. Storage array vendors might adopt 10GFC
eventually. ANSI is expected to begin defining a new rate of 8 Gbps in 2006. The remainder
of this book focuses on FC rates equal to and greater than 1 Gbps on fiber-optic cabling.
The FC-PH specification defines baud rate as the encoded bit rate per second, which means
the baud rate and raw bit rate are equal. The FC-PI specification redefines baud rate more
accurately and states explicitly that FC encodes 1 bit per baud. Indeed, all FC-1 variants up
to and including 4 Gbps use the same encoding scheme (8B/10B) as GE fiber-optic variants.
1-Gbps FC operates at 1.0625 GBaud, provides a raw bit rate of 1.0625 Gbps, and provides
a data bit rate of 850 Mbps. 2-Gbps FC operates at 2.125 GBaud, provides a raw bit rate of
2.125 Gbps, and provides a data bit rate of 1.7 Gbps. 4-Gbps FC operates at 4.25 GBaud,
provides a raw bit rate of 4.25 Gbps, and provides a data bit rate of 3.4 Gbps. To derive ULP
throughput, the FC-2 header and inter-frame spacing overhead must be subtracted. Note
that FCP does not dene its own header. Instead, elds within the FC-2 header are used by
86
FCP. The basic FC-2 header adds 36 bytes of overhead. Inter-frame spacing adds another
24 bytes. Assuming the maximum payload (2112 bytes) and no optional FC-2 headers,
the ULP throughput rate is 826.519 Mbps, 1.65304 Gbps, and 3.30608 Gbps for 1 Gbps,
2 Gbps, and 4 Gbps respectively. These ULP throughput rates are available directly
to SCSI.
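These figures are easy to verify. Although this book's discussion is not code based, the arithmetic can be sketched in a few lines of Python (an illustration of ours; the constant names are not from any FC specification). The sketch derives the data bit rate from the 8B/10B coding efficiency and the ULP throughput from the overhead figures just given:

```python
# ULP throughput for 1/2/4-Gbps FC, using the overhead figures above.
# 8B/10B encoding carries 8 data bits in every 10 transmitted bits.

FC2_HEADER = 36   # bytes of basic FC-2 header
INTERFRAME = 24   # bytes of inter-frame spacing
PAYLOAD = 2112    # maximum payload, no optional FC-2 headers

def data_bit_rate(gbaud):
    """Data bit rate in Gbps from the baud rate (1 bit/baud, 8B/10B)."""
    return gbaud * 8 / 10

def ulp_throughput(gbaud):
    """FCP ULP throughput in Gbps: the payload fraction of the data bit rate."""
    return data_bit_rate(gbaud) * PAYLOAD / (PAYLOAD + FC2_HEADER + INTERFRAME)

for gbaud in (1.0625, 2.125, 4.25):
    print(f"{gbaud} GBaud -> {data_bit_rate(gbaud):.3f} Gbps data, "
          f"{ulp_throughput(gbaud):.5f} Gbps ULP")
```

Running it reproduces the 850-Mbps, 1.7-Gbps, and 3.4-Gbps data bit rates and the 826.519-Mbps, 1.65304-Gbps, and 3.30608-Gbps ULP throughput rates cited above.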
The 10GFC specification builds upon the 10GE specification. 10GFC supports five
physical variants. Two variants are parallel implementations based on 10GBASE-X.
One variant is a parallel implementation similar to 10GBASE-X that employs four pairs
of fiber strands. Two variants are serial implementations based on 10GBASE-R. All
parallel implementations operate at a single baud rate. Likewise, all serial variants
operate at a single baud rate. 10GFC increases the 10GE baud rates by 2 percent. Parallel
10GFC variants operate at 3.1875 GBaud per signal, provide an aggregate raw bit rate of
12.75 Gbps, and provide an aggregate data bit rate of 10.2 Gbps. Serial 10GFC variants
operate at 10.51875 GBaud, provide a raw bit rate of 10.51875 Gbps, and provide a data
bit rate of 10.2 Gbps. Note that serial 10GFC variants are more efficient than parallel
10GFC variants because of their different encoding schemes. Assuming the maximum
payload (2112 bytes) and no optional FC-2 headers, the ULP throughput rate is 9.91823
Gbps for all 10GFC variants. This ULP throughput rate is available directly to SCSI.
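The efficiency difference comes down to coding. A quick check in Python (our illustration; the coding assignments, 8B/10B for the 10GBASE-X-derived parallel variants and 64B/66B for the 10GBASE-R-derived serial variants, are assumptions based on the 10GE heritage described above) shows why both families land on the same data bit rate:

```python
# Data bit rate = baud rate x coding efficiency.
# Parallel 10GFC: four 3.1875-GBaud lanes, 8B/10B coding (80% efficient).
# Serial 10GFC: one 10.51875-GBaud lane, 64B/66B coding (~97% efficient).

parallel_data = 4 * 3.1875 * (8 / 10)   # aggregate data bit rate, Gbps
serial_data = 10.51875 * (64 / 66)      # data bit rate, Gbps

print(f"parallel: {parallel_data:.1f} Gbps, serial: {serial_data:.1f} Gbps")
```

Both expressions yield the 10.2-Gbps data bit rate, which is why the serial variants need a lower total baud (10.51875 versus 12.75 GBaud) to carry the same data.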
Table 3-5 summarizes the baud, bit, and ULP throughput rates of the FC and 10GFC
variants.
Table 3-5  Baud, Bit, and ULP Throughput Rates of the FC and 10GFC Variants

FC Variant       Baud Rate          Raw Bit Rate    Data Bit Rate   FCP ULP Throughput
1 Gbps           1.0625 GBaud       1.0625 Gbps     850 Mbps        826.519 Mbps
2 Gbps           2.125 GBaud        2.125 Gbps      1.7 Gbps        1.65304 Gbps
4 Gbps           4.25 GBaud         4.25 Gbps       3.4 Gbps        3.30608 Gbps
10GFC Parallel   3.1875 GBaud x 4   12.75 Gbps      10.2 Gbps       9.91823 Gbps
10GFC Serial     10.51875 GBaud     10.51875 Gbps   10.2 Gbps       9.91823 Gbps
FC Topologies
FC supports all physical topologies, but protocol operations differ depending on the
topology. Protocol behavior is tailored to PTP, loop, and switch-based topologies. Like
Ethernet, Fibre Channel supports both shared media and switched topologies. A shared
media FC implementation is called a Fibre Channel Arbitrated Loop (FCAL), and a
switch-based FC implementation is called a fabric.
FC PTP connections are used for DAS deployments. Companies with SPI-based systems
that need higher throughput can upgrade to newer SPI equipment or migrate away from
SPI. The FC PTP topology allows companies to migrate away from SPI without migrating
away from the DAS model. This strategy allows companies to become comfortable with FC
technology in a controlled manner and offers investment protection of FC HBAs if and
when companies later decide to adopt FC-SANs. The FC PTP topology is considered
a niche.
Most FC switches support FCAL via special ports called fabric loop (FL) ports. Most FC
HBAs also support loop protocol operations. An HBA that supports loop protocol operations
is called a node loop port (NL_Port). Without support for loop protocol operations, a
port cannot join an FCAL. Each time a device joins an FCAL, an attached device resets,
or any link-level error occurs on the loop, the loop is reinitialized and all communication
is temporarily halted. This can cause problems for certain applications, such as tape
backup, but these problems can be mitigated through proper network design. Unlike
collisions in shared media Ethernet deployments, loop initialization generally occurs
infrequently. That said, overall FCAL performance can be adversely affected by
recurring initializations to such an extent that a fabric topology becomes a requirement.
The FCAL addressing scheme is different than fabric addressing and limits FCAL
deployments to 127 nodes (126 if not fabric attached). However, the shared medium of
an FCAL imposes a practical limit of approximately 18 nodes. FCAL was popular in the
early days of FC but has lost ground to FC switches in recent years. FCAL is still used
inside most JBOD chassis, and in some NAS filers, blade centers, and even enterprise-class
storage subsystems, but FCAL is now essentially a niche technology for embedded
applications.
Like Ethernet, FC switches can be interconnected in any manner. Unlike Ethernet, there is
a limit to the number of FC switches that can be interconnected. Address space constraints
limit FC-SANs to a maximum of 239 switches. Cisco's virtual SAN (VSAN) technology
increases the number of switches that can be physically interconnected by reusing the entire
FC address space within each VSAN. FC switches employ a routing protocol called fabric
shortest path first (FSPF) based on a link-state algorithm. FSPF reduces all physical
topologies to a logical tree topology. Most FC-SANs are deployed in one of two designs
commonly known as the core-only and core-edge designs. The core-only is a star topology,
and the core-edge is a two-tier tree topology. The FC community seems to prefer its own
terminology, but there is nothing novel about these two topologies other than their names.
Host-to-storage FC connections are usually redundant. However, single host-to-storage FC
connections are common in cluster and grid environments because host-based failover
mechanisms are inherent to such environments. In both the core-only and core-edge
designs, the redundant paths are usually not interconnected. The edge switches in the
core-edge design may be connected to both core switches, but doing so creates one physical
network and compromises resilience against network-wide disruptions (for example, FSPF
convergence). As FC-SANs proliferate, their size and complexity are likely to increase.
Advanced physical topologies eventually might become mandatory, but first, confidence in
FSPF and traffic engineering mechanisms must increase. Figures 3-12 and 3-13 illustrate
the typical FC-SAN topologies. The remaining chapters of this book assume a switch-based
topology for all FC discussions.
Figure 3-12 Dual Path Core-Only Topology
firmware. Hosts and storage subsystems send FCNS registration and discovery requests to
this WKA. Each FC switch processes FCNS requests for the nodes that are physically
attached. Each FC switch updates its local database with registration information about
locally attached devices, distributes updates to other FCNSs via inter-switch registered state
change notifications (SW_RSCNs), listens for SW_RSCNs from other FCNSs, and caches
information about non-local devices. FC nodes can optionally register to generate and
receive normal RSCNs. This is done via the state change registration (SCR) procedure. The
SCR procedure is optional, but practically all FC nodes register because notification of
changes is critical to proper operation of storage devices. Registration also enables FC nodes
to trigger RSCNs whenever their internal state changes (for example, a change occurs to one
of the FC-4 operational parameters). This is done via the RSCN Request procedure.
FC zones are similar to iSNS DDs and SLP scopes. Nodes can belong to one or more zones
simultaneously, and zone membership is additive. Zones are considered to be soft or hard.
Membership in a soft zone may be determined by node name, node port name, or node port
address. Membership in a hard zone is determined by the FC switch port to which a node
is connected. A default zone exists into which all nodes not explicitly assigned to at least
one named zone are placed. Nodes in the default zone cannot communicate with nodes in
other zones. However, communication among nodes within the default zone is optional.
The choice is implementation specific. Nodes belonging to one or more named zones are
allowed to discover only those nodes that are in at least one common zone. Management
nodes are allowed to query the entire FCNS database without consideration for zone
membership. The notion of a zone set is supported to improve manageability. Many zones
can be defined, but only those zones that belong to the currently active zone set are
considered active. Management nodes are able to configure zone sets, zones, and zone
membership via the FC zone server (FCZS). RSCNs are limited by zone boundaries. In
other words, only the nodes in the affected zone(s) are notified of a change.
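The discovery rules just described can be modeled simply. In this illustrative Python sketch (the function and data structure are ours, not from any FC specification), a node may discover another node only if the two share at least one zone in the active zone set, and default-zone behavior is left as a configurable choice because the standard makes it implementation specific:

```python
def discoverable(node_a, node_b, active_zones, default_zone_comm=False):
    """Return True if node_a may discover node_b under active zoning.

    active_zones: dict mapping zone name -> set of member node names.
    Nodes in no named zone fall into the default zone; whether they may
    communicate there is implementation specific (default_zone_comm).
    """
    zones_a = {z for z, members in active_zones.items() if node_a in members}
    zones_b = {z for z, members in active_zones.items() if node_b in members}
    if not zones_a and not zones_b:
        # Both nodes are in the default zone: communication is optional.
        return default_zone_comm
    # Named-zone members discover only nodes sharing at least one zone.
    return bool(zones_a & zones_b)

zones = {"zone1": {"host1", "array1"}, "zone2": {"host2", "array1"}}
assert discoverable("host1", "array1", zones)      # share zone1
assert not discoverable("host1", "host2", zones)   # no common zone
assert not discoverable("host3", "host4", zones)   # default zone, disabled here
```

Note that "array1" belongs to both zones, mirroring the additive membership described above: it is discoverable from both hosts, while the hosts remain hidden from each other.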
Summary
This chapter provides a high-level overview of the lower layer network protocols that
compose the SCSI architecture model (SAM) Interconnect technologies. Some historical
insight is provided to help readers understand why each technology works the way it does.
Key operational characteristics are presented to give readers a feel for each technology.
Continuing this approach, Chapter 4, "Overview of Modern SCSI Networking Protocols,"
provides a high-level overview of the upper-layer network protocols that compose the SAM
Transport technologies.
Review Questions
1 List the functions provided by bit-level encoding schemes.
2 What must be subtracted from the data bit rate to determine the ULP throughput rate?
3 List the six physical topologies.
4 How can a protocol that is not designed to operate in a circular physical topology
nodes?
18 Did FC evolve into a multi-protocol transport, or was it designed as such from the
beginning?
19 What is the ULP throughput rate of FCP on one Gbps FC?
20 FC protocol operations are defined for what topologies?
21 FC device discovery can be bounded by what mechanism?
Explain the purpose of each Upper Layer Protocol (ULP) commonly used in modern
storage networks
Describe the general procedures used by each ULP discussed in this chapter
CHAPTER 4
iSCSI
This section provides a brief introduction to the Internet Small Computer System Interface
(iSCSI) protocol.
iSCSI is an application protocol that relies on IP network service protocols for name
resolution (Domain Name System [DNS]), security (IPsec), flow control (TCP windowing),
service location (Service Location Protocol [SLP] and Internet Storage Name Service
[iSNS]), and so forth. This simplifies iSCSI implementation for product vendors by
eliminating the need to develop a solution to each network service requirement.
When the IETF first began developing the iSCSI protocol, concerns about the security of
IP networks prompted the IETF to require IPsec support in every iSCSI product. This
requirement was later deemed too burdensome considering the chip-level technology
available at that time. So, the IETF made IPsec support optional in the final iSCSI standard
(Request For Comments [RFC] 3720). IPsec is implemented at the OSI network layer and
complements the authentication mechanisms implemented in iSCSI. If IPsec is used in an
iSCSI deployment, it may be integrated into the iSCSI devices or provided via external
devices (such as IP routers). The iSCSI standard stipulates which specific IPsec features
must be supported if IPsec is integrated into the iSCSI devices. If IPsec is provided via
external devices, the feature requirements are not specified. This allows shared external
IPsec devices to be configured as needed to accommodate a wide variety of pass-through
protocols. Most iSCSI deployments currently do not use IPsec.
One of the primary design goals of iSCSI is to match the performance (subject to underlying
bandwidth) and functionality of existing SCSI Transport Protocols. As Chapter 3, "An
Overview of Network Operating Principles," discusses, the difference in underlying
bandwidth of iSCSI over Gigabit Ethernet (GE) versus Fibre Channel Protocol (FCP)
over 2-Gbps Fibre Channel (FC) is not as significant as many people believe. Another
oft-misunderstood fact is that very few 2-Gbps Fibre Channel Storage Area Networks
(FC-SANs) are fully utilized. These factors allow companies to build block-level storage
networks using a rich selection of mature IP/Ethernet infrastructure products at comparatively
low prices without sacrificing performance. Unfortunately, many storage and switch
vendors have propagated the myth that iSCSI can be used in only low-performance
environments. Compounding this myth is the cost advantage of iSCSI, which enables
cost-effective attachment of low-end servers to block-level storage networks. A low-end
server often costs about the same as a pair of fully functional FC Host Bus Adapters (HBAs)
required to provide redundant FC-SAN connectivity. Even with the recent introduction of
limited-functionality HBAs, FC attachment of low-end servers is difficult to cost-justify in
many cases. So, iSCSI is currently being adopted primarily for low-end servers that are not
SAN-attached. As large companies seek to extend the benefits of centralized storage to
low-end servers, they are considering iSCSI. Likewise, small businesses, which have
historically avoided FC-SANs altogether due to cost and complexity, are beginning to
deploy iSCSI networks.
That does not imply that iSCSI is simpler to deploy than FC, but many small businesses are
willing to accept the complexity of iSCSI in light of the cost savings. It is believed that
iSCSI (along with the other IP Storage [IPS] protocols) eventually can breathe new life into
the Storage Service Provider (SSP) market. In the SSP market, iSCSI enables secure
initiator access to centralized storage located at an SSP Internet Data Center (IDC) by
removing the distance limitations of FC. Despite the current adoption trend in low-end
environments, iSCSI is a very robust technology capable of supporting relatively
high-performance applications. As existing iSCSI products mature and additional iSCSI
products come to market, iSCSI adoption is likely to expand into high-performance
environments.
Even though some storage array vendors already offer iSCSI-enabled products, most
storage products do not currently support iSCSI. By contrast, iSCSI TCP Offload Engines
(TOEs) and iSCSI drivers for host operating systems are widely available today. This has
given rise to iSCSI gateway devices that convert iSCSI requests originating from hosts
(initiators) to FCP requests that FC attached storage devices (targets) can understand. The
current generation of iSCSI gateways is characterized by low port density devices designed
to aggregate multiple iSCSI hosts. Thus, the iSCSI TOE market has suffered from low
demand. As more storage array vendors introduce native iSCSI support in their products,
use of iSCSI gateway devices will become less necessary. In the long term, it is likely that
companies will deploy pure FC-SANs and pure iSCSI-based IP-SANs (see Figures 1-1 and
1-2, respectively) without iSCSI gateways, and that use of iSCSI TOEs will become
commonplace. That said, iSCSI gateways that add value other than mere protocol
conversion might remain a permanent component in the SANs of the future. Network-based
storage virtualization is a good example of the types of features that could extend the useful
life of iSCSI gateways. Figure 4-1 illustrates a hybrid SAN built with an iSCSI gateway
integrated into an FC switch. This deployment approach is common today.
Figure 4-1 Hybrid SAN Built with an iSCSI Gateway Integrated into an FC Switch
NAS filer evolution is the ability to use FC on the backend. A NAS filer is essentially an
optimized file server; therefore, the problems associated with the DAS model apply equally
to NAS filers and traditional servers. As NAS filers proliferate, the distributed storage that
is captive to individual NAS filers becomes costly and difficult to manage. Support for
FC on the backend allows NAS filers to leverage the FC-SAN infrastructure that many
companies already have. For those companies that do not currently have an FC-SAN, iSCSI
could be deployed as an alternative behind the NAS filers (subject to adoption of iSCSI by
the storage array vendors). Either way, using a block-level protocol behind NAS filers
enables very large-scale consolidation of NAS storage into block-level arrays. In the long
term, it is conceivable that all NAS protocols and iSCSI could be supported natively by
storage arrays, thus eliminating the need for an external NAS filer. Figure 4-2 illustrates the
model in which an iSCSI-enabled NAS filer is attached to an FC-SAN on the backend.
Figure 4-2 iSCSI-Enabled NAS Filer Attached to an FC-SAN on the Backend
with each successfully authenticated target. SCSI Logical Unit Number (LUN) discovery
is the final step. The semi-manual discovery method requires an additional intermediate
step. All iSCSI sessions are classified as either discovery or normal. A discovery session is
used exclusively for iSCSI target discovery. All other iSCSI tasks are accomplished using
normal sessions. Semi-manual configuration requires the host to establish a discovery
session with each iSCSI device. Target discovery is accomplished via the iSCSI SendTargets
command. The host then optionally authenticates each target within each iSCSI device.
Next, the host opens a normal iSCSI session with each successfully authenticated target and
performs SCSI LUN discovery. It is common for the discovery session to remain open
with each iSCSI device while normal sessions are open with each iSCSI target.
Each SCSI command is assigned an iSCSI Command Sequence Number (CmdSN). The
iSCSI CmdSN has no influence on packet tracking within the SCSI Interconnect. All
packets comprising SCSI commands, data, and status are tracked in flight via the TCP
sequence-numbering mechanism. TCP sequence numbers are directional and represent an
increasing byte count starting at the initial sequence number (ISN) specified during TCP
connection establishment. The TCP sequence number is not reset with each new iSCSI
CmdSN. There is no explicit mapping of iSCSI CmdSNs to TCP sequence numbers. iSCSI
complements the TCP sequence-numbering scheme with PDU sequence numbers. All
PDUs comprising SCSI commands, data, and status are tracked in flight via the iSCSI
CmdSN, Data Sequence Number (DataSN), and Status Sequence Number (StatSN),
respectively. This contrasts with the FCP model.
FCP
This section provides a brief introduction to FCP.
by reducing development overhead. Note that even OSI Session Layer login procedures are
defined by the FC specifications (not the FCP specifications). This contrasts with the IP
network model, in which each application protocol specifies its own OSI session layer
login procedures. That said, FC-4 protocols are not required to use the services defined by
the FC specifications.
There is a general perception that FC networks are inherently secure because they are
physically separate (cabled independently) from the Internet and corporate intranets. This
tenet is erroneous, but the FC user community is not likely to realize its error until FC
security breaches become commonplace. For example, the vast majority of hosts that are
attached to an FC-SAN are also attached to an IP network. If a host is compromised via
the IP network, the host becomes a springboard for the intruder to access the FC-SAN.
Moreover, FC-SANs are commonly extended across IP networks for disaster recovery
applications. Such SAN extensions expose FC-SANs to a wide variety of attacks commonly
perpetrated on IP networks. Authentication and encryption mechanisms are defined in the
FC specifications, so FCP does not define its own security mechanisms. Unlike iSCSI, no
security mechanisms are mandatory for FCP deployment. This fact and the misperception
about the nature of FC network security have resulted in the vast majority of FC-SANs
being deployed without any authentication or encryption.
Because FCP can be transported only by FC, the adoption rate of FCP is bound to the
adoption rate of FC. The adoption rate of FC was relatively low in the late 1990s, in part
because of the comparatively high cost of FC infrastructure components, which relegated
FCP to high-end servers hosting mission-critical applications. Around 2000, FC adoption reached critical mass in the high-end server market. Simultaneously, performance
improvements were being realized as companies began to view switched FC networks as
the best practice instead of Fibre Channel Arbitrated Loop (FC-AL). In the years following,
the adoption rate of switched FC increased dramatically. Consequently, FC prices began to
drop and still are dropping as the FC market expands and competition increases. Furthermore,
in response to competition from iSCSI, some FC HBA vendors have recently introduced
"light" versions of their HBAs that provide less functionality at a lower cost than traditional
FC HBAs. FCP is now being used by mid-range and high-end servers hosting a wide variety
of business applications.
In the traditional FC-SAN design, each host and storage device is dual-attached to the
network. (As previously noted, there are some exceptions to this guideline.) This is
primarily motivated by a desire to achieve 99.999 percent availability. Conventional
wisdom suggests that the network should be built with redundant switches that are not
interconnected to achieve 99.999 percent availability. In other words, the network is
actually two separate networks (commonly called path A and path B), and each end node
(host or storage) is connected to both networks. Some companies take the same approach
with their traditional IP/Ethernet networks, but most do not for reasons of cost. Because the
traditional FC-SAN design doubles the cost of network implementation, many companies
are actively seeking alternatives. Some companies are looking to iSCSI as the answer, and
others are considering single path FC-SANs. Figures 3-12 and 3-13 illustrate typical dual
path FC-SAN designs.
FCIP
This section provides a brief introduction to Fibre Channel Over TCP/IP (FCIP).
Figure 4-3
The two switches at the FCIP tunnel endpoints establish a standard FC inter-switch link
(ISL) through the tunnel. Essentially, the FCIP tunnel appears to the switches as a cable.
Each FCIP tunnel is created using one or more TCP connections. Multiple tunnels can be
established between a pair of FC-SANs to increase fault tolerance and performance. Each
tunnel carries a single FC ISL. Load balancing across multiple tunnels is accomplished via
FC mechanisms just as would be done across multiple FC ISLs in the absence of FCIP. As
mentioned in Chapter 3, "An Overview of Network Operating Principles," encapsulation is
accomplished per the Fibre Channel Frame Encapsulation (FC-FE) specification (RFC
3643). Eight bytes of the encapsulation header are used by each encapsulating protocol
(such as FCIP) to implement protocol-specific functionality. The remainder of the
encapsulation header is used for purposes common to all encapsulating protocols, such as
identifying the encapsulating protocol and enforcing FC timers end-to-end.
Connectivity failures within the transit IP network can disrupt FC operations. Obviously, a
circuit failure that results in FCIP tunnel failure will segment the FC-SAN and prevent
communication between the FC-SAN segments. Unfortunately, the effect of the disruption
is not limited to cross-tunnel traffic. Local connectivity is temporarily disrupted in each
FC-SAN segment while the FC routing protocol reconverges and Registered State Change
Notifications (RSCNs) are processed by end nodes. Additionally, one of the FC-SAN
segments must select a new principal switch (see Chapter 5, "The OSI Physical and
Data-Link Layers," for details). The local effect of routing protocol reconvergence and
principal switch selection can be eliminated via proprietary isolation techniques, but there
currently is no mechanism within the FCIP standard to isolate FC-SANs from IP
connectivity failures. This is generally considered to be the only significant drawback
of FCIP.
establishment. The FCIP endpoint that initiated the TCP connection (the tunnel initiator)
then transmits an FCIP Special Frame (FSF). The FSF contains the FC identifier and FCIP
endpoint identifier of the tunnel initiator, the FC identifier of the intended destination, and
a 64-bit randomly selected number that uniquely identifies the FSF. The receiver verifies
that the contents of the FSF match its local configuration. If the FSF contents are
acceptable, the unmodified FSF is echoed back to the tunnel initiator. After the tunnel
initiator receives and verifies the FSF, the FCIP tunnel may carry FC traffic.
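The FSF exchange amounts to an echo test carrying a nonce. The following Python sketch is our own simplification: the field names and the acceptance check are illustrative assumptions, not the actual FSF layout (which the FCIP standard defines and which is not reproduced here):

```python
import os

def make_fsf(src_fc_id, src_fcip_id, dst_fc_id):
    """Build a simplified FSF: identifiers plus a 64-bit random nonce."""
    return {
        "src_fc_id": src_fc_id,
        "src_fcip_id": src_fcip_id,
        "dst_fc_id": dst_fc_id,
        "nonce": int.from_bytes(os.urandom(8), "big"),  # uniquely identifies this FSF
    }

def accept_fsf(fsf, local_config):
    """Receiver: accept only if the FSF matches local configuration."""
    return (fsf["dst_fc_id"] == local_config["my_fc_id"]
            and fsf["src_fc_id"] in local_config["allowed_peers"])

# The tunnel initiator transmits an FSF; the receiver echoes it unmodified.
fsf = make_fsf("0xAB0000", "gw-a", "0xCD0000")
config = {"my_fc_id": "0xCD0000", "allowed_peers": {"0xAB0000"}}
echoed = dict(fsf) if accept_fsf(fsf, config) else None
assert echoed == fsf  # initiator verifies the echo; the tunnel may now carry FC traffic
```

The key property mirrored here is that the echo must come back unmodified, so the initiator can verify both reachability and configuration agreement before passing FC traffic.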
NOTE
The term identifier is used generically in this section. As discussed in Chapter 5, "The OSI
Physical and Data-Link Layers," the terms identifier and name each have specific meaning
throughout the subsequent chapters of this book.
A time stamp is inserted into the header of each FCIP packet transmitted. The receiver
checks the time stamp in each packet. If the time stamp indicates that the packet has been
in flight longer than allowed by the FC timers, the packet is dropped. TCP is not responsible
for retransmitting the dropped packet because the packet is dropped after TCP processing
completes. FCP and FC error detection and recovery mechanisms are responsible for
retransmitting the lost FC frame or notifying SCSI.
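The receiver-side check can be sketched as follows. This is an illustration only: the lifetime value and packet structure are our assumptions, standing in for the FC frame-lifetime limits the FCIP standard actually derives the timer from:

```python
import time

def deliver(packet, max_lifetime_s=2.0):
    """Drop any FCIP packet whose time stamp shows it exceeded the FC timers.

    max_lifetime_s stands in for the FC frame-lifetime limit. The drop
    happens after TCP processing completes, so TCP never retransmits the
    packet; FCP/FC error recovery must detect and retransmit the lost frame.
    """
    in_flight = time.time() - packet["timestamp"]
    if in_flight > max_lifetime_s:
        return None  # dropped: FC/FCP recovery (or SCSI notification) takes over
    return packet["fc_frame"]

fresh = {"timestamp": time.time(), "fc_frame": b"..."}
stale = {"timestamp": time.time() - 10.0, "fc_frame": b"..."}
assert deliver(fresh) == b"..."
assert deliver(stale) is None
```

The point of the sketch is the division of labor: the time-stamp check enforces FC timers end-to-end, while retransmission of anything it drops is deliberately left to the FC and FCP layers rather than TCP.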
iFCP
This section provides a brief introduction to Internet Fibre Channel Protocol (iFCP).
Figure 4-4
The motivation behind this approach is to reduce the total solution cost by leveraging
cost-effective IP-based technology and widely available IP skill sets, extend the reach of FC
attached devices beyond the FC limits, and enable the integration of FC and IP management
operations. Unfortunately, the cost savings are undermined by the requirement for end
nodes to be attached via FC. If the end nodes were attached via IP/Ethernet, the cost would
be lower, but the solution would closely resemble iSCSI. Because iSCSI was designed to
provide an IP/Ethernet alternative to FC-SANs, iSCSI provides a much more elegant and
cost-effective solution than iFCP. Another challenge for iFCP is that only one vendor
produces iFCP gateways today, and its iFCP products do not currently provide sufficient FC
port densities to accommodate the connectivity requirements of most modern FC-SANs.
So, iFCP gateways are usually deployed in conjunction with FC switches. Combined, these
factors relegate iFCP usage to FC-SAN interconnectivity. Thus, iFCP competes against
FCIP despite the original iFCP design goals. Figure 4-5 illustrates the current deployment
practice.
Figure 4-5
The remainder of this section focuses on FC-SAN interconnectivity. iFCP gateways can
operate in address transparency mode so that all FC devices share a single address space
across all connected FC-SANs. This mode allows IP network failures to disrupt the attached
FC-SANs just as FCIP does. For this reason, iFCP is rarely deployed in address transparency
mode, and iFCP gateway support for address transparency mode is optional. iFCP gateways
can also operate in address-translation mode. Devices in each FC-SAN communicate with
devices in other FC-SANs using FC addresses allocated from the local FC-SAN address
space. In this mode, the effect of IP network failures is mitigated. Each FC-SAN operates
autonomously, as does the IP network. Network services are provided to FC attached
devices via the FC switches. The state of FC network services in each FC-SAN must be
replicated to the iSNS for propagation to other FC-SANs. Support for address translation
mode is mandatory. Translation mode is customary in almost every deployment, so the
remainder of this section focuses on translation mode.
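Address-translation mode can be pictured as a per-gateway mapping table. In this Python sketch (the class, names, and address formats are ours, purely for illustration), a remote node is bound to an address allocated from the local FC-SAN's address space, so local devices address remote ones with ordinary local addresses:

```python
class TranslationTable:
    """Toy iFCP address-translation table (illustrative, not to spec)."""

    def __init__(self, local_pool):
        self.pool = list(local_pool)  # free addresses in the local FC-SAN space
        self.by_node = {}             # remote node name -> local proxy address

    def local_address_for(self, remote_node):
        """Allocate (once) a local-space address for a remote node."""
        if remote_node not in self.by_node:
            self.by_node[remote_node] = self.pool.pop(0)
        return self.by_node[remote_node]

table = TranslationTable(local_pool=["0x010101", "0x010102"])
addr = table.local_address_for("20:00:00:11:22:33:44:55")
assert addr == "0x010101"
# Repeated lookups for the same remote node return the same local address.
assert table.local_address_for("20:00:00:11:22:33:44:55") == addr
```

Because each FC-SAN allocates these proxy addresses from its own space, an IP network failure does not invalidate local addressing, which is the isolation property that makes translation mode preferable to transparency mode.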
iFCP operation is transparent to end nodes, and encapsulation is accomplished per the
FC-FE specification (RFC 3643). However, connectivity across the IP network is handled
differently than in FCIP. Instead of creating a single tunnel to carry all FC traffic, each iFCP
gateway creates a unique iFCP session to the appropriate destination iFCP gateway for
each initiator-target pair that needs to communicate. This model might work well in the
originally intended iFCP deployment scenario, but it can impede performance in the current
iFCP deployment scenario by limiting the size of the TCP window available to each iFCP
session. Two factors complicate this potential problem. First, iFCP sessions are created
dynamically in response to PLOGI requests and are gracefully terminated only in response
to LOGO requests. Second, PLOGI sessions are typically long-lived.
As with FCIP, IPsec support is mandatory for every iFCP product, and the iFCP standard
stipulates which specific IPsec features must be supported. That said, use of IPsec is
optional, and most iFCP deployments currently do not use IPsec. iFCP supports attachment
of FC-ALs to FC switches within each FC-SAN, but low-level FC-AL signals (primitives)
cannot traverse an iFCP session. This is not a problem because each iFCP gateway is
usually connected to an FC switch, so FC-AL primitives never enter the iFCP gateways.
iFCP gateways act like hosts on the IP network and do not require support for IP routing
protocols. Multiple iFCP gateways may be deployed in each FC-SAN to increase fault
tolerance and performance. Load balancing iFCP sessions across multiple iFCP gateways
is implementation-specific. iFCP supports all FC-4 protocols when operating in transparent
mode. However, it is possible for address translation mode to prevent certain FC-4
protocols from operating properly. Currently, iFCP is deployed only in FCP environments.
FCNS of each FC-SAN, the attached iFCP gateway propagates the registration information
to the iSNS. The iSNS then propagates the information to each of the other iFCP gateways.
Upon receipt, each iFCP gateway updates the FCNS of its attached FC-SAN with the
remote node information and creates an entry in its address translation table for the remote
node. At this point, the iFCP fabric is ready for initiator-target communication. TCP
connections can be handled in two ways. An iFCP gateway can proactively establish and
maintain multiple TCP connections to other gateways. These are called unbound TCP
connections. When a PLOGI request is received, the iFCP gateway creates an iFCP session
and binds it to one of the unbound TCP connections. Alternately, an iFCP gateway can
wait until it receives a PLOGI request and then establish a TCP connection immediately
followed by iFCP session establishment. While a TCP connection is bound to an iFCP
session, it cannot be used by other iFCP sessions. iFCP enforces FC frame lifetimes in the
same manner as FCIP. Likewise, detection and recovery of FC frames that are lost due to
timeout are handled by FCP and FC. Due to limited vendor support and a low adoption rate,
further examination of iFCP is currently outside the scope of this book.
Summary
This chapter provides a high-level overview of the upper-layer network protocols that
comprise the SAM Transport Protocols. The protocols reviewed are all standards and, with
the exception of iFCP, are commonly deployed in modern storage networks. The use of iSCSI
and FCP for initiator-target communication is discussed, as is the use of FCIP and iFCP
for long-distance FC-SAN connectivity across IP networks. The overviews in this chapter
complement the information provided in Chapter 3, "An Overview of Network Operating
Principles," in an effort to prime readers for the technical details provided in Part II, "The
OSI Layers." Part II begins by examining the details of operation at OSI Layers 1 and 2.
Review Questions
1 How does iSCSI complement the traditional storage over IP model?
2 Is iSCSI capable of supporting high-performance applications?
3 How does the FC network service model resemble the IP network service model?
4 What is the guiding principle in traditional FC network designs?
5 What does FCIP create between FC-SANs to facilitate communication?
6 Which FC-4 protocols does FCIP support?
7 Is iFCP currently being deployed for its originally intended purpose?
8 Which address mode is most commonly used in modern iFCP deployments?
PART II: OSI Layers (Chapters 5 through 8)
- Describe the physical layer characteristics of the SCSI parallel interface (SPI), Ethernet, and Fibre Channel (FC)
- Relate the addressing schemes used by the SPI, Ethernet, and FC to the SCSI architecture model (SAM) addressing scheme
- Differentiate the SPI, Ethernet, and FC mechanisms for name and address assignment and resolution
- List the frame formats of the SPI and deconstruct the frame formats of Ethernet and FC
- Enumerate and describe the delivery mechanisms supported by the SPI, Ethernet, and FC
- Delineate the physical, logical, and virtual network boundaries observed by the SPI, Ethernet, and FC
- Chronicle the stages of link initialization for the SPI, Ethernet, and FC
CHAPTER 5
Conceptual Underpinnings
Networking professionals understand some of the topics discussed in this chapter, but not
others. Before we discuss each network technology, we need to discuss some of the less
understood conceptual topics, to clarify terminology and elucidate key points. This section
provides foundational knowledge required to understand addressing schemes, address
formats, delivery mechanisms, and link aggregation.
Addressing Schemes
The SPI, Ethernet, IP, and Fibre Channel all use different addressing schemes. To provide
a consistent frame of reference, we discuss the addressing scheme defined by the SAM
in this section. As we discuss the addressing schemes of the SPI, Ethernet, IP, and Fibre
Channel subsequently, we will compare each one to the SAM addressing scheme. The
SAM defines four types of objects known as application client, logical unit, port, and
device. Of these, three are addressable: logical unit, port, and device.
The SCSI protocol implemented in an initiator is called a SCSI application client. A SCSI
application client can initiate only SCSI commands. No more than one SCSI application
client may be implemented per SCSI Transport Protocol within a SCSI initiator device.
Thus, no client ambiguity exists within an initiator device. This eliminates the need for
SCSI application client addresses.
The SCSI protocol implemented in a target is called a SCSI logical unit. A SCSI logical
unit can execute only SCSI commands. A SCSI target device may (and usually does) contain
more than one logical unit. So, logical units require addressing to facilitate proper forwarding
of incoming SCSI commands. A SCSI logical unit is a processing entity that represents
any hardware component capable of providing SCSI services to SCSI application clients.
Examples include a storage medium, an application-specific integrated circuit (ASIC) that
supports SCSI enclosure services (SES) to provide environmental monitoring services, a
robotic arm that supports SCSI media changer (SMC) services, and so forth. A SCSI logical
unit is composed of a task manager and a device server. The task manager is responsible for
queuing and managing the commands received from one or more SCSI application clients,
whereas the device server is responsible for executing SCSI commands.
SCSI ports facilitate communication between SCSI application clients and SCSI logical
units. A SCSI port consists of the hardware and software required to implement a SCSI
Transport Protocol and associated SCSI Interconnect. One notable exception is the
SPI, which does not implement a SCSI Transport Protocol.
A SCSI initiator device is composed of at least one SCSI port and at least one SCSI
application client. A SCSI target device consists of at least one SCSI port, one task router
per SCSI port, and at least one SCSI logical unit. Each task router directs incoming SCSI
commands to the task manager of the appropriate logical unit. An FC HBA or iSCSI TOE
is considered a SCSI device. This is somewhat confusing because the term device is
commonly used to generically refer to a host, storage subsystem, switch, or router. To avoid
confusion, we use the terms enclosure and network entity in the context of SCSI to describe
any host, storage subsystem, switch, or router that contains one or more SCSI devices.
A SCSI device often contains only a single SCSI port, but may contain more than one. For
example, most JBOD chassis in use today contain dual-port disk drives that implement a
single SCSI logical unit. Each disk drive is a single SCSI device with multiple ports.
Likewise, many intelligent storage arrays contain multiple SCSI ports and implement a
single SCSI device accessible via all ports. However, most multi-port FC HBAs in use
today implement a SCSI application client per port (that is, multiple single-port SCSI
devices). Many of the early iSCSI implementations are software-based to take advantage
of commodity Ethernet hardware. In such an implementation, a multi-homed host (the
network entity) typically contains a single SCSI device that consists of a single SCSI
software driver (the application client) bound to a single iSCSI software driver (the SCSI
Transport Protocol) that uses multiple IP addresses (the initiator ports making up the
SCSI Interconnect) that are assigned to multiple Ethernet NICs.
The SAM defines two types of addresses known as name and identifier. A name positively
identifies an object, and an identifier facilitates communication with an object. Names are
generally optional, and identifiers are generally mandatory. Names are implemented by
SCSI Interconnects and SCSI Transport Protocols, and identifiers are implemented only
by SCSI Interconnects. The SAM addressing rules are not simple, so a brief description of
the rules associated with each SAM object follows:
Device names are optional in the SAM. However, any particular SCSI Transport
Protocol may require each SCSI device to have a name. A device name never changes
and may be used to positively identify a SCSI device. A device name is useful for
determining whether a device is accessible via multiple ports. A device may be
assigned only one name within the scope of each SCSI Transport Protocol. Each
device name is globally unique within the scope of each SCSI Transport Protocol.
Each SCSI Transport Protocol defines its own device name format and length.
Device identifiers are not defined in the SAM. Because each SCSI device name is
associated with one or more SCSI port names, each of which is associated with a SCSI
port identifier, SCSI device identifiers are not required to facilitate communication.
Port names are optional in the SAM. However, any particular SCSI Transport Protocol
may require each SCSI port to have a name. A port name never changes and may be
used to positively identify a port in the context of dynamic port identiers. A port
may be assigned only one name within the scope of each SCSI Transport Protocol. Each
port name is globally unique within the scope of each SCSI Transport Protocol.
Each SCSI Transport Protocol defines its own port name format and length.
Port identifiers are mandatory. Port identifiers are used by SCSI Interconnect
technologies as source and destination addresses when forwarding frames or packets.
Each SCSI Interconnect defines its own port identifier format and length.
Logical unit names are optional in the SAM. However, any particular SCSI Transport
Protocol may require each SCSI logical unit to have a name. A logical unit name never
changes and may be used to positively identify a logical unit in the context of dynamic
logical unit identifiers. A logical unit name is also useful for determining whether
a logical unit has multiple identifiers. That is the case in multi-port storage arrays that
provide access to each logical unit via multiple ports simultaneously. A logical unit may
be assigned only one name within the scope of each SCSI Transport Protocol. Each
logical unit name is globally unique within the scope of each SCSI Transport Protocol.
Each SCSI Transport Protocol defines its own logical unit name format and length.
Logical unit identifiers are mandatory. A logical unit identifier is commonly called a
logical unit number (LUN). If a target device provides access to multiple logical units, a
unique LUN is assigned to each logical unit on each port. However, a logical unit may be
assigned a different LUN on each port through which the logical unit is accessed. In other
words, LUNs are unique only within the scope of a single target port. To accommodate
the wide range of LUN scale and complexity from a simple SPI bus to a SAN containing
enterprise-class storage subsystems, the SAM defines two types of LUNs known as flat
and hierarchical. Each SCSI Transport Protocol defines its own flat LUN format and
length. By contrast, the SAM defines the hierarchical LUN format and length. All SCSI
Transport Protocols that support hierarchical LUNs must use the SAM-defined format
and length. Up to four levels of hierarchy may be implemented, and each level may
use any one of four defined formats. Each level is 2 bytes long. The total length of a
hierarchical LUN is 8 bytes regardless of how many levels are used. Unused levels are
filled with null characters (binary zeros). Support for hierarchical LUNs is optional.
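The fixed 8-byte hierarchical LUN layout can be sketched in Python. This is a simplified illustration: the actual SAM format also encodes an addressing-method code within each 2-byte level, which is omitted here.

```python
def encode_hierarchical_lun(levels):
    """Pack up to four 2-byte LUN levels into the fixed 8-byte SAM
    hierarchical LUN, null-filling the unused levels (simplified; the
    real format also carries an addressing method per level)."""
    if len(levels) > 4:
        raise ValueError("at most four levels of hierarchy")
    lun = bytearray(8)                  # unused levels stay as binary zeros
    for i, level in enumerate(levels):
        lun[2 * i:2 * i + 2] = level.to_bytes(2, "big")
    return bytes(lun)

# A single-level LUN 5 occupies the first 2 bytes; the rest are nulls,
# so the total length is 8 bytes regardless of how many levels are used.
print(encode_hierarchical_lun([5]).hex())     # 0005000000000000
```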
In common conversation, the term LUN is often used synonymously with the terms disk
and volume. For example, one might hear the phrases "present the LUN to the host," "mount
the volume," and "partition the disk" all used to describe actions performed against the same
unit of storage.
SCSI logical unit numbering is quite intricate. Because LUNs do not facilitate identification
of nodes or ports, or forwarding of frames or packets, further details of the SAM
hierarchical LUN scheme are outside the scope of this book. For more information, readers
are encouraged to consult the ANSI T10 SAM-3 specification and Annex C of the original
ANSI T10 FCP specification. A simplified depiction of the SAM addressing scheme is
shown in Figure 5-1. Only two levels of LUN hierarchy are depicted.
Figure 5-1 SAM Addressing Scheme (a host containing an FC HBA, a Fibre Channel switch, and a multi-port storage array; SPI buses attach magnetic disk drives and tape drives, each presenting LUN 0, with device names, port names, and port identifiers shown at each SAM port)
catalog to map volume tags to content identifiers (for example, backup set names). An
initiator can send a read element status command to the logical unit that represents the
robotic arm (called the media transport element) to discover the volume tag at each element
address. The initiator can then use element addresses to move media cartridges via the
move medium command. After the robotic arm loads the specified media cartridge into the
specified drive, normal I/O can occur in which the application (for example, a tape backup
program) reads from or writes to the medium using the drive's LUN to access the medium,
and the medium's LBA scheme to navigate the medium. Element and barcode addressing
are not part of the SAM LUN scheme. The element and barcode addressing schemes are
both required and are both complementary to the LUN scheme. Further details of physical
element addressing, barcode addressing, and media changers are outside the scope of this
book. For more information, readers are encouraged to consult the ANSI T10 SMC-2
specification. A simplified depiction of media changer element addressing is shown in
Figure 5-2 using tape media.
Figure 5-2 Media Changer Element Addressing (a medium transport element [robotic arm], storage elements, an import/export element, and data transfer elements [tape drives], each tape drive addressed by its own LUN, connected via a shared backplane)
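A toy Python model of the element addressing just described: the initiator discovers volume tags with read element status, then moves a cartridge into a drive with move medium. All element addresses, volume tags, and method names here are hypothetical illustrations, not SMC-2 encodings.

```python
class MediaChanger:
    def __init__(self):
        # Element address -> volume tag (None = empty), as an application
        # might cache it after issuing a read element status command.
        self.elements = {
            0x1000: "TAPE001",   # storage element
            0x1001: "TAPE002",   # storage element
            0x0100: None,        # data transfer element (tape drive)
        }

    def read_element_status(self):
        # Report the volume tag found at each element address.
        return dict(self.elements)

    def move_medium(self, source, destination):
        # Model the move medium command executed by the robotic arm
        # (the medium transport element).
        if self.elements[source] is None:
            raise ValueError("source element is empty")
        if self.elements[destination] is not None:
            raise ValueError("destination element is occupied")
        self.elements[destination] = self.elements[source]
        self.elements[source] = None

mc = MediaChanger()
status = mc.read_element_status()
src = next(addr for addr, tag in status.items() if tag == "TAPE002")
mc.move_medium(src, 0x0100)      # load TAPE002 into the tape drive
```

After the move, normal I/O against the drive's LUN can begin; the element addresses play no further role until the cartridge is unloaded.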
Address Formats
Several address formats are in use today. All address formats used by modern storage
networks are specified by standards organizations. In the context of addressing schemes, a
standards body that defines an address format is called a network address authority (NAA)
even though the standards body might be engaged in many activities outside the scope of
addressing. Some network protocols use specified bit positions in the address field of the
frame or packet header to identify the NAA and format. This enables the use of multiple
address formats via a single address field. The most commonly used address formats
include the IEEE MAC-48 and EUI-64 formats, IPv4 addresses, iSCSI qualified names
(IQNs), world wide names (WWNs), and FC address identifiers.
The IEEE formats are used for a broad range of purposes, so a brief description of each
IEEE format is provided in this section. For a full description of each IEEE format,
readers are encouraged to consult the IEEE 802-2001 specification and the IEEE 64-bit
Global Identifier Format Tutorial. Descriptions of the various implementations of the
IEEE formats appear throughout this chapter and Chapter 8, "OSI Session, Presentation, and
Application Layers." Description of IPv4 addressing is deferred to Chapter 6, "OSI
Network Layer." Note that IPv6 addresses can be used by IPS protocols, but IPv4
addresses are most commonly implemented today. Thus, IPv6 addressing is currently
outside the scope of this book. Description of iSCSI qualified names (IQNs) is
deferred to Chapter 8, "OSI Session, Presentation, and Application Layers." Descriptions
of world wide names (WWNs) and FC address identifiers follow in the FC section of this
chapter.
The IEEE 48-bit media access control (MAC-48) format is a 48-bit address format
that guarantees universally unique addresses in most scenarios. The MAC-48 format
supports locally assigned addresses, which are not universally unique, but such usage is
uncommon. The MAC-48 format originally was defined to identify physical elements
such as LAN interfaces, but its use was later expanded to identify LAN protocols and
other non-physical entities. When used to identify non-physical entities, the format is
called the 48-bit extended unique identifier (EUI-48). MAC-48 and EUI-48 addresses
are expressed in dash-separated hexadecimal notation such as 00-02-8A-9F-52-95.
Figure 5-3 illustrates the IEEE MAC-48 address format.
Figure 5-3 IEEE MAC-48 Address Format (a 24-bit OUI, containing the U/L and I/G bits, followed by a 24-bit Extension Identifier)
Embedded within the first byte of the OUI is a bit called the universal/local (U/L) bit.
The U/L bit indicates whether the Extension Identifier is universally administered by
the organization that produced the LAN interface or locally administered by the
company that deployed the LAN interface.
Embedded within the first byte of the OUI is a bit called the individual/group (I/G)
bit. The I/G bit indicates whether the MAC-48 address is an individual address used
for unicast frames or a group address used for multicast frames.
The Extension Identifier field, which is three bytes long, identifies each LAN
interface. Each organization manages the Extension Identifier values associated
with its OUI. During the interface manufacturing process, the U/L bit is set to 0 to
indicate that the Extension Identifier field contains a universally unique value
assigned by the manufacturer. The U/L bit can be set to 1 by a network administrator
via the NIC driver parameters. This allows a network administrator to assign an
Extension Identifier value according to a particular addressing scheme. In this
scenario, each LAN interface address may be a duplicate of one or more other LAN
interface addresses. This duplication is not a problem if no duplicate addresses exist
on a single LAN. That said, local administration of Extension Identifiers is rare. So,
the remainder of this book treats all MAC-48 Extension Identifiers as universally
unique addresses.
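The OUI, Extension Identifier, and the I/G and U/L bits can be extracted from a MAC-48 address with a short Python sketch. The helper name and return structure are illustrative; the bit positions follow IEEE 802 (I/G is the least significant bit of the first octet, U/L the next bit up).

```python
def parse_mac48(addr):
    """Decode the OUI, Extension Identifier, and the I/G and U/L bits of
    a MAC-48 address given in dash-separated hexadecimal notation."""
    octets = bytes(int(part, 16) for part in addr.split("-"))
    return {
        "oui": octets[:3].hex("-"),
        "extension_identifier": octets[3:].hex("-"),
        # I/G: 0 = individual (unicast), 1 = group (multicast).
        "group": bool(octets[0] & 0x01),
        # U/L: 0 = universally administered, 1 = locally administered.
        "local": bool(octets[0] & 0x02),
    }

info = parse_mac48("00-02-8A-9F-52-95")   # the example address in the text
# info["oui"] is "00-02-8a"; both the I/G and U/L bits are 0, so this is a
# universally administered individual (unicast) address.
```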
The growing number of devices that require a MAC-48 address prompted the IEEE to
define a new address format called EUI-64. The EUI-64 format is a 64-bit universally
unique address format used for physical network elements and for non-physical entities.
Like the MAC-48 format, the EUI-64 format supports locally assigned addresses, which are
not universally unique, but such usage is uncommon. MAC-48 addresses are still supported
but are no longer promoted by the IEEE. Instead, vendors are encouraged to use the
EUI-64 format for all new devices and protocols. A mapping is defined by the IEEE for use
of MAC-48 and EUI-48 addresses within EUI-64 addresses. EUI-64 addresses are
expressed in hyphen-separated hexadecimal notation such as 00-02-8A-FF-FF-9F-52-95.
Figure 5-4 illustrates the IEEE EUI-64 address format.
Figure 5-4 IEEE EUI-64 Address Format (a 24-bit OUI, containing the U/L and I/G bits, followed by a 40-bit Extension Identifier)
OUI: identical in format and usage to the OUI field in the MAC-48 address
format.
U/L bit: identical in usage to the U/L bit in the MAC-48 address format.
I/G bit: identical in usage to the I/G bit in the MAC-48 address format.
Extension Identifier: identical in purpose to the Extension Identifier field in the
MAC-48 address format. However, the length is increased from 3 bytes to 5 bytes.
The first two bytes can be used to map MAC-48 and EUI-48 Extension Identifier
values into the remaining three bytes. Alternately, the first 2 bytes can be concatenated
with the last 3 bytes to yield a 5-byte Extension Identifier for new devices and
protocols. Local administration of Extension Identifiers is rare. So, the remainder of
this book treats all EUI-64 Extension Identifiers as universally unique addresses.
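The IEEE mapping mentioned above, which yields EUI-64 addresses such as 00-02-8A-FF-FF-9F-52-95, can be sketched as follows. This follows the text's example of an encapsulated MAC-48 address (labeled FF-FF); per the IEEE tutorial, encapsulated EUI-48 addresses use the label FF-FE instead.

```python
def mac48_to_eui64(mac48):
    """Map a MAC-48 address into an EUI-64 by inserting the FF-FF label
    between the OUI and the Extension Identifier (encapsulated EUI-48
    addresses use FF-FE instead)."""
    octets = mac48.split("-")
    return "-".join(octets[:3] + ["FF", "FF"] + octets[3:])

print(mac48_to_eui64("00-02-8A-9F-52-95"))
# 00-02-8A-FF-FF-9F-52-95  (matches the example EUI-64 in the text)
```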
Note that the U/L and I/G bits are rarely used in modern storage networks. In fact, some FC
address formats omit these bits. However, omission of these bits from FC addresses has
no effect on FC-SANs.
Delivery Mechanisms
Delivery mechanisms such as acknowledgement, frame/packet reordering, and error
notification vary from one network technology to the next. Network technologies are
generally classified as connection-oriented or connectionless depending on the suite of
delivery mechanisms employed. However, these terms are not well defined. Confusion
can result from assumed meanings when these terms are applied to disparate network
technologies because their meanings vary significantly depending on the context.
The SAM does not explicitly require all of these delivery guarantees, but the SAM does
assume error-free delivery of SCSI requests or responses. How the SCSI Transport Protocol
or SCSI Interconnect accomplishes error-free delivery is self-determined by each protocol
suite. Client notification of delivery failure is explicitly required by the SAM. That implies
a requirement for detection of failures within the SCSI Transport Protocol layer, the SCSI
Interconnect layer, or both. A brief discussion of each delivery mechanism follows.
- Buffer overrun
- No route to the destination
- Frame/packet corruption
- Intra-switch forwarding timeout
- Fragmentation required but not permitted
- Administrative routing policy
- Administrative security policy
- Quality of Service (QoS) policy
- Transient protocol error
- Bug in switch or router software
For these reasons, no network protocol can guarantee that frames or packets will never be
dropped. However, a network protocol may guarantee to detect drops. Upon detection, the
protocol may optionally request retransmission of the dropped frame or packet. Because
this requires additional buffering in the transmitting device, it is uncommon in network
devices. However, this is commonly implemented in end nodes. Another option is for the
protocol that detects the dropped frame or packet to notify the ULP. In this case, the ULP
may request retransmission of the frame or packet, or simply notify the next ULP of the
drop. If the series of upward notications continues until the application is notied, the
application must request retransmission of the lost data. A third option is for the protocol
that detects the dropped frame or packet to take no action. In this case, one of the ULPs
or the application must detect the drop via a timeout mechanism.
delivery. Assume also that there are multiple paths between host A and host B. When load
balancing is employed within the network, and one path is congested while the other is not,
some packets will arrive at host B while others are delayed in transit. Host B might have
a timer to detect dropped packets, and that timer might expire before all delayed packets
are received. If so, host B may request retransmission of some packets from host A. When
host A retransmits the requested packets, duplicate packets eventually arrive at host B.
Various actions may be taken when a duplicate frame or packet arrives at a destination.
The duplicate can be transparently discarded, discarded with notication to the ULP, or
delivered to the ULP.
Acknowledgement
Acknowledgement provides notification of delivery success or failure. You can implement
acknowledgement as positive or negative, and as explicit or implicit. Positive acknowledgement
involves signaling from receiver to transmitter when frames or packets are successfully
received. The received frames or packets are identified in the acknowledgement frame or
packet (usually called an ACK). This is also a form of explicit acknowledgement. Negative
acknowledgement is a bit more complicated. When a frame or packet is received before
all frames or packets with lower identities are received, a negative acknowledgement frame
or packet (called a NACK) may be sent to the transmitter for the frames or packets
with lower identities, which are assumed missing. A receiver timeout value is usually
implemented to allow delivery of missing frames or packets prior to NACK transmission.
NACKs may be sent under other circumstances, such as when a frame or packet is received
but determined to be corrupt. With explicit acknowledgement, each frame or packet that is
successfully received is identified in an ACK, or each frame or packet that is dropped is
identified in a NACK. Implicit acknowledgement can be implemented in several ways.
For example, a single ACK may imply the receipt of all frames or packets up to the frame
or packet identied in the ACK. A retransmission timeout value is an example of implicit
NACK.
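Cumulative (implicit) acknowledgement can be illustrated with a toy transmitter model in Python. This is not any particular protocol's behavior; it simply shows how one ACK carrying sequence number N releases every buffered frame up to and including N.

```python
class Transmitter:
    def __init__(self):
        self.unacked = {}    # sequence number -> frame held for retransmit

    def send(self, seq, frame):
        # Frames stay buffered until evidence of delivery arrives.
        self.unacked[seq] = frame

    def receive_ack(self, ack_seq):
        # Cumulative ACK: receipt of everything up to ack_seq is implied,
        # so all of those frames can be released from the buffer.
        for seq in list(self.unacked):
            if seq <= ack_seq:
                del self.unacked[seq]

tx = Transmitter()
for seq in range(1, 6):
    tx.send(seq, f"frame-{seq}")
tx.receive_ack(3)                 # one ACK covers frames 1, 2, and 3
print(sorted(tx.unacked))         # [4, 5] still await acknowledgement
```

If no ACK covering frames 4 and 5 arrives before the retransmission timer expires, the timeout acts as the implicit NACK described above.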
The SAM does not require explicit acknowledgement of delivery. The SCSI protocol
expects a response for each command, so frame/packet delivery failure eventually will
generate an implicit negative acknowledgement via SCSI timeout. The SAM assumes
that delivery failures are detected within the service delivery subsystem, but places no
requirements upon the service delivery subsystem to take action upon detection of a
delivery failure.
Guaranteed Delivery
Guaranteed delivery requires retransmission of every frame or packet that is dropped. Some
form of frame/packet acknowledgement is required for guaranteed delivery, even if it is
implicit. Frames or packets must be held in the transmitter's memory until evidence of
successful delivery is received. Protocols that support retransmission typically impose a
limit on the number of retransmission attempts. If the limit is reached before successful
transmission, the ULP is usually notied.
The networking industry uses the phrases quality of service and class of service
inconsistently. Whereas quality of service generally refers to queuing policies based on
traffic prioritization, some networking technologies use class of service to convey this
meaning. Ethernet falls into this category, as some Ethernet documentation refers to
Class of Service instead of Quality of Service. Other networking technologies use class
of service to convey the set of delivery mechanisms employed. FC falls into this
category.
Guaranteed Bandwidth
Circuit-switching technologies like that used by the PSTN inherently dedicate end-to-end
link bandwidth to the connected end nodes. The drawback of this model is inefficient use
of available bandwidth within the network. Packet-switching technologies seek to optimize
use of bandwidth by sharing links within the network. The drawback of this model is
that some end nodes might be starved of bandwidth or might be allotted insufficient
bandwidth to sustain acceptable application performance. Thus, many packet-switching
technologies support bandwidth reservation schemes that allow end-to-end partial or
full-link bandwidth to be dedicated to individual traffic flows or specific node pairs.
Guaranteed Latency
All circuit-switching technologies inherently provide consistent latency for the duration of
a connection. Some circuit-switching technologies support transparent failover at OSI
Layer 1 in the event of a circuit failure. The new connection might have higher or lower
latency than the original connection, but the latency will be consistent for the duration of
the new connection. Packet-switching technologies do not inherently provide consistent
latency. Circuit emulation service, if supported, can guarantee consistent latency through
packet-switched networks. Without circuit-emulation services, consistent latency can be
achieved in packet-switching networks via proper network design and the use of QoS
mechanisms.
In-order Delivery
If there is only one path between each pair of nodes, in-order delivery is inherently
guaranteed by the network. When multiple paths exist between node pairs, network routing
algorithms can suppress all but one link and optionally use the suppressed link(s) as backup
in the event of primary link failure. Alternately, network routing algorithms can use all links
in a load-balanced fashion. In this scenario, frames can arrive out of order unless measures
are taken to ensure in-order delivery. In-order delivery can be defined as the receipt of
frames at the destination node in the same order as they were transmitted by the source
node. Alternately, in-order delivery can be defined as delivery of data to a specific protocol
layer within the destination node in the same order as they were transmitted by the same
protocol layer within the source node. In both of these definitions, in-order delivery applies
to a single source and a single destination. The order of frames or packets arriving at a
destination from one source relative to frames or packets arriving at the same destination
from another source is not addressed by the SAM and is generally considered benign with
regard to data integrity.
In the first definition of in-order delivery, the network must employ special algorithms or
special configurations for normal algorithms to ensure that load-balanced links do not result
in frames arriving at the destination out of order. There are four levels of granularity for load
balancing in serial networking technologies: frame level, flow level, node level, and
network level.
Frame-level load balancing spreads individual frames across the available links and makes
no effort to ensure in-order delivery of frames. Flow-level load balancing spreads individual
flows across the available links. All frames within a flow follow the same link, thus ensuring
in-order delivery of frames within each flow. However, in-order delivery of flows is not
guaranteed. It is possible for frames in one flow to arrive ahead of frames in another flow
without respect to the order in which the flows were transmitted. Some protocols map each
I/O operation to a uniquely identifiable flow. In doing so, these protocols enable I/O
operation load balancing. If the order of I/O operations must be preserved, node-level load
balancing must be used.
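Flow-level load balancing is commonly implemented by hashing flow-identifying fields to select a link, which can be sketched as follows. The link names and the CRC32 hash are arbitrary illustrative choices; real switches use their own hash functions and field selections.

```python
import zlib

LINKS = ["link-0", "link-1"]

def pick_link(src, dst, flow_id):
    # A deterministic hash of flow-identifying fields selects the link, so
    # every frame of a flow follows the same link and stays in order.
    key = f"{src}|{dst}|{flow_id}".encode()
    return LINKS[zlib.crc32(key) % len(LINKS)]

# Every frame of a given flow maps to one link (in-order within the flow)...
assert len({pick_link("nodeA", "nodeB", "flow-1") for _ in range(5)}) == 1
# ...while different flows between the same nodes may land on different
# links, so ordering between flows is not guaranteed.
link1 = pick_link("nodeA", "nodeB", "flow-1")
link2 = pick_link("nodeA", "nodeB", "flow-2")
```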
Node-level load balancing spreads node-to-node connections across the available links,
thus ensuring that all frames within all flows within each connection traverse the same link.
Multiple simultaneous connections may exist between a source and destination. Node-level
load balancing forwards all such connections over the same link. In this manner, node-level
load balancing ensures in-order delivery of all frames exchanged between each pair of
nodes. Node-level load balancing can be configured for groups of nodes at the network level
by effecting a single routing policy for an entire subnet. This is typically (but not always)
the manner in which IP routing protocols are configured. For example, a network
administrator who wants to disable load balancing usually does so for a routing protocol
(affecting all subnets reachable by that protocol) or for a specific subnet (affecting all nodes
on that single subnet). In doing so, the network administrator forces all traffic destined for
the affected subnet(s) to follow a single link. This has the same effect on frame delivery as
implementing a node-level load-balancing algorithm, but without the benefits of load
balancing.
Conversely, network-level load balancing can negate the intended effects of node-level
algorithms that might be configured on a subset of intermediate links in the end-to-end
path. To illustrate this, you need only to consider the default behavior of most IP routing
protocols, which permit equal-cost path load balancing. Assume that node A transmits two
frames. Assume also that no intervening frames destined for the same subnet are received
by node A's default gateway. If the default gateway has two equal-cost paths to the
destination subnet, it will transmit the first frame via the first path and the second frame via
the second path. That action could result in out-of-order frame delivery. Now assume that the
destination subnet is a large Ethernet network with port channels between each pair of
switches. If a node-level load-balancing algorithm is configured on the port channels, the
frames received at each switch will be delivered across each port channel with order fidelity,
but could still arrive at the destination node out of order. Thus, network administrators must
consider the behavior of all load-balancing algorithms in the end-to-end path. That raises
an important point: network-level load balancing is not accomplished with a special
algorithm, but rather with the algorithm embedded in the routing protocol. Also remember
that all load-balancing algorithms are employed hop-by-hop.
In the second definition of in-order delivery, the network makes no attempt to ensure
in-order delivery in the presence of load-balanced links. Frames or packets may arrive out of
order at the destination node. Thus, one of the network protocols operating within the
destination node must support frame/packet reordering to ensure data integrity.
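Receive-side reordering can be sketched with a toy model that holds out-of-order frames until the gap is filled, then delivers them to the ULP in sequence (illustrative Python; the class and field names are hypothetical).

```python
class Reorderer:
    def __init__(self):
        self.expected = 1        # next sequence number owed to the ULP
        self.held = {}           # out-of-order frames awaiting delivery
        self.delivered = []      # what the ULP has received, in order

    def receive(self, seq, frame):
        self.held[seq] = frame
        # Deliver every consecutive frame now available, in sequence.
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1

rx = Reorderer()
rx.receive(2, "B")               # held: frame 1 is still missing
rx.receive(1, "A")               # fills the gap, releasing both in order
rx.receive(3, "C")
print(rx.delivered)              # ['A', 'B', 'C']
```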
The SAM does not explicitly require in-order delivery of frames or packets composing a
SCSI request/response. However, because the integrity of a SCSI request/response
depends upon in-order delivery of its constituent frames, the SAM implicitly requires
the SCSI service delivery subsystem (including the protocols implemented within the
end nodes) to provide in-order delivery of frames. The nature of some historical SCSI
Interconnects, such as the SPI, inherently provides in-order delivery of all frames. With
frame-switched networks, such as FC and Ethernet, frames can be delivered out of order.
Therefore, when designing and implementing modern storage networks, take care to
ensure in-order frame delivery. Note that in-order delivery may be guaranteed without a
guarantee of delivery. In this scenario, some data might be lost in transit, but all data
arriving at the destination node will be delivered to the application in order. If the network
does not provide notication of non-delivery, delivery failures must be detected by an
ULP or the application.
Similarly, the SAM does not require initiators or targets to support reordering of SCSI
requests or responses, or the service delivery subsystem to provide in-order delivery of
SCSI requests or responses. The SAM considers such details to be implementation-specific.
Some applications are insensitive to the order in which SCSI requests or
responses are processed. Conversely, some applications fail or generate errors when SCSI
requests or responses are not processed in the desired order. If an application requires
in-order processing of SCSI requests or responses, the initiator can control SCSI command
execution via task attributes and queue algorithm modifiers (see ANSI T10 SPC-3). In
doing so, the order of SCSI responses is also controlled. Or, the application might expect
the SCSI delivery subsystem to provide in-order delivery of SCSI requests or responses.
Such applications might exist in any storage network, so storage network administrators
typically err on the side of caution and assume the presence of such applications. Thus, modern
storage networks are typically designed and implemented to provide in-order delivery of
SCSI requests and responses. Some SCSI transport protocols support SCSI request/
response reordering to facilitate the use of parallel transmission techniques within end
nodes. This loosens the restrictions on storage network design and implementation
practices.
Reordering should not be confused with reassembly. Reordering merely implies the
order of frame/packet receipt at the destination node does not determine the order of
delivery to the application. If a frame or packet is received out of order, it is held until
the missing frame or packet is received. At that time, the frames/packets are delivered
to the ULP in the proper order. Reassembly requires frames/packets to be held in a
buffer until all frames/packets that compose an upper layer protocol data unit (PDU)
have been received. The PDU is then reassembled, and the entire PDU is delivered to
the ULP. Moreover, reassembly does not inherently imply reordering. A protocol that
supports reassembly may discard all received frames/packets of a given PDU upon
receipt of an out-of-order frame/packet belonging to the same PDU. In this case, the
ULP must retransmit the entire PDU. In the context of fragmentation, lower-layer
protocols within the end nodes typically support reassembly. In the context of
segmentation, higher layer protocols within the end nodes typically support
reassembly.
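The distinction between reordering and reassembly can be sketched with a toy reordering buffer in Python (illustrative only; real protocol implementations track sequence numbers per connection or exchange):

```python
class ReorderBuffer:
    """Toy reordering buffer: frames are released to the ULP strictly in
    sequence order. A frame that arrives ahead of a missing frame is held
    until the gap fills. Note this is reordering, not reassembly: frames
    are delivered individually, not held until an entire PDU is complete."""

    def __init__(self):
        self.next_seq = 0   # next sequence number owed to the ULP
        self.held = {}      # out-of-order frames awaiting delivery

    def receive(self, seq, frame):
        self.held[seq] = frame
        delivered = []
        while self.next_seq in self.held:
            delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return delivered    # frames now deliverable, in order
```

A reassembly buffer, by contrast, would return nothing until every frame of the PDU had arrived, and would then deliver the reassembled PDU as a single unit.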
Conceptual Underpinnings
Link Aggregation
Terminology related to link aggregation varies in the storage networking industry.
Some FC switch vendors refer to their proprietary link aggregation feature as trunking.
However, trunking is well defined in Ethernet environments as the tagging of frames
transmitted on an inter-switch link (ISL) to indicate the virtual LAN (VLAN) membership
of each frame. With the advent of virtual SAN (VSAN) technology for FC networks,
common sense dictates consistent use of the term trunking in both Ethernet and FC
environments.
NOTE
One FC switch vendor uses the term trunking to describe a proprietary load-balancing
feature implemented via the FC routing protocol. This exacerbates the confusion surrounding
the term trunking.
By contrast, link aggregation is the bundling of two or more physical links so that the
links appear to ULPs as one logical link. Link aggregation is accomplished within
the OSI data-link layer, and the resulting logical link is properly called a port channel.
Cisco Systems invented Ethernet port channels, which were standardized in March 2000
by the IEEE via the 802.3ad specification. Standardization enabled interoperability
in heterogeneous Ethernet networks. Aggregation of FC links is not yet standardized.
Thus, link aggregation must be deployed with caution in heterogeneous FC
networks.
Transceivers
Transceivers can be integrated or pluggable. An integrated transceiver is built into the
network interface on a line card, HBA, TOE, or NIC such that it cannot be replaced if it
fails. This means the entire interface must be replaced if the transceiver fails. For switch
line cards, this implication can be very problematic because an entire line card must be
replaced to return a single port to service when a transceiver fails. Also, the type of
connector used to mate a cable to an interface is determined by the transceiver. So, the
types of cable that can be used by an interface with an integrated transceiver are limited.
An example of an integrated transceiver is the traditional 10/100 Ethernet NIC, which
has an RJ-45 connector built into it providing cable access to the integrated electrical
transceiver.
By contrast, a pluggable transceiver incorporates all required transmit/receive
componentry onto a removable device. The removable device can be plugged into an
interface receptacle without powering down the device containing the interface
(hot-pluggable). This allows the replacement of a failed transceiver without removal of the
network interface. For switch line cards, this enables increased uptime by eliminating
the need to replace an entire line card when a single transceiver fails.
A GBIC or SFP can operate at any transmission rate. The rate is specified in the MSA.
Some MSAs specify multi-rate transceivers. Typically, GBICs and SFPs are not used for
rates below 1 Gbps. Any pluggable transceiver that has a name beginning with the letter X
operates at 10 Gbps. The currently available 10-Gbps pluggable transceivers include
XENPAK, X2, XPAK, and XFP.
XENPAK is the oldest 10-Gbps MSA. The X2 and XPAK MSAs build upon the XENPAK
MSA. Both X2 and XPAK use the XENPAK electrical specification. XFP incorporates a
completely unique design.
Table 5-1  SPI Connectors and Operating Ranges

SPI Version | Connectors                                                                                              | Operating Range (m)
SPI-2       | 50-pin unshielded, 68-pin unshielded, 80-pin unshielded, 50-pin shielded, 68-pin shielded               | 1.5–25
SPI-3       | 50-pin unshielded, 68-pin unshielded, 80-pin unshielded, PCB backplane, 50-pin shielded, 68-pin shielded | 1.5–25
SPI-4       | 50-pin unshielded, 68-pin unshielded, 80-pin unshielded, PCB backplane, 50-pin shielded, 68-pin shielded | 1.5–25
SPI-5       | 50-pin unshielded, 68-pin unshielded, 80-pin unshielded, PCB backplane, 50-pin shielded, 68-pin shielded | 2–25
Priority | SCSI ID
9        | 15
10       | 14
11       | 13
12       | 12
13       | 11
14       | 10
15       | 9
16       | 8
NOTE
It is possible for multiple initiators to be connected to a single SPI bus. Such a configuration
is used for clustering solutions, in which multiple hosts need simultaneous access to a
common set of LUNs. However, most SPI bus deployments are single-initiator configurations.
There are two methods of arbitration: normal arbitration, and quick arbitration and
selection (QAS). Normal arbitration must be supported by every device, but QAS support
is optional. QAS can be negotiated between pairs of devices, thus allowing each device
to use normal arbitration to communicate with some devices and QAS arbitration to
communicate with other devices. This enables simultaneous support of QAS-enabled
devices and non-QAS devices on a single SPI bus.
Using normal arbitration, priority determines which device gains access to the bus if more
than one device simultaneously requests access. Each device that loses an arbitration
attempt simply retries at the next arbitration interval. Each device continues retrying until
no higher priority devices simultaneously arbitrate, at which point the lower priority device
can transmit. This can result in starvation of low-priority devices. So, an optional fairness
algorithm is supported to prevent starvation. When fairness is used, each device maintains
a record of arbitration attempts. Higher priority devices are allowed to access the bus first,
but are restrained from arbitrating again until lower priority devices that lost previous
arbitration attempts are allowed to access the bus.
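The effect of the fairness algorithm can be suggested with a toy Python model (this models only the winner-selection logic, not the SPI's signal-level arbitration procedure):

```python
def run_cycles(devices, n_cycles, fairness):
    """Toy model of repeated arbitration among devices that all
    re-arbitrate every cycle. `devices` holds priorities. Without
    fairness, the highest priority always wins (starving the rest).
    With fairness, a winner is blocked from winning again until
    every other contender has had the bus."""
    wins, blocked = [], set()
    for _ in range(n_cycles):
        eligible = [d for d in devices if d not in blocked]
        if not eligible:          # round complete; start a new round
            blocked.clear()
            eligible = list(devices)
        winner = max(eligible)    # highest priority among eligible
        wins.append(winner)
        if fairness:
            blocked.add(winner)   # winner defers to earlier losers
    return wins
```

Running the model with two devices of priorities 7 and 3 shows starvation without fairness (priority 7 wins every cycle) and alternation with fairness.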
The intricacies of this arbitration model can be illustrated with an analogy. Suppose that a
director of a corporation goes to the company's cafeteria for lunch and arrives at the cashier
line at precisely the same time as a vice president. The director must allow the vice
president to go ahead based on rank. Now suppose that the corporation's president arrives.
The president must get in line behind the director despite the president's superior rank.
Also, the director may not turn around to schmooze with the president (despite the
director's innate desire to schmooze upwardly). Thus, the director cannot offer to let the
president go ahead. However, the director can step out of line (for example, to swap a
chocolate whole milk for a plain low-fat milk). If the director is out of line long enough for
the cashier to service one or more patrons, the director must re-enter the line behind all
other employees.
QAS is essentially normal arbitration with a streamlined method for detecting when the
bus is available for a new arbitration attempt. Normal arbitration requires the SPI bus to
transition into the BUS FREE phase before a new arbitration attempt can occur. QAS
allows a QAS-enabled device to take control of the bus from another QAS-enabled device
without changing to BUS FREE. To prevent starvation of non-QAS-enabled devices, the
initiator can arbitrate via QAS and, upon winning, force a BUS FREE transition to occur.
Normal arbitration must then be used by all devices for the ensuing arbitration cycle. In this
respect, the convention of assigning the highest priority to the initiator allows the initiator
to police the bus. The fairness algorithm is mandatory for all QAS-enabled devices when
using QAS, but it is optional for QAS-enabled devices during normal arbitration.
It is not possible for frames to be dropped on an SPI bus by any device other than the
receiver. This is because there are no intermediate processing devices. Electrical
signals pass unmodified through each device attached to an SPI bus between the
source and destination devices. If the receiver drops a frame for any reason, the sender
must be notified.

When a device drives a signal on an SPI bus, the destination device reads the signal in
real time. Thus, it is not possible for duplicate frames to arrive at a destination.

Corrupt data frames are detected via the parity signal or the CRC field. Corrupt frames
are immediately dropped by the receiver, and the sender is notified.

Devices attached to an SPI bus are not required to retransmit dropped frames. Upon
notification of frame drop, a sender may choose to retransmit the dropped frames
or abort the delivery request. If the retransmission limit is reached or the delivery
request is aborted, the ULP (SCSI) is notified.

The SPI supports flow control in different ways depending on the data transfer mode
(asynchronous, synchronous, or paced). Flow control is negotiated between each
initiator/target pair. In all cases, the flow-control mechanism is proactive.
The SPI does not provide guaranteed bandwidth. While a device is transmitting, the
full bandwidth of the bus is available. However, the full bandwidth of the bus must
be shared between all connected devices. So, each device has access to the bus less
frequently as the number of connected devices increases. Thus, the effective
bandwidth available to each device decreases as the number of connected devices
increases. The fairness algorithm also plays a role. Without fairness, high-priority
devices are allowed access to the bus more frequently than low-priority devices.
The SPI intrinsically supports in-order delivery. Because all devices are connected to
a single physical medium, and because no device on an SPI bus buffers frames for
other devices, it is impossible for frames to be delivered out of order.
Fragmentation cannot occur on the SPI bus because all devices are always connected
via a single network segment.
impressive that the ANSI T10 subcommittee defined a means to aggregate parallel links.
The limited distance and single-segment nature of the SPI simplified some aspects of the
SPI link aggregation scheme, which made ANSI's job a bit easier. The SPI 32-bit data
bus signals were spread across two parallel links to load-balance at the byte level.
The parallel nature of the SPI made byte-level load balancing possible. The SPI 32-bit
data bus was made obsolete by SPI-3. No subsequent SPI specification defines a new
link-aggregation technique.
Ethernet
This section explores the details of Ethernet operation. Because Ethernet has long been
sufficiently stable to operate as a plug-and-play technology, it is assumed by many to be
a simple technology. In fact, the inner workings of Ethernet are quite intricate. Ethernet
is a very mature technology. It is considered the switching technology of choice for
almost every network environment. However, IPS protocols are relatively immature,
so Ethernet is trailing FC market share in block-level storage environments. As IPS
protocols mature, additional IPS products will come to market, and Ethernet will gain
market share in block-level storage environments. Thus, it is important to understand
Ethernet's inner workings.
Table 5-4  GE Media, Connectors, Transceivers, and Operating Ranges

GE Variant  | Medium      | Modal Bandwidth | Connectors   | Transceiver  | Operating Range (m)
1000BASE-LX | 9 μm SMF    | N/A             | Duplex SC    | 1310nm laser | 2–5000
1000BASE-LX | 50 μm MMF   | 500 MHz*km      | Duplex SC    | 1310nm laser | 2–550
1000BASE-LX | 50 μm MMF   | 400 MHz*km      | Duplex SC    | 1310nm laser | 2–550
1000BASE-LX | 62.5 μm MMF | 500 MHz*km      | Duplex SC    | 1310nm laser | 2–550
1000BASE-SX | 50 μm MMF   | 500 MHz*km      | Duplex SC    | 850nm laser  | 2–550
1000BASE-SX | 50 μm MMF   | 400 MHz*km      | Duplex SC    | 850nm laser  | 2–500
1000BASE-SX | 62.5 μm MMF | 200 MHz*km      | Duplex SC    | 850nm laser  | 2–275
1000BASE-SX | 62.5 μm MMF | 160 MHz*km      | Duplex SC    | 850nm laser  | 2–220
1000BASE-T  | Cat 5 UTP   | N/A             | RJ-45        | Electrical   | 0–100
1000BASE-CX | Twinax      | N/A             | DB-9, HSSDC  | Electrical   | 0–25
The MT-RJ and LC fiber-optic connectors are not listed in Table 5-4 because they are not
specified in IEEE 802.3-2002. However, both are quite popular, and both are supported
by most GE switch vendors. Many transceiver vendors offer 1000BASE-LX-compliant
GBICs that exceed the optical requirements specified in 802.3-2002. These transceivers
are called 1000BASE-LH GBICs. They typically support a maximum distance of 10 km.
Another non-standard transceiver, 1000BASE-ZX, has gained significant popularity.
1000BASE-ZX uses a 1550nm laser instead of the standard 1310nm laser. The
1000BASE-ZX operating range varies by vendor because it is not standardized, but the
upper limit is typically 70–100 km.
Table 5-5 summarizes the media, connectors, transceivers, and operating ranges that
are specified in IEEE 802.3ae-2002 and 802.3ak-2004. The nomenclature used to
represent each defined 10GE implementation is [data rate expressed in bps concatenated
with the word BASE]-[transceiver designator concatenated with encoding
designator].
Table 5-5  10GE Media, Connectors, Transceivers, and Operating Ranges

10GE Variant | Medium      | Modal Bandwidth | Connectors       | Transceiver             | Operating Range (m)
10GBASE-EW   | 9 μm SMF    | N/A             | Unspecified      | 1550nm laser            | 2–40k*
10GBASE-EW   | 9 μm SMF    | N/A             | Unspecified      | 1550nm laser            | 2–30k
10GBASE-ER   | 9 μm SMF    | N/A             | Unspecified      | 1550nm laser            | 2–40k*
10GBASE-ER   | 9 μm SMF    | N/A             | Unspecified      | 1550nm laser            | 2–30k
10GBASE-LW   | 9 μm SMF    | N/A             | Unspecified      | 1310nm laser            | 2–10k
10GBASE-LR   | 9 μm SMF    | N/A             | Unspecified      | 1310nm laser            | 2–10k
10GBASE-LX4  | 9 μm SMF    | N/A             | Unspecified      | 1269–1356nm CWDM lasers | 2–10k
10GBASE-LX4  | 50 μm MMF   | 500 MHz*km      | Unspecified      | 1269–1356nm CWDM lasers | 2–300
10GBASE-LX4  | 50 μm MMF   | 400 MHz*km      | Unspecified      | 1269–1356nm CWDM lasers | 2–240
10GBASE-LX4  | 62.5 μm MMF | 500 MHz*km      | Unspecified      | 1269–1356nm CWDM lasers | 2–300
10GBASE-SW   | 50 μm MMF   | 2000 MHz*km     | Unspecified      | 850nm laser             | 2–300
10GBASE-SW   | 50 μm MMF   | 500 MHz*km      | Unspecified      | 850nm laser             | 2–82
10GBASE-SW   | 50 μm MMF   | 400 MHz*km      | Unspecified      | 850nm laser             | 2–66
10GBASE-SW   | 62.5 μm MMF | 200 MHz*km      | Unspecified      | 850nm laser             | 2–33
10GBASE-SW   | 62.5 μm MMF | 160 MHz*km      | Unspecified      | 850nm laser             | 2–26
10GBASE-SR   | 50 μm MMF   | 2000 MHz*km     | Unspecified      | 850nm laser             | 2–300
10GBASE-SR   | 50 μm MMF   | 500 MHz*km      | Unspecified      | 850nm laser             | 2–82
10GBASE-SR   | 50 μm MMF   | 400 MHz*km      | Unspecified      | 850nm laser             | 2–66
10GBASE-SR   | 62.5 μm MMF | 200 MHz*km      | Unspecified      | 850nm laser             | 2–33
10GBASE-SR   | 62.5 μm MMF | 160 MHz*km      | Unspecified      | 850nm laser             | 2–26
10GBASE-CX4  | Twinax      | N/A             | IEC 61076-3-113  | Electrical              | 0–15

* Requires an engineered link.
Though IEEE 802.3ae-2002 does not specify which connectors may be used, the duplex SC
style is supported by many 10GE switch vendors because the XENPAK, X2, and XPAK
MSAs specify duplex SC. The XFP MSA supports several different connectors, including
duplex SC. Note that 10GBASE-EW and 10GBASE-ER links that are longer than 30km
are considered engineered links and must provide better attenuation characteristics than
normal SMF links.
Variant     | Encoding Scheme | BER Objective
1000BASE-LX | 8B/10B          | 10^-12
1000BASE-SX | 8B/10B          | 10^-12
1000BASE-T  | 8B1Q4           | 10^-10
1000BASE-CX | 8B/10B          | 10^-12
10GBASE-EW  | 64B/66B         | 10^-12
10GBASE-ER  | 64B/66B         | 10^-12
10GBASE-LW  | 64B/66B         | 10^-12
10GBASE-LR  | 64B/66B         | 10^-12
10GBASE-LX4 | 8B/10B          | 10^-12
10GBASE-SW  | 64B/66B         | 10^-12
10GBASE-SR  | 64B/66B         | 10^-12
10GBASE-CX4 | 8B/10B          | 10^-12
The 8B/10B encoding scheme generates 10-bit characters from 8-bit characters. Each
10-bit character is categorized as data or control. Control characters are used to indicate
the start of control frames. Control frames can be fixed or variable length. Control frames
can contain control and data characters. The set of characters in each control frame must
be in a specific order to convey a specific meaning. Thus, control frames are called
ordered sets.
Fiber-based implementations of GE use the 8B/10B encoding scheme. GE uses only five of
the control characters defined by the 8B/10B encoding scheme. These control characters
are denoted as K23.7, K27.7, K28.5, K29.7, and K30.7. GE uses variable-length ordered
sets consisting of one, two, or four characters. GE defines eight ordered sets. Two ordered
sets are used for auto-negotiation of link parameters between adjacent devices. These are
called Configuration ordered sets and are denoted as /C1/ and /C2/. Each is four characters
in length consisting of one specified control character followed by one specified data
character followed by two variable data characters. The last two data characters represent
device configuration parameters. Two ordered sets are used as fillers when no data frames
are being transmitted. These are called Idle ordered sets. They are denoted as /I1/ and /I2/,
and each is two characters in length. Idles are transmitted in the absence of data traffic to
maintain clock synchronization. The remaining four ordered sets are each one character in
length and are used to delimit data frames, maintain inter-frame spacing, and propagate
error information. These include the start_of_packet delimiter (SPD) denoted as /S/,
end_of_packet delimiter (EPD) denoted as /T/, carrier_extend denoted as /R/, and
error_propagation denoted as /V/.
Copper-based implementations of GE use the 8B1Q4 encoding scheme. The 8B1Q4
encoding scheme is more complex than the 8B/10B encoding scheme. Eight data bits
are converted to a set of four symbols, which are transmitted simultaneously using a
quinary electrical signal. The individual symbols are not categorized as data or control,
but each four-symbol set is. There are 31 four-symbol sets designated as control sets.
These are used to delimit data frames, maintain inter-frame spacing, and propagate error
information. Like 8B/10B implementations of GE, 8B1Q4 implementations support
auto-negotiation of link parameters between adjacent devices. This is accomplished via
the fast link pulse (FLP). The FLP is not a four-symbol set, but it is defined at OSI Layer
1, and it does have ordered bit positions. The FLP consists of 33 bit positions containing
alternating clock and data bits: 17 clock bits and 16 data bits. The FLP data bits convey
device capabilities.
Some 10GE implementations use 8B/10B encoding but do so differently than GE.
The following definitions and rules apply to CWDM and parallel implementations.
10GE uses seven control characters denoted as K27.7, K28.0, K28.3, K28.4, K28.5,
K29.7, and K30.7. With the exception of K30.7, these are used to identify ordered
sets. The K30.7 control character is used for error control and may be transmitted
independently. 10GE implementations based on 8B/10B use 10 fixed-length ordered
sets consisting of four characters. Three ordered sets are defined to maintain clock
synchronization, maintain inter-frame spacing, and align parallel lanes. These are
collectively classified as Idle and include Sync Column denoted as ||K||, Skip Column
denoted as ||R||, and Align Column denoted as ||A||. Five ordered sets are defined to
delimit data frames. These are collectively classified as Encapsulation and include Start
Column denoted as ||S||, Terminate Column in Lane 0 denoted as ||T0||, Terminate Column
in Lane 1 denoted as ||T1||, Terminate Column in Lane 2 denoted as ||T2||, and Terminate
Column in Lane 3 denoted as ||T3||. Two ordered sets are defined to communicate
link-status information. These include Local Fault denoted as ||LF|| and Remote Fault
denoted as ||RF||.
Serial implementations of 10GE use the 64B/66B encoding scheme. The 64B/66B
encoding scheme generates a 64-bit block from two 32-bit words received from the
10-Gigabit Media Independent Interface (XGMII). Two bits are prepended to each 64-bit
block to indicate whether the block is a data block or a control block. Data blocks contain
only data characters. Control blocks can contain control and data characters. There are 15
formats for control blocks. The first byte of each control block indicates the format of the
block and is called the block type field. The remaining seven bytes of each control block
are filled with a combination of 8-bit data characters, 7-bit control characters, 4-bit control
characters, and single-bit null character fields.
There are two 7-bit control characters: Idle and Error. These are used to maintain
inter-frame spacing, maintain clock synchronization, adapt clock rates, and propagate error
information. There is one 4-bit control character: the Sequence ordered set character
denoted as /Q/. 10GE ordered sets are embedded in control blocks. Each ordered set is
fixed length and consists of a single 4-bit control character followed or preceded by three
8-bit data characters. The Sequence ordered set is used to adapt clock rates. One other
ordered set is defined, but it is not used. The null character fields are interpreted as Start or
Terminate control characters, which delimit data frames. The value of the block type field
implies that a frame delimiter is present and conveys the position of the null character fields.
This eliminates the need for explicit coding of information in the actual Start and Terminate
control characters. In fact, these control characters are completely omitted from some
frame-delimiting control blocks.
Further details of each encoding scheme are outside the scope of this book. The 8B/10B
encoding scheme is well documented in clause 36 of the IEEE 802.3-2002 specification and
clause 48 of the IEEE 802.3ae-2002 specification. The 8B1Q4 encoding scheme is well
documented in clause 40 of the IEEE 802.3-2002 specification. The 64B/66B encoding
scheme is well documented in clause 49 of the IEEE 802.3ae-2002 specification.
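The sync-header check at the heart of 64B/66B can be sketched in Python (assuming the standard header values of 01 for data blocks and 10 for control blocks, and modeling a 66-bit block as a Python integer with the sync header in the two most significant bits):

```python
def classify_66b_block(block):
    """Classify a 66-bit 64B/66B block by its 2-bit sync header.
    The block is modeled as a Python integer with the sync header in
    the two most significant bits (a representation choice for this
    sketch). Sync 01 marks a data block; sync 10 marks a control block
    whose first payload byte is the block type field; 00 and 11 are
    invalid."""
    sync = block >> 64
    if sync == 0b01:
        return "data"
    if sync == 0b10:
        block_type = (block >> 56) & 0xFF  # first byte after the header
        return f"control (block type field 0x{block_type:02X})"
    return "invalid"
```

Because only 01 and 10 are legal sync headers, a single bit error in the header always produces an invalid block, which aids error detection.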
only one pair of devices. It is possible for a node to negotiate half-duplex mode when
connected to a switch, but this suboptimal condition typically is corrected by the network
administrator as soon as it is discovered. Collision-free line-rate performance is achievable
if a switched Ethernet network is designed as such. This book does not discuss CSMA/CD
in depth because modern storage networks built on Ethernet are switched.
(Figure: collision domains and attenuation domains. Computers attached via an Ethernet
hub share a single collision domain; segments attached to an Ethernet switch or router
experience no collisions, and each physical link forms a separate attenuation domain.)
An Ethernet network also can have virtual boundaries. The IEEE 802.1Q-2003 specification
defines a method for implementing multiple VLANs within a single physical LAN. In the
simplest scenario, each switch port is statically assigned to a single VLAN by the network
administrator. As frames enter a switch from an end node, the switch prepends a tag to
indicate the VLAN membership of the ingress port (known as the Port VLAN Identifier
(PVID)). The tag remains intact until the frame reaches the egress switch port that connects
the destination end node. The switch removes the tag and transmits the frame to the
destination end node. Ethernet switches use PVIDs to ensure that no frames are forwarded
between VLANs. Thus, VLAN boundaries mimic physical LAN boundaries. User data can
be forwarded between VLANs only via OSI Layer 3 entities.
Note that the PVID can be assigned dynamically via the Generic Attribute Registration
Protocol (GARP) VLAN Registration Protocol (GVRP). When GVRP is used, the PVID is
typically determined by the MAC address of the end node attached to the switch port, but
other classifiers are permitted. GVRP allows end nodes to be mobile while ensuring that
each end node is always assigned to the same VLAN regardless of where the end node
attaches to the network. Note also that a switch port can belong to multiple VLANs if the
switch supports VLAN trunking as specified in IEEE 802.1Q-2003. This is most commonly
used on ISLs, but some NICs support VLAN trunking. An end node using an 802.1Q-enabled
NIC may use a single MAC address in all VLANs or a unique MAC address in each
VLAN. In the interest of MAC address conservation, some 802.1Q-enabled NICs use a
single MAC address in all VLANs. This method allows NIC vendors to allocate only
one MAC address to each 802.1Q-enabled NIC. For these end nodes, GVRP cannot be
configured to use the MAC address as the PVID classifier. Also, switch vendors must take
special measures to forward frames correctly in the presence of this type of end node. These
are the same measures required in environments where a host operating system advertises
a single MAC address on all NICs installed in a multihomed host. An end node using an
802.1Q-enabled NIC may not forward frames between VLANs except via an OSI Layer 3
process.
(Figure: IEEE 802.3-2002 frame format. Field lengths in bytes: Preamble (7), Start of
Frame Delimiter (1), Destination Address (6), Source Address (6), Length/Type (2),
Data/Pad (46–1500), Frame Check Sequence (4).)
The Preamble and Start of Frame Delimiter are not considered part of the actual frame.
These fields are discussed in this section for the sake of completeness. A brief description
of each field follows:

Preamble: 7 bytes long and contains seven repetitions of the sequence 10101010.
This field is used by the receiver to achieve steady-state synchronization.

Start of Frame Delimiter (SFD): 1 byte long and contains the sequence 10101011,
which indicates the start of a frame.

Destination Address (DA): 6 bytes long and indicates the node(s) that should accept
and process the frame. The DA field may contain an individual, multicast, or broadcast
address.

Source Address (SA): 6 bytes long and indicates the transmitting node. The SA field
may only contain an individual address.
Length/Type: 2 bytes long and has two possible interpretations. If the numeric value
is equal to or less than 1500, this field is interpreted as the length of the Data/Pad field
expressed in bytes. If the numeric value is greater than or equal to 1536, this field
is interpreted as the Ethertype. For jumbo frames, which are not yet standardized,
this field must specify the Ethertype to be compliant with the existing rules of
interpretation.

Data/Pad: variable in length and contains either ULP data or pad bytes. If no ULP
data is transmitted, or if insufficient ULP data is transmitted to meet the minimum
frame size requirement, this field is padded. The format of pad bytes is not specified.
Minimum frame size requirements stem from CSMA/CD, but these requirements
still apply to full-duplex communication for backward compatibility.

Frame Check Sequence (FCS): 4 bytes long and contains a CRC value. This value is
computed on the DA, SA, Length/Type, and Data/Pad fields.
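The Length/Type rule can be illustrated with a short Python sketch (the helper name and return shape are hypothetical; the FCS is extracted but not validated):

```python
def parse_8023_frame(frame):
    """Slice an Ethernet frame (Preamble and SFD already stripped) into
    its fields and apply the Length/Type rule: a value of 1500 or less
    is a length, 1536 or greater is an Ethertype, and values between
    1501 and 1535 are undefined."""
    da, sa = frame[0:6], frame[6:12]
    length_type = int.from_bytes(frame[12:14], "big")
    if length_type <= 1500:
        meaning = ("length", length_type)
    elif length_type >= 1536:
        meaning = ("ethertype", length_type)
    else:
        meaning = ("undefined", length_type)
    data_pad, fcs = frame[14:-4], frame[-4:]
    return da, sa, meaning, data_pad, fcs
```

For example, a frame whose Length/Type field holds 0x0800 (2048) carries an Ethertype, while a value of 100 would be read as the length of the Data/Pad field.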
The other variation of the basic frame format is the Ethernet II frame format. Most Ethernet
networks continue to use the Ethernet II frame format. The only differences between the
Ethernet II format and the 802.3-2002 format are the SFD field and the Length/Type field.
In the Ethernet II format, the recurring preamble bit pattern continues for eight bytes and
is immediately followed by the DA field. The Ethernet II format does not support the length
interpretation of the Length/Type field, so the field is called Type. Figure 5-7 illustrates
the Ethernet II frame format.
Figure 5-7 Ethernet II Frame Format. Field lengths in bytes: Preamble (8), Destination
Address (6), Source Address (6), Type (2), Data/Pad (46–1500), Frame Check Sequence (4).
When the IEEE first standardized Ethernet, the Length/Type field could only be interpreted
as length. A mechanism was needed to facilitate ULP multiplexing while maintaining backward
compatibility with Ethernet II. So, an optional subheader was defined. The current version
is specified in IEEE 802.2-1998. This subheader embodies the data component of the
Logical Link Control (LLC) sublayer. This subheader is required only when the 802.3-2002
frame format is used and the Length/Type field specifies the length of the data field. When
present, this subheader occupies the first three or four bytes of the Data/Pad field and
therefore reduces the maximum amount of ULP data that the frame can transport. Figure 5-8
illustrates the IEEE 802.2-1998 subheader format.
Figure 5-8 IEEE 802.2-1998 Subheader Format. Field lengths in bytes: Destination Service
Access Point (1), Source Service Access Point (1), Control (1 or 2).
Destination Service Access Point (DSAP): 1 byte long. It indicates the ULP(s) that
should accept and process the frame's ULP data.

Source Service Access Point (SSAP): 1 byte long. It indicates the ULP that
transmitted the ULP data.
Like Ethertypes, service access points (SAPs) are administered by the IEEE to ensure
global uniqueness. Because the Type field in the Ethernet II header is 16 bits, the 8-bit
DSAP field in the LLC subheader cannot accommodate as many ULPs. So, another
optional subheader was defined by the IETF via RFC 1042 and was later incorporated into
the IEEE 802 Overview and Architecture specification. Referred to as the Sub-Network
Access Protocol (SNAP), this subheader is required only when the 802.3-2002 frame
format is used, the Length/Type field specifies the length of the data field, the 802.2-1998
subheader is present, and the ULP is not an IEEE-registered SAP. When present, this
subheader follows a 3-byte LLC subheader and occupies an additional 5 bytes of the Data/
Pad field. Thus, the maximum amount of ULP data that the frame can transport is further
reduced. The DSAP and SSAP fields of the LLC subheader each must contain the value
0xAA or 0xAB, and the CTL field must contain the value 0x03 to indicate that the SNAP
subheader follows. The two fields of the SNAP subheader are sometimes collectively called
the Protocol Identifier (PID) field. Figure 5-9 illustrates the IEEE 802-2001 subheader
format.
Figure 5-9 IEEE 802-2001 SNAP Subheader Format. Field lengths in bytes: OUI (3),
EtherType (2).
OUI: 3 bytes long. It contains the IEEE-assigned identifier of the organization that
created the ULP.
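The LLC/SNAP layering described above can be sketched as a Python parser (illustrative only; it assumes a well-formed payload and a 1-byte Control field):

```python
def parse_llc_snap(payload):
    """Inspect the start of an 802.3 Data/Pad field for the LLC
    subheader (DSAP, SSAP, CTL) and, when DSAP/SSAP are 0xAA or 0xAB
    and CTL is 0x03, the 5-byte SNAP subheader (3-byte OUI plus
    2-byte EtherType)."""
    dsap, ssap, ctl = payload[0], payload[1], payload[2]
    if dsap in (0xAA, 0xAB) and ssap in (0xAA, 0xAB) and ctl == 0x03:
        oui = payload[3:6]
        ethertype = int.from_bytes(payload[6:8], "big")
        return {"llc": (dsap, ssap, ctl),
                "snap": (oui, ethertype),
                "ulp_data": payload[8:]}
    return {"llc": (dsap, ssap, ctl), "ulp_data": payload[3:]}
```

The sketch makes the overhead visible: a SNAP-encapsulated frame gives up 8 bytes of the Data/Pad field (3 for LLC, 5 for SNAP) before any ULP data begins.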
In shared media environments, frames of different formats can traverse a shared link.
However, each Ethernet interface is normally configured to use only one frame format. All
devices using a given frame format can communicate, but they are isolated from all devices
using other frame formats. When a device receives a frame of a different format, the frame
is not understood and is dropped. One notable exception is a protocol analyzer that can
support promiscuous mode. Promiscuous mode enables a device to transmit and receive
all frame formats simultaneously. In switched environments, a similar phenomenon of
isolation occurs. Each switch port must be configured to use only one frame format. Each
end node must use the same frame format as the switch port to which it is attached. When
a switch forwards multicast and broadcast traffic, only those switch ports using the same
frame format as the source node can transmit the frame without translation. All other switch
ports must translate the frame format or drop the frame. Translation of every frame can
impose unacceptable performance penalties on a switch, and translation is not always
possible. For example, some Ethernet II frames cannot be translated to LLC format in the
absence of the SNAP subheader. So, Ethernet switches do not translate frame formats.
(VLAN trunking ports are a special case.) Thus, Ethernet switches drop frames when the
frame format of the egress port does not match the frame format of the source node. This
prevents ARP and other protocols from working properly and results in groups of devices
becoming isolated. For this reason, most Ethernet networks employ a single frame format
on all switch ports and attached devices.
As previously stated, VLANs require each frame sent between switches to be tagged to
indicate the VLAN ID of the transmitting node. This prevents frames from being improperly
delivered across VLAN boundaries. There are two frame formats for Ethernet trunking: the
IEEE's 802.1Q-2003 format and Cisco Systems' proprietary ISL format. Today, most
Ethernet networks use the 802.1Q-2003 frame format, which was first standardized in
1998. So, Cisco Systems' proprietary frame format is not discussed herein. Figure 5-10
illustrates the IEEE 802.1Q-2003 frame format.
Figure 5-10 IEEE 802.1Q-2003 Frame Format
The tagged frame comprises the following fields (lengths in bytes): Preamble (7), Start of
Frame Delimiter (1), Destination Address (6), Source Address (6), Tag (4), Length/Type (2),
Data/Pad (46–1500), and Frame Check Sequence (4). The 4-byte Tag comprises the
following subfields (lengths in bits): EtherType (16), Priority (3), Canonical Format
Indicator (CFI) (1), and VLAN ID (VID) (12).
EtherType - 2 bytes long and must contain the value 0x8100 to indicate that the
following two bytes contain priority and VLAN information. This allows Ethernet
switches to recognize tagged frames so special processing can be applied.
VLAN ID (VID) - 12 bits long. It contains a binary number between 2 and 4094,
inclusive. VIDs 0, 1, and 4095 are reserved.
Canonical Format Indicator (CFI) bit - facilitates use of a common tag header for
multiple, dissimilar network types (for example, Ethernet and Token Ring).
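The tag layout can be illustrated with a few lines of Python. The helper names are mine, and the VID range check follows the range of 2 through 4094 stated above:

```python
def build_dot1q_tag(priority: int, cfi: int, vid: int) -> bytes:
    """Build the 4-byte 802.1Q tag: EtherType 0x8100 followed by the
    16-bit Tag Control Information (3-bit priority, 1-bit CFI, 12-bit VID)."""
    assert 0 <= priority <= 7 and cfi in (0, 1) and 2 <= vid <= 4094
    tci = (priority << 13) | (cfi << 12) | vid
    return (0x8100).to_bytes(2, "big") + tci.to_bytes(2, "big")

def parse_dot1q_tag(tag: bytes):
    """Return the tag fields, or None if the frame is not 802.1Q-tagged."""
    if int.from_bytes(tag[:2], "big") != 0x8100:
        return None  # first two bytes are an ordinary Length/Type field
    tci = int.from_bytes(tag[2:4], "big")
    return {"priority": tci >> 13, "cfi": (tci >> 12) & 1, "vid": tci & 0xFFF}

fields = parse_dot1q_tag(build_dot1q_tag(priority=5, cfi=0, vid=100))
```

The 0x8100 test is exactly how a switch distinguishes a tagged frame from an untagged one: in an untagged frame, the bytes at that offset hold the ordinary Length/Type field instead.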
The brief field descriptions provided in this section do not encompass all the functionality
provided by each of the fields. For more information, readers are encouraged to consult the
IEEE 802.3-2002, 802.2-1998, 802-2001, and 802.1Q-2003 specifications.
Most Ethernet switches provide only unacknowledged, connectionless service (Type 1),
which contributes to the public's misunderstanding of Ethernet's full capabilities. Because
the other two service types are rarely used, the delivery mechanisms employed by the LLC
sublayer to provide those types of service are outside the scope of this book. Ethernet
networks that provide Type 1 service implement the following delivery mechanisms:
Ethernet devices do not detect frames dropped in transit. When an Ethernet device
drops a frame, it does not report the drop to ULPs or peer nodes. ULPs are expected
to detect the drop via their own mechanisms.
Ethernet devices can detect corrupt frames via the FCS field. Upon detection of a
corrupt frame, the frame is dropped. Regardless of whether an intermediate switch
or the destination node drops the frame, no notification is sent to any node or ULP.
Some Ethernet switches employ cut-through switching techniques and are unable to
detect corrupt frames. Thus, corrupt frames are forwarded to the destination node and
subsequently dropped. However, most Ethernet switches employ a store-and-forward architecture capable of detecting and dropping corrupt frames.
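Because the FCS is a standard CRC-32, the check a store-and-forward switch performs can be sketched in Python. This is purely illustrative (real NICs and switches compute the CRC in hardware); `zlib.crc32` happens to implement the same polynomial Ethernet uses:

```python
import zlib

def append_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 FCS to an outgoing frame (sent least-significant byte first)."""
    fcs = zlib.crc32(frame)
    return frame + fcs.to_bytes(4, "little")

def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC over the received frame, as a store-and-forward
    switch does; on a mismatch the frame is silently dropped."""
    frame, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return zlib.crc32(frame).to_bytes(4, "little") == fcs

frame = append_fcs(b"example payload")
corrupted = frame[:-5] + b"\x00" + frame[-4:]  # damage the payload, keep the old FCS
```

The key behavior to notice is what is absent: `fcs_ok` returning False triggers a drop and nothing else. No notification goes to any node or ULP.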
The IEEE 802.3-2002 specification does not define methods for fragmentation or
reassembly because the necessary header fields do not exist. An MTU mismatch
results in frame drop. Thus, each physical Ethernet network must have a common
MTU on all links. That means PMTU discovery is not required within an Ethernet
network. MTU mismatches between physically separate Ethernet networks are
handled by an ULP in the device that connects the Ethernet networks (for example,
IP in a router). Likewise, an ULP is expected to provide end-to-end PMTU
discovery.
In-order delivery is not guaranteed. Ethernet devices do not support frame reordering.
ULPs are expected to detect out-of-order frames and provide frame reordering.
All links in a port channel must use the same aggregation protocol (LACP or PAgP).
All links in a non-trunking port channel must belong to the same VLAN.
All links in a port channel must connect a single pair of devices (that is, only point-to-point configurations are permitted).
All links in a port channel must operate at the same transmission rate.
If any link in a port channel is configured as non-trunking, all links in that port channel
must be configured as non-trunking. Likewise, if any link in a port channel is
configured as trunking, all links in that port channel must be configured as trunking.
All links in a trunking port channel must trunk the same set of VLANs.
All links in a non-trunking port channel must use the same frame format.
All links in a trunking port channel must use the same trunking frame format.
Some of these restrictions are not specified in 802.3-2002, but they are required for proper
operation. Similarly, there is no de jure limit on the maximum number of links that may be
grouped into a single port channel or the maximum number of port channels that may
be configured on a single switch. However, product design considerations may impose
practical limits that vary from vendor to vendor. The 802.3-2002 specification seeks to
minimize the probability of duplicate and out-of-order frame delivery across an Ethernet
port channel. However, it is possible for these outcomes to occur during reconfiguration or
recovery from a link failure.
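A minimal consistency check for the member-link rules above might look like the following sketch. The attribute names are illustrative only and do not correspond to any real switch API:

```python
def validate_port_channel(links):
    """Return a list of rule violations for a proposed port channel.
    Each link is a dict of attributes; names here are illustrative."""
    if not links:
        return ["port channel has no member links"]
    errors = []
    first = links[0]
    # Aggregation protocol, transmission rate, trunking mode, and frame
    # format must match on every member link.
    for attr in ("protocol", "speed", "trunking", "frame_format"):
        if any(link[attr] != first[attr] for link in links):
            errors.append(f"member links disagree on {attr}")
    if first["trunking"]:
        if any(link["allowed_vlans"] != first["allowed_vlans"] for link in links):
            errors.append("trunking members must trunk the same set of VLANs")
    elif any(link["vlan"] != first["vlan"] for link in links):
        errors.append("non-trunking members must belong to the same VLAN")
    return errors

link_a = {"protocol": "LACP", "speed": 1000, "trunking": False,
          "frame_format": "Ethernet II", "vlan": 10, "allowed_vlans": None}
link_b = dict(link_a, speed=100)  # mismatched transmission rate
```

A switch performs equivalent checks when a link is added to a channel group; links that fail them are typically suspended rather than aggregated.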
Config_Reg, continues transmitting until the Link_Timer expires (10ms by default) and
begins resolving a common parameter set. If a matching configuration is resolved, normal
communication ensues upon expiration of the Link_Timer. If successful negotiation cannot
be accomplished for any reason, the network administrator must intervene. Figure 5-11
illustrates the 1000BASE-X Configuration ordered sets.
Figure 5-11 1000BASE-X Configuration Ordered Sets
The /C1/ ordered set comprises the code groups K28.5 and D21.5 followed by the 16-bit
Config_Reg. The /C2/ ordered set comprises K28.5 and D2.2 followed by the Config_Reg.
The Config_Reg contains the Full Duplex (FD), Half Duplex (HD), Pause 1 (PS1),
Pause 2 (PS2), Remote Fault 1 (RF1), Remote Fault 2 (RF2), and Next Page (NP) bits,
plus reserved bits.
Full Duplex (FD) bit - used to indicate whether full duplex mode is supported.
Remote Fault 1 (RF1) and Remote Fault 2 (RF2) bits - used together to indicate
to the remote device whether a fault has been detected by the local device and, if so,
the type of fault (offline, link error, or auto-negotiation error).
Next Page (NP) bit - indicates that one or more /C/ ordered sets follow, and each
contains parameter information in one of two alternative formats: message page or
unformatted page. A message page must always precede an unformatted page to
indicate how to interpret the unformatted page(s). An unformatted page can be used
for several purposes.
Half Duplex (HD) bit - used to indicate whether half duplex mode is supported.
Pause 1 (PS1) and Pause 2 (PS2) bits - used together to indicate the supported flow-control modes (asymmetric, symmetric, or none).
auto-negotiation in 100-Mbps twisted-pair based Ethernet implementations (100BASE-TX, 100BASE-T2, and 100BASE-T4). A special mechanism is defined for 10BASE-T
implementations because 10BASE-T does not support the FLP. Because 10BASE-T is
irrelevant to modern storage networks, only the FLP mechanism is discussed in this section.
The 16 data bits in the FLP are collectively called the link code word (LCW). The LCW
represents the transmitter's 16-bit advertisement register (Register 4), which is equivalent
to the 1000BASE-X Config_Reg. Like 1000BASE-X, all capabilities are advertised to
the peer device by default, but it is possible to mask some capabilities. If more than one set
of operating parameters is common to a pair of connected devices, a predefined priority
policy determines which parameter set will be used. The highest common capabilities are
always selected. Unlike 1000BASE-X, the FLP is independent of the bit-level encoding
scheme used during normal communication. That independence enables twisted-pair based
Ethernet implementations to auto-negotiate the transmission rate. Of course, it also means
that all operating parameters must be negotiated prior to bit-level synchronization. So, the
FLP is well defined to allow receivers to achieve temporary bit-level synchronization on a
per-FLP basis. The FLP is transmitted immediately following link power-on and is repeated
at a specific time interval.
In contrast to the 1000BASE-X procedure, wherein /C/ ordered sets are initially transmitted
without conveying the Config_Reg, twisted-pair based implementations convey Register 4
via the LCW in every FLP transmitted. Upon recognition of three consecutive matching
LCWs without error, the receiving device sets the Acknowledge bit to one in its LCW,
transmits another six to eight FLPs, and begins resolving a common parameter set. If a
matching configuration is resolved, transmission of the Idle symbol begins after the final
FLP is transmitted. Transmission of Idles continues until bit-level synchronization is
achieved followed by symbol alignment. Normal communication then ensues. If successful
negotiation cannot be accomplished for any reason, the network administrator must
intervene. Figure 5-12 illustrates the Ethernet FLP LCW.
Figure 5-12 Ethernet FLP Link Code Word
The 16-bit LCW comprises the Selector field (5 bits), the Technology Ability field
(8 bits: 10BASE-T, 10BASE-T Full Duplex, 100BASE-TX, 100BASE-TX Full Duplex,
100BASE-T4, Pause, ASM_DIR, and a reserved bit), and the Remote Fault, Acknowledge,
and Next Page bits (1 bit each).
Selector - 5 bits long. It indicates the technology implemented by the local device.
Valid choices include 802.3, 802.5, and 802.9.
Technology Ability - 8 bits long. It indicates the abilities of the local device.
Abilities that can be advertised include transmission rate (10-Mbps or
100-Mbps), duplex mode (half or full), and flow-control mode (asymmetric,
symmetric, or none). To negotiate 1000-Mbps operation, the Next Page field must
be used.
Remote Fault bit - used to indicate to the remote device that a fault has been
detected by the local device. When a fault is detected, the Remote Fault bit is set to 1,
and auto-negotiation is re-initiated. The Next Page field may be optionally used to
indicate the nature of the fault.
Next Page bit - indicates that one or more FLPs follow, containing LCW
information in one of two alternative formats: message page or unformatted page.
A message page must always precede an unformatted page to indicate how to
interpret the unformatted page(s). An unformatted page can be used for several
purposes, including negotiation of 1000-Mbps operation.
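The "highest common capability" policy can be sketched as a ranked search. The ranking below is an abbreviated version of the priority order in IEEE 802.3 Annex 28B (1000-Mbps modes, negotiated via Next Pages, are omitted), and the ability labels are mine:

```python
# Abbreviated priority ranking (highest first), after IEEE 802.3 Annex 28B.
PRIORITY = ["100BASE-TX-FD", "100BASE-T4", "100BASE-TX", "10BASE-T-FD", "10BASE-T"]

def resolve(local: set, remote: set):
    """Return the highest capability advertised by both devices, or None
    if the devices share no common mode (administrator must intervene)."""
    common = local & remote
    for ability in PRIORITY:          # scan from highest to lowest priority
        if ability in common:
            return ability
    return None

mode = resolve({"100BASE-TX-FD", "100BASE-TX", "10BASE-T"},
               {"100BASE-TX", "10BASE-T-FD", "10BASE-T"})
```

In this example the only modes both peers advertise are 100BASE-TX and 10BASE-T, so 100BASE-TX is selected.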
The preceding description of the twisted-pair based Ethernet link initialization procedure is
simplified for the sake of clarity. For more detail about FLP usage, Next Page formats, field
interpretations, and auto-negotiation states, readers are encouraged to consult clause 28
and all associated annexes of IEEE 802.3-2002.
The IEEE 802.3-2002 specification recommends that manual configuration be achieved not
by disabling auto-negotiation, but by masking selected capabilities when advertising to the
peer device. This choice is vendor dependent. The remainder of this paragraph describes
the procedures followed when auto-negotiation is disabled. When manually configuring an
interface, the network administrator typically is allowed to specify the transmission rate and
duplex mode of each twisted-pair interface. For fiber interfaces, the transmission rate is
fixed and cannot be altered, but the duplex mode can be specified. Some products allow
additional granularity in manual configuration mode. In the absence of additional granularity,
network administrators must consult the product documentation to determine the default
values of operating parameters that cannot be explicitly configured.
As previously stated, the order of events following power-on depends on the media type.
For 1000BASE-X, bit-level synchronization is achieved followed by word alignment.
Normal communication is then attempted. If compatible operating parameters are
configured, successful communication ensues. Otherwise, the link might come up, but
frequent errors occur. For twisted-pair interfaces, bit-level synchronization is attempted.
If successful, symbol alignment occurs. Otherwise, the link does not come online. Once
symbol alignment is achieved, normal communication is attempted. If compatible operating
parameters are configured, successful communication ensues. Otherwise, the link might
come up, but frequent errors occur. If a manually configured link cannot come up or
experiences frequent errors because of operating parameter mismatch, the network
administrator must intervene.
Fibre Channel
Compared to Ethernet, FC is a complex technology. FC attempts to provide functionality
equivalent to that provided by Ethernet plus elements of IP, UDP, and TCP. So, it is difficult
to compare FC to just Ethernet. FC promises to continue maturing at a rapid pace and is
currently considered the switching technology of choice for block-level storage protocols. As
block-level SANs proliferate, FC is expected to maintain its market share dominance in the
short term. The long term is difficult to predict in the face of rapidly maturing IPS protocols,
but FC already enjoys a sufficiently large installed base to justify a detailed examination of
FC's inner workings. This section explores the details of FC operation at OSI Layers 1 and 2.
Table 5-7
FC Variant | Medium | Modal Bandwidth | Connectors | Transceiver | Operating Range (m)
100-SM-LL-V | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC | 1550nm laser | 2–50k
100-SM-LL-L | 9 μm SMF | N/A | Duplex SC | 1300nm laser | 2–10k
100-SM-LC-L | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1300nm laser (Cost Reduced) | 2–10k
100-SM-LL-I | 9 μm SMF | N/A | Duplex SC | 1300nm laser | 2–2k
100-M5-SN-I | 50 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–500
100-M5-SN-I | 50 μm MMF | 400 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–450
100-M5-SL-I | 50 μm MMF | 500 MHz*km | Duplex SC | 780nm laser | 2–500
100-M6-SN-I | 62.5 μm MMF | 200 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–300
100-M6-SN-I | 62.5 μm MMF | 160 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 2–300
100-M6-SL-I | 62.5 μm MMF | 160 MHz*km | Duplex SC | 780nm laser | 2–175
200-SM-LL-V | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC | 1550nm laser | 2–50k
200-SM-LC-L | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1300nm laser (Cost Reduced) | 2–10k
200-SM-LL-I | 9 μm SMF | N/A | Duplex SC | 1300nm laser | 2–2k
200-M5-SN-I | 50 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–300
200-M5-SN-I | 50 μm MMF | 400 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–260
200-M6-SN-I | 62.5 μm MMF | 200 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–150
200-M6-SN-I | 62.5 μm MMF | 160 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–120
400-SM-LL-V | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC | 1550nm laser | 2–50k
400-SM-LC-L | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1300nm laser (Cost Reduced) | 2–10k
400-SM-LL-I | 9 μm SMF | N/A | Duplex SC | 1300nm laser | 2–2k
400-M5-SN-I | 50 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 2–175
400-M5-SN-I | 50 μm MMF | 400 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–130
400-M6-SN-I | 62.5 μm MMF | 200 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–70
400-M6-SN-I | 62.5 μm MMF | 160 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–55
1200-SM-LL-L | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1310nm laser | 2–10k
1200-SM-LC4-L | 9 μm SMF | N/A | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1269–1356nm CWDM lasers | 2–10k
1200-M5E-SN4-I | 50 μm Enhanced MMF | 1500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 772–857nm CWDM lasers | 0.5–550
1200-M5E-SN4P-I | 50 μm Enhanced MMF | 2000 MHz*km | MPO | 850nm Parallel lasers | 0.5–300
1200-M5E-SN-I | 50 μm Enhanced MMF | 2000 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–300
1200-M5-LC4-L | 50 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1269–1356nm CWDM lasers | 0.5–290
1200-M5-LC4-L | 50 μm MMF | 400 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1269–1356nm CWDM lasers | 0.5–230
1200-M5-SN4-I | 50 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 772–857nm CWDM lasers | 0.5–290
1200-M5-SN4P-I | 50 μm MMF | 500 MHz*km | MPO | 850nm Parallel lasers | 0.5–150
1200-M5-SN-I | 50 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–82
1200-M5-SN-I | 50 μm MMF | 400 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–66
1200-M6-LC4-L | 62.5 μm MMF | 500 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 1269–1356nm CWDM lasers | 0.5–290
1200-M6-SN4-I | 62.5 μm MMF | 200 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 772–857nm CWDM lasers | 0.5–118
1200-M6-SN4P-I | 62.5 μm MMF | 200 MHz*km | MPO | 850nm Parallel lasers | 0.5–75
1200-M6-SN-I | 62.5 μm MMF | 200 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–33
1200-M6-SN-I | 62.5 μm MMF | 160 MHz*km | Duplex SC, Duplex SG, Duplex LC, Duplex MT-RJ | 850nm laser | 0.5–26
Table 5-8
FC Variant | Encoding Scheme | BER Objective
100-SM-LL-V | 8B/10B | 10^-12
100-SM-LL-L | 8B/10B | 10^-12
100-SM-LC-L | 8B/10B | 10^-12
100-SM-LL-I | 8B/10B | 10^-12
100-M5-SN-I | 8B/10B | 10^-12
100-M5-SL-I | 8B/10B | 10^-12
100-M6-SN-I | 8B/10B | 10^-12
100-M6-SL-I | 8B/10B | 10^-12
200-SM-LL-V | 8B/10B | 10^-12
200-SM-LC-L | 8B/10B | 10^-12
200-SM-LL-I | 8B/10B | 10^-12
200-M5-SN-I | 8B/10B | 10^-12
200-M6-SN-I | 8B/10B | 10^-12
400-SM-LL-V | 8B/10B | 10^-12
400-SM-LC-L | 8B/10B | 10^-12
400-SM-LL-I | 8B/10B | 10^-12
400-M5-SN-I | 8B/10B | 10^-12
400-M6-SN-I | 8B/10B | 10^-12
1200-SM-LL-L | 64B/66B | 10^-12
1200-SM-LC4-L | 8B/10B | 10^-12
1200-M5E-SN4-I | 8B/10B | 10^-12
1200-M5E-SN4P-I | 8B/10B | 10^-12
1200-M5E-SN-I | 64B/66B | 10^-12
1200-M5-LC4-L | 8B/10B | 10^-12
1200-M5-SN4-I | 8B/10B | 10^-12
1200-M5-SN4P-I | 8B/10B | 10^-12
1200-M5-SN-I | 64B/66B | 10^-12
1200-M6-LC4-L | 8B/10B | 10^-12
1200-M6-SN4-I | 8B/10B | 10^-12
1200-M6-SN4P-I | 8B/10B | 10^-12
1200-M6-SN-I | 64B/66B | 10^-12
FC Addressing Scheme
FC employs an addressing scheme that directly maps to the SAM addressing scheme. FC
uses WWNs to positively identify each HBA and port, which represent the equivalent of
SAM device and port names, respectively. An FC WWN is a 64-bit value expressed in
colon-separated hexadecimal notation such as 21:00:00:e0:8b:08:a5:44. There are many
formats for FC WWNs, most of which provide universal uniqueness. Figure 5-13 illustrates
the basic ANSI T11 WWN address format.
Figure 5-13 Basic ANSI T11 WWN Address Format (NAA: 4 bits, Name: 60 bits)
NAA - 4 bits long. It indicates the type of address contained within the Name field
and the format of the Name field.
The Name field can contain a locally assigned address in any format, a mapped
external address in the format defined by the NAA responsible for that address type, or
a mapped external address in a modified format defined by the ANSI T11 subcommittee.
External addresses are mapped into the Name field according to the rules defined
in the FC-PH and FC-FS series of specifications. Six mappings are defined: IEEE
MAC-48, IEEE extended, IEEE registered, IEEE registered extended, IEEE EUI-64,
and IETF IPv4. The FC-DA specification series mandates the use of the IEEE MAC-48,
IEEE extended, IEEE registered, or IEEE EUI-64 format to ensure universal uniqueness and interoperability. All six formats are described herein for the sake of completeness.
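Because the NAA occupies the high-order bits of the WWN, the format can usually be inferred from the first byte alone. The following sketch assumes the colon-separated notation shown earlier; the NAA-to-format mapping reflects the six mappings just listed (per FC-FS, the EUI-64 indicator is any NAA value beginning with binary 11):

```python
NAA_FORMATS = {
    0x1: "IEEE MAC-48",
    0x2: "IEEE extended",
    0x3: "locally assigned",
    0x4: "IETF IPv4",
    0x5: "IEEE registered",
    0x6: "IEEE registered extended",
}

def wwn_format(wwn: str) -> str:
    """Classify a colon-separated WWN by its NAA value (the high-order nibble)."""
    naa = int(wwn.split(":")[0], 16) >> 4
    if naa >= 0xC:  # NAA values beginning with binary 11 indicate EUI-64
        return "IEEE EUI-64"
    return NAA_FORMATS.get(naa, "unknown")

fmt = wwn_format("21:00:00:e0:8b:08:a5:44")  # the example WWN shown earlier
```

Applied to the example WWN from the text, the leading nibble 0x2 identifies it as an IEEE extended name.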
Figure 5-14 illustrates the format of the Name field for containment of a locally assigned
address.
Figure 5-14 ANSI T11 Name Field Format for Locally Assigned Addresses (NAA: 4 bits,
Vendor Assigned: 60 bits)
Figure 5-15 illustrates the format of the Name field for containment of an IEEE MAC-48
address.
Figure 5-15 ANSI T11 Name Field Format for IEEE MAC-48 Addresses (NAA: 4 bits,
Reserved: 12 bits, MAC-48: 48 bits)
Figure 5-16 illustrates the format of the Name field for containment of an IEEE extended
address.
Figure 5-16 ANSI T11 Name Field Format for IEEE Extended Addresses (NAA: 4 bits,
Vendor Assigned: 12 bits, MAC-48: 48 bits)
Vendor Assigned - 12 bits long. It can contain any series of bits in any format as
defined by the vendor.
Figure 5-17 illustrates the format of the Name field for containment of an IEEE registered
address.
Figure 5-17 ANSI T11 Name Field Format for IEEE Registered Addresses (NAA: 4 bits,
IEEE Assigned OUI: 24 bits, Vendor Assigned: 36 bits)
Vendor Assigned - 36 bits long. It can contain any series of bits in any format as
defined by the vendor.
OUI - 24 bits long. It contains the vendor's IEEE assigned identifier. The U/L and
I/G bits have no significance and are set to 0.
The IEEE registered extended format is atypical because it is the only WWN format that is
not 64 bits long. An extra 64-bit field is appended, yielding a total WWN length of 128 bits.
The extra length creates some interoperability issues. Figure 5-18 illustrates the format of
the Name field for containment of an IEEE registered extended address.
Figure 5-18 ANSI T11 Name Field Format for IEEE Registered Extended Addresses
(NAA: 4 bits, IEEE Assigned OUI: 24 bits, Vendor Assigned: 36 bits, Vendor Assigned
Extension: 64 bits)
Vendor Assigned - 36 bits long. It can contain any series of bits in any format as
defined by the vendor.
Vendor Assigned Extension - 64 bits long. It can contain any series of bits in any
format as defined by the vendor.
OUI - 24 bits long. It contains the vendor's IEEE assigned identifier. The U/L and
I/G bits have no significance and are set to 0.
Figure 5-19 illustrates the format of the Name field for containment of an IEEE EUI-64
address.
Figure 5-19 ANSI T11 Name Field Format for IEEE EUI-64 Addresses (NAA: 2 bits,
Modified IEEE Assigned OUI: 22 bits, Vendor Assigned: 40 bits)
NAA - 2 bits long. It is set to 11. Because the EUI-64 format is the same length
as the FC WWN format, the NAA bits must be taken from the EUI-64 address. To
make this easier to accomplish, all NAA values beginning with 11 are designated as
EUI-64 indicators. This has the effect of shortening the NAA field to 2 bits.
Therefore, only 2 bits need to be taken from the EUI-64 address.
OUI - 22 bits long. It contains a modified version of the vendor's IEEE assigned
identifier. The U/L and I/G bits are omitted from the first byte of the OUI, and the
remaining 6 bits of the first byte are right-shifted two bit positions to make room for
the 2 NAA bits.
Vendor Assigned - 40 bits long. It can contain any series of bits in any format as
defined by the vendor.
Figure 5-20 illustrates the format of the Name field for containment of an IETF IPv4
address.
Figure 5-20 ANSI T11 Name Field Format for IETF IPv4 Addresses (NAA: 4 bits,
Reserved: 28 bits, IPv4 Address: 32 bits)
The FC equivalent of the SAM port identifier is the FC port identifier (Port_ID). The FC
Port_ID is embedded in the FC Address Identifier. The FC Address Identifier consists of a
hierarchical 24-bit value, and the lower 8 bits make up the Port_ID. The entire FC Address
Identifier is sometimes referred to as the FCID (depending on the context). In this book, the
phrase FC Address Identifier and the term FCID are used interchangeably except in the
context of address assignment. The format of the 24-bit FCID remains unchanged from its
original specification. This simplifies communication between FC devices operating at
different speeds and preserves the legacy FC frame format. FCIDs are expressed in space-separated hexadecimal notation such as 0x64 03 E8. Some devices omit the spaces when
displaying FCIDs. Figure 5-21 illustrates the ANSI T11 Address Identifier format.
Figure 5-21 ANSI T11 Address Identifier Format (Domain_ID: 8 bits, Area_ID: 8 bits,
Port_ID: 8 bits)
Domain ID is the first level of hierarchy. This field is 8 bits long. It identifies one or
more FC switches.
Area ID is the second level of hierarchy. This field is 8 bits long. It identifies one or
more end nodes.
Port ID is the third level of hierarchy. This field is 8 bits long. It identifies a single end
node. This field is sometimes called the FCID, which is why the term FCID can be
confusing if the context is not clear.
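The three-level hierarchy amounts to simple bit slicing, which can be sketched as follows (the function name is mine):

```python
def split_fcid(fcid: int) -> dict:
    """Split a 24-bit FC Address Identifier into its three 8-bit levels."""
    return {
        "domain_id": (fcid >> 16) & 0xFF,  # identifies one or more switches
        "area_id": (fcid >> 8) & 0xFF,     # identifies a group of ports
        "port_id": fcid & 0xFF,            # identifies a single end node
    }

parts = split_fcid(0x6403E8)  # the FCID written as 0x64 03 E8 in the text
```

For the example FCID 0x64 03 E8, the Domain_ID is 0x64, the Area_ID is 0x03, and the Port_ID is 0xE8.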
The SAM defines the concept of domain as the entire system of SCSI devices that interact
with one another via a service delivery subsystem. FC implements this concept of domain
via the first level of hierarchy in the FCID (Domain_ID). In a single-switch fabric, the
Domain_ID represents the single switch. In a multi-switch fabric, the Domain_ID should
represent all interconnected switches according to the SAM definition of domain. However,
FC forwards frames between switches using the Domain_ID. So, each switch must be
assigned a unique Domain_ID. To comply with the SAM definition of domain, the ANSI
T11 FC-SW-3 specification explicitly allows multiple interconnected switches to share a
single Domain_ID. However, the Domain_ID is not implemented in this manner by any
FC switch currently on the market.
The Area_ID can identify a group of ports attached to a single switch. The Area_ID
may not span FC switches. The FC-SW-3 specification does not mandate how fabric ports
should be grouped into an Area_ID. One common technique is to assign all ports in
a single slot of a switch chassis to the same Area_ID. Other techniques can be
implemented.
The Port_ID provides a unique identity to each HBA port within each Area_ID. Alternately,
the Area_ID field can be concatenated with the Port_ID field to create a 16-bit Port_ID.
In this case, no port groupings exist.
The FC standards allow multiple FC Address Identifiers to be associated with a single
HBA. This is known as N_Port_ID virtualization (NPIV). NPIV enables multiple virtual
initiators to share a single HBA by assigning each virtual initiator its own FC Address
Identifier. The normal FLOGI procedure is used to acquire the first FC Address Identifier.
Additional FC Address Identifiers are acquired using the discover F_port service
parameters (FDISC) ELS. When using NPIV, all virtual initiators must share the receive
buffers on the HBA. NPIV enhances server virtualization techniques by enabling FC-SAN security policies (such as zoning) and QoS policies to be enforced independently
for each virtual server. Note that some HBA vendors call their NPIV implementation
virtual HBA technology.
clears its internal Domain_ID list. The Domain_ID list is a cached record of all
Domain_IDs that have been assigned and the switch NWWN associated with each.
Clearing the Domain_ID list has no effect during the initial configuration of a fabric
because each switch's Domain_ID list is already empty.
2 Each switch transmits a build fabric (BF) switch fabric internal link service (SW_ILS)
frame on each ISL. A SW_ILS is an ELS that may be transmitted only between
fabric elements (such as FC switches). Most SW_ILSs are defined in the
FC-SW specification series. If a BF SW_ILS frame is received on an ISL before
transmission of a BF SW_ILS frame on that ISL, the recipient switch does not
transmit a BF SW_ILS frame on that ISL.
3 Each switch waits for the fabric stability time-out value (F_S_TOV) to expire before
originating exchange fabric parameters (EFP) SW_ILS frames. This allows the BF
SW_ILS frames to flood throughout the entire fabric before any subsequent action.
4 Each switch transmits an EFP SW_ILS frame on each ISL. If an EFP SW_ILS frame
is received on an ISL before transmission of an EFP SW_ILS frame on that ISL, the
recipient switch transmits a switch accept (SW_ACC) SW_ILS frame on that ISL
instead of transmitting an EFP SW_ILS frame. Each EFP and associated SW_ACC
SW_ILS frame contains a PS_Priority field, a PS_Name field, and a Domain_ID_List
field. The PS_Priority and PS_Name fields of an EFP SW_ILS frame initially contain
the priority and NWWN of the transmitting switch. The priority and NWWN are
concatenated to select the PS. The lowest concatenated value wins. Upon receipt of
an EFP or SW_ACC SW_ILS frame containing a priority-NWWN value lower than
the recipient switch's value, the F_S_TOV timer is reset, the new priority-NWWN
value is cached, and the recipient switch transmits an updated EFP SW_ILS frame
containing the cached priority-NWWN value on all ISLs except the ISL on which
the lower value was received. This flooding continues until all switches agree on the
PS. Each switch determines there is PS agreement upon expiration of F_S_TOV. The
Domain_ID_List field remains empty during the PSS process but is used during the
subsequent Domain_ID Distribution process.
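The priority-NWWN comparison in step 4 can be modeled directly: a (priority, NWWN) tuple compares lexicographically, which matches the concatenated comparison. This sketch models only the selection rule, not the EFP flooding, and the data structures are illustrative:

```python
def elect_principal_switch(switches):
    """Return the switch with the lowest concatenated priority-NWWN value.
    Tuples compare element by element, matching the concatenated comparison
    performed during PSS."""
    return min(switches, key=lambda s: (s["priority"], s["nwwn"]))

fabric = [
    {"priority": 0x02, "nwwn": "20:00:00:0d:ec:01:02:03"},
    {"priority": 0x02, "nwwn": "20:00:00:0d:ec:0a:0b:0c"},
    {"priority": 0x80, "nwwn": "20:00:00:0d:ec:00:00:01"},
]
# The first two switches tie on priority (0x02), so the lower NWWN wins.
ps = elect_principal_switch(fabric)
```

Note that the third switch has the lowest NWWN but loses anyway: priority is the high-order portion of the concatenated value, so it dominates the comparison.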
Upon successful completion of the PSS process, the Domain_ID Distribution process
ensues. Domain_IDs can be manually assigned by the network administrator, but the
Domain_ID Distribution process still executes so the PS (also known as the domain
address manager) can compile a list of all assigned Domain_IDs, ensure there are
no overlapping Domain_IDs, and distribute the complete Domain_ID_List to all other
switches in the fabric. The Domain_ID Distribution process involves the following events,
which occur in the order listed:
1 The PS assigns itself a Domain_ID.
2 The PS transmits a Domain_ID Assigned (DIA) SW_ILS frame on all ISLs. The
DIA SW_ILS frame indicates that the transmitting switch has been assigned a
Domain_ID. A received DIA SW_ILS frame is never forwarded by the recipient
switch.
3 Each recipient switch replies to the DIA SW_ILS frame with an SW_ACC SW_ILS
frame.
4 Each switch that replied to the DIA SW_ILS frame transmits a Request Domain_ID
(RDI) SW_ILS frame to the PS. The RDI SW_ILS frame may optionally contain one
or more preferred Domain_IDs. During reconfiguration of a previously operational
fabric, each switch may list its previous Domain_ID as its preferred Domain_ID.
Alternatively, a preferred or static Domain_ID can be manually assigned to each
switch by the network administrator. If the transmitting switch does not have a
preferred or static Domain_ID, it indicates this in the RDI SW_ILS frame by listing
its preferred Domain_ID as 0x00.
5 The PS assigns a Domain_ID to each switch that transmitted an RDI SW_ILS
ISLs except the ISL that connects to the PS (called the upstream principal ISL).
7 Each recipient switch replies to the DIA SW_ILS frame with an SW_ACC SW_ILS
frame.
8 Each switch that replied to the DIA SW_ILS frame transmits an RDI SW_ILS frame
principal ISL.
10 The PS assigns a Domain_ID to each switch that transmitted an RDI SW_ILS frame
SW_ACC SW_ILS frame and forwards the EFP SW_ILS frame on all ISLs except
the Upstream Principal ISL. Thus, the EFP SW_ILS frame propagates outward
from the PS until all switches have received it.
The preceding descriptions of the PSS and Domain_ID Distribution processes are simplified
to exclude error conditions and other contingent scenarios. For more information about
these processes, readers are encouraged to consult the ANSI T11 FC-SW-3 specification.
The eight-bit Domain_ID field mathematically accommodates 256 Domain_IDs, but some
Domain_IDs are reserved. Only 239 Domain_IDs are available for use as FC switch
identifiers. Table 5-9 lists all FC Domain_ID values and the status and usage of each.
Table 5-9
Domain_ID | Status | Usage
0x00 | Reserved | FC-AL Environments
0x01-EF | Available | Switch Domain_IDs
0xF0-FE | Reserved | None
0xFF | Reserved | WKAs
As the preceding table indicates, some Domain_IDs are reserved for use in WKAs. Some
WKAs facilitate access to fabric services. Table 5-10 lists the currently defined FC WKAs
and the fabric service associated with each.
Table 5-10
WKA | Fabric Service
0xFF FF F5 | Multicast Server
0xFF FF F6 | Clock Synchronization Server
0xFF FF F7 | Security Key Distribution Server
0xFF FF F8 | Alias Server
0xFF FF F9 | QoS Facilitator
0xFF FF FA | Management Server
0xFF FF FB | Time Server
0xFF FF FC | Directory Server
0xFF FF FD | Fabric Controller
0xFF FF FE | Fabric Login Server
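A lookup over the well-known addresses can be modeled as follows. Only a subset of the assignments is included here (those whose services are named unambiguously in this section), and FCIDs are treated as plain 24-bit integers:

```python
# A subset of the well-known addresses, keyed by the full 24-bit
# FC Address Identifier.
WELL_KNOWN_ADDRESSES = {
    0xFFFFF5: "Multicast Server",
    0xFFFFF8: "Alias Server",
    0xFFFFFA: "Management Server",
    0xFFFFFB: "Time Server",
    0xFFFFFC: "Directory Server",
    0xFFFFFD: "Fabric Controller",
}

def lookup_service(fcid: int):
    """Return the fabric service behind a well-known address, or None
    for an ordinary N_Port address."""
    return WELL_KNOWN_ADDRESSES.get(fcid)
```

A frame addressed to 0xFF FF FC, for example, is destined for the directory server rather than for any attached N_Port.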
In the context of address assignment mechanisms, the term FCID refers only to the Area_ID and Port_ID fields of the FC Address Identifier. These two values can be assigned dynamically by the FC switch or statically by either the FC switch or the network administrator. Dynamic FCID assignment can be fluid or persistent. With dynamic-fluid assignment, FCID assignments may be completely randomized each time an HBA port boots or resets. With dynamic-persistent assignment, the first assignment of an FCID to an HBA port may be completely randomized, but each subsequent boot or reset of that HBA port will result in reassignment of the same FCID. With static assignment, the first assignment of an FCID to an HBA port is predetermined by the software design of the FC switch or by the network administrator, and persistence is inherent in both cases.
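The relationship between the FCID and the full FC Address Identifier reduces to byte packing: Domain_ID, Area_ID, and Port_ID occupy one byte each of the 24-bit address. A minimal sketch (helper names are illustrative):

```python
# Sketch: pack/unpack the 24-bit FC Address Identifier.
# Domain_ID, Area_ID, and Port_ID are one byte each; in the address
# assignment context above, "FCID" means only the lower two bytes.

def pack_fc_address(domain_id, area_id, port_id):
    for value in (domain_id, area_id, port_id):
        if not 0x00 <= value <= 0xFF:
            raise ValueError("each field is one byte")
    return (domain_id << 16) | (area_id << 8) | port_id

def unpack_fc_address(address):
    """Return (Domain_ID, Area_ID, Port_ID)."""
    return (address >> 16) & 0xFF, (address >> 8) & 0xFF, address & 0xFF
```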
FC Address Identifiers are not required to be universally unique. In fact, the entire FC address space is available for use within each physical fabric. Likewise, the entire FC address space is available for use within each VSAN. This increases the scalability of each physical fabric that contains multiple VSANs. However, reusing the entire FC address space can prevent physical fabrics or VSANs from being non-disruptively merged due to potential address conflicts. Reusing the entire FC address space also prevents communication between physical fabrics via SAN routers and between VSANs via inter-VSAN routing (IVR) unless network address translation (NAT) is employed. NAT improves scalability by allowing reuse of the entire FC address space while simultaneously facilitating communication across physical fabric boundaries and across VSAN boundaries. However, because NAT negates universal FC Address Identifier uniqueness, potential address conflicts can still exist, and physical fabric/VSAN mergers can still be disruptive. NAT also increases configuration complexity, processing overhead, and management overhead. So, NAT represents a tradeoff between communication flexibility and configuration simplicity.
Address reservation schemes facilitate communication between physical fabrics or VSANs without using NAT by ensuring that there is no overlap between the addresses assigned within each physical fabric or VSAN. A unique portion of the FC address space is used within each physical fabric or VSAN. This has the effect of limiting the scalability of all interconnected physical fabrics or VSANs to a single instance of the FC address space. However, address reservation schemes eliminate potential address conflicts, so physical fabrics or VSANs can be merged non-disruptively.
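The core check in an address reservation scheme is that no two fabrics or VSANs reserve overlapping portions of the address space. A minimal sketch, reserving Domain_ID ranges per fabric (fabric names and ranges are hypothetical examples):

```python
# Sketch: verify that Domain_ID ranges reserved per fabric/VSAN do not
# overlap, so the fabrics can later be merged non-disruptively.

def ranges_overlap(a, b):
    """True if two Python ranges share any value."""
    return a.start < b.stop and b.start < a.stop

reservations = {
    "fabric_a": range(0x01, 0x20),  # Domain_IDs 0x01-0x1F
    "fabric_b": range(0x20, 0x40),  # Domain_IDs 0x20-0x3F
}

conflicts = [
    (x, y)
    for x in reservations
    for y in reservations
    if x < y and ranges_overlap(reservations[x], reservations[y])
]
```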
Note that some host operating systems use the FC Address Identifier of target ports to positively identify target ports, which is the stated purpose of PWWNs. Such operating systems require the use of dynamic-persistent or static FCIDs in combination with dynamic-persistent or static Domain_IDs. Note also that the processing of preferred Domain_IDs during the PSS process guarantees Domain_ID persistence in most cases without administrative intervention. In other words, the PSS process employs a dynamic-persistent Domain_ID assignment mechanism by default. However, merging two physical fabrics (or two VSANs) into one can result in Domain_ID conflicts. Thus, static Domain_ID assignment is required to achieve the highest availability of targets in the presence of host operating systems that use FC Address Identifiers to positively identify target ports. As long as static Domain_IDs are used, and the network administrator takes care to assign unique Domain_IDs across physical fabrics (or VSANs) via an address reservation scheme, dynamic-persistent FCID assignment can be used in place of static FCIDs without risk of address conflicts during physical fabric (or VSAN) mergers.
An HBA's FC Address Identifier is used as the destination address in all unicast frames sent to that HBA and as the source address in all frames (unicast, multicast, or broadcast) transmitted from that HBA. Two exceptions to the source address rule are defined: one related to FCID assignment (see the FC Link Initialization section) and another related to Class 6 multicast frames. FC multicast addressing is currently outside the scope of this book. Broadcast traffic is sent to the reserved FC Address Identifier 0xFF FF FF. Broadcast traffic delivery is subject to operational parameters such as zoning policy and class of service. All FC devices that receive a frame sent to the broadcast address accept the frame and process it accordingly.
NOTE
In FC, multicast addresses are also called Alias addresses. This should not be confused with
PWWN aliases that are optionally used during zoning operations. Another potential point
of confusion is Hunt Group addressing, which involves the use of Alias addresses in a
particular manner. Hunt Groups are currently outside the scope of this book.
FC Media Access
As stated in Chapter 3, "Overview of Network Operating Principles," FC-AL is a shared media implementation, so it requires some form of media access control. However, we use FC-AL primarily for embedded applications (such as connectivity inside a tape library) today, so the FC-AL arbitration mechanism is currently outside the scope of this book. In switched FC implementations, arbitration is not required because full-duplex communication is employed. Likewise, the FC PTP topology used for DAS configurations supports full-duplex communication and does not require arbitration.
FC Network Boundaries
Traditional FC-SANs are physically bounded by media terminations (for example, unused
switch ports) and end node interfaces (for example, HBAs). No control information or user
data can be transmitted between FC-SANs across physical boundaries. Figure 5-22
illustrates the physical boundaries of a traditional FC-SAN.
Figure 5-22 Traditional FC-SAN Boundaries
[The figure shows a computer with an HBA attached to an FC switch, which in turn attaches to a storage array; each link is an attenuation domain, and the network boundary encompasses the end-to-end FC connectivity.]
FC-SANs also have logical boundaries, but the definition of a logical boundary in Ethernet networks does not apply to FC-SANs. Like the Ethernet architecture, the FC architecture does not define any native functionality at OSI Layer 3. However, Ethernet is used in conjunction with autonomous OSI Layer 3 protocols as a matter of course, so logical boundaries can be easily identified at each OSI Layer 3 entity. By contrast, normal FC communication does not employ autonomous OSI Layer 3 protocols. So, some OSI Layer 2 control information must be transmitted between FC-SANs across logical boundaries to facilitate native communication of user data between FC-SANs. Currently, there is no standard method of facilitating native communication between FC-SANs. Leading FC switch vendors have created several proprietary methods. The ANSI T11 subcommittee is considering all methods, and a standard method is expected in 2006 or 2007. Because of the proprietary and transitory nature of the current methods, further exploration of this topic is currently outside the scope of this book. Note that network technologies autonomous from FC can be employed to facilitate communication between FC-SANs. Non-native FC transports are defined in the FC-BB specification series. Chapter 8, "OSI Session, Presentation and Application Layers," discusses one such transport (Fibre Channel over TCP/IP [FCIP]) in detail.
FC-SANs also can have virtual boundaries. There is currently only one method of creating virtual FC-SAN boundaries. Invented in 2002 by Cisco Systems, VSANs are now widely deployed in the FC-SAN market. In 2004, ANSI began researching alternative solutions for virtualization of FC-SAN boundaries. In 2005, ANSI selected Cisco's VSAN technology as the basis for the only standards-based solution (called Virtual Fabrics). The new standards (FC-SW-4, FC-FS-2, and FC-LS) are expected to be finalized in 2006. VSANs are similar to VLANs in the way traffic isolation is provided. Typically, each switch port is statically assigned to a single VSAN by the network administrator. Alternately, each switch port can be dynamically assigned to a VSAN via Cisco's dynamic port VSAN membership (DPVM) technology. DPVM is similar in function to Ethernet's GVRP. Like Ethernet, an FC switch port can belong to multiple VSANs. However, this is used exclusively on ISLs; HBAs do not currently support VSAN trunking. As frames enter a switch from an end node, the switch prepends a tag to indicate the VSAN membership of the ingress port. The tag remains intact until the frame reaches the egress switch port that connects the destination end node. The switch removes the tag and transmits the frame to the destination end node. FC switches made by Cisco Systems use VSAN tags to ensure that no frames are forwarded between VSANs. Thus, VSAN boundaries mimic physical FC-SAN boundaries. User data can be forwarded between VSANs only via IVR. IVR is one of the native FC logical boundaries alluded to in the preceding paragraph. IVR can be used with all of the non-native FC transports defined in the FC-BB specification series.
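The tag-on-ingress, strip-on-egress behavior described above can be modeled in a few lines. A minimal sketch (port names, VSAN numbers, and the frame representation are hypothetical; real switches do this in hardware):

```python
# Sketch: VSAN tagging — tag at ingress, refuse to cross VSAN
# boundaries, strip the tag toward the destination end node.

port_vsan = {"fc1/1": 10, "fc1/2": 10, "fc1/3": 20}  # static assignments

def ingress(port, frame):
    # Prepend a tag reflecting the ingress port's VSAN membership.
    return {**frame, "vsan_tag": port_vsan[port]}

def egress(port, frame):
    # Frames are never forwarded between VSANs (absent IVR).
    if frame["vsan_tag"] != port_vsan[port]:
        return None
    out = dict(frame)
    del out["vsan_tag"]  # tag removed before reaching the end node
    return out

tagged = ingress("fc1/1", {"d_id": 0x0A012C})
```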
VSANs provide additional functionality not provided by VLANs. The FC specifications outline a model in which all network services (for example, the zone server) may run on one or more FC switches. This contrasts with the TCP/IP model, in which network services other than routing protocols typically run on one or more hosts attached to the network (for example, a DHCP server). The FC service model enables switch vendors to instantiate independent network services within each VSAN during the VSAN creation process. This is the case with FC switches made by Cisco Systems. A multi-VSAN FC switch has an instance of each network service operating independently within each VSAN. This enables network administrators to achieve higher availability, security, and flexibility by providing complete isolation between VSANs. When facilitating communication between VSANs, IVR selectively exports control information bidirectionally between services in the affected VSANs without fusing the services. This is similar in concept to route redistribution between dissimilar IP routing protocols. The result is preservation of the service isolation model.
FC Frame Formats
FC uses one general frame format for many purposes. The general frame format has not
changed since the inception of FC. The specic format of an FC frame is determined
by the function of the frame. FC frames are word-oriented, and an FC word is 4 bytes.
Figure 5-23 illustrates the general FC frame format.
[Figure 5-23 depicts the general FC frame format: a 4-byte Start of Frame, a 24-byte header, a data field of up to 2112 bytes (optional ESP, Network, Association, and Device headers, payload, and fill bytes), a 4-byte CRC, and a 4-byte End of Frame.]
Start of Frame (SOF) ordered set: 4 bytes long. It delimits the beginning of a frame, indicates the Class of Service, and indicates whether the frame is the first frame of a new sequence or a subsequent frame in an active sequence.
Optional Network header: 16 bytes long. It is used by devices that connect FC-SANs to non-native FC networks.
Optional Device header: 16, 32, or 64 bytes long. It is used by some ULPs. The format of the Device Header is variable and is specified by each ULP that makes use of the header. FCP does not use this header.
CRC: 4 bytes long. It contains a CRC value calculated on the FC Header field and Data field.
End of Frame (EOF) ordered set: 4 bytes long. It delimits the end of a frame, indicates whether the frame is the last frame of an active sequence, indicates the termination status of an Exchange that is being closed, and sets the running disparity to negative.
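The field lengths above lend themselves to a quick size check. A minimal sketch (the 24-byte FC header and the 2112-byte data field maximum are standard FC values; the helper itself is illustrative, and optional headers count against the data field limit):

```python
# Sketch: total FC frame size from the per-field lengths.
SOF, HEADER, CRC, EOF = 4, 24, 4, 4
MAX_DATA_FIELD = 2112  # optional headers + payload + fill bytes

def frame_size(payload_len, optional_headers=0):
    if payload_len + optional_headers > MAX_DATA_FIELD:
        raise ValueError("data field exceeds 2112 bytes")
    return SOF + HEADER + optional_headers + payload_len + CRC + EOF

max_frame = frame_size(MAX_DATA_FIELD)  # largest frame, no optional headers
```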
[Figure 5-24 depicts the FC Header format: R_CTL, D_ID, CS_CTL/Priority, S_ID, Type, F_CTL, SEQ_ID, DF_CTL, SEQ_CNT, OX_ID, RX_ID, and Parameter fields.]
Routing Control (R_CTL): 1 byte long. It contains two sub-fields: Routing and Information. The Routing sub-field is 4 bits and indicates whether the frame is a data frame or link-control frame. This aids the receiving node in routing the frame to the appropriate internal process. Two types of data frames can be indicated: frame type zero (FT_0) and frame type one (FT_1). Two types of link-control frames can be indicated: Acknowledge (ACK) and Link_Response. The value of the Routing sub-field determines how the Information sub-field and Type field are interpreted. The Information sub-field is 4 bits. It indicates the category of data contained within a data frame or the specific type of control operation contained within a link-control frame.
Source ID (S_ID): 3 bytes long. It contains the FC Address Identifier of the source node.
may initiate a new sequence. Either the initiator or target possesses the sequence initiative at each point in time. In FC vernacular, streamed sequences are simultaneously outstanding sequences transmitted during a single possession of the sequence initiative, and consecutive non-streamed sequences are successive sequences transmitted during a single possession of the sequence initiative. If a device transmits only one sequence during a single possession of the sequence initiative, that sequence is simply called a sequence.
Data Field Control (DF_CTL): 1 byte long. It indicates the presence or absence of each optional header. In the case of the Device Header, this field also indicates the size of the optional header.
Responder Exchange ID (RX_ID): 2 bytes long. It is similar to the OX_ID field but is assigned by the target.
Parameter: 4 bytes long. When the Routing sub-field of the R_CTL field indicates a control frame, the Parameter field contains operation-specific control information. When the Routing sub-field of the R_CTL field indicates a data frame, the interpretation of this field is determined by the Relative Offset Present bit in the F_CTL field. When the Relative Offset Present bit is set to 1, this field indicates the position of the first byte of data carried in this frame relative to the first byte of all the data transferred by the associated SCSI command. This facilitates payload segmentation and reassembly. When the Relative Offset Present bit is set to 0, this field may contain ULP parameters that are passed to the ULP indicated in the Type field.
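The relative-offset mechanism just described is what makes reassembly order-independent: each frame's payload carries its own position within the full transfer. A minimal sketch (the helper is illustrative, not FCP code):

```python
# Sketch: payload reassembly using the Parameter field as a relative
# offset. Frames may arrive in any order; each offset places the
# frame's payload within the complete SCSI data transfer.

def reassemble(frames, total_len):
    """frames: iterable of (relative_offset, payload_bytes)."""
    buf = bytearray(total_len)
    for offset, payload in frames:
        buf[offset:offset + len(payload)] = payload
    return bytes(buf)

# Out-of-order arrival: the second fragment is processed first.
data = reassemble([(4, b"WXYZ"), (0, b"ABCD")], 8)
```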
The S_ID, D_ID, OX_ID, and RX_ID fields are collectively referred to as the fully qualified exchange identifier (FQXID). The S_ID, D_ID, OX_ID, RX_ID, and SEQ_ID fields are collectively referred to as the sequence qualifier. The fields of the sequence qualifier can be used together in several ways. The preceding descriptions of these fields are highly simplified and apply only to FCP. FC implements many control frames to facilitate link, fabric, and session management. Many of the control frames carry additional information within the Data field. Comprehensive exploration of all the control frames and their payloads is outside the scope of this book, but certain control frames are explored in subsequent chapters. For more information about the general FC frame format, readers are encouraged to consult the ANSI T11 FC-FS-2 specification. For more information about control frame formats, readers are encouraged to consult the ANSI T11 FC-FS-2, FC-SW-3, FC-GS-3, FC-LS, and FC-SP specifications.
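One way to picture the FQXID and sequence qualifier is as compound lookup keys built from header fields. A minimal sketch (field values are hypothetical; the tuple representation is illustrative):

```python
# Sketch: the FQXID and sequence qualifier as compound keys.
# A receiver can use such keys to associate a frame with its
# Exchange and sequence context.

def fqxid(header):
    return (header["s_id"], header["d_id"], header["ox_id"], header["rx_id"])

def sequence_qualifier(header):
    return fqxid(header) + (header["seq_id"],)

header = {"s_id": 0x0A012C, "d_id": 0x0B0200,
          "ox_id": 0x1001, "rx_id": 0x2002, "seq_id": 7}
```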
FC Delivery Mechanisms
Like Ethernet, FC supports several delivery mechanisms. Each set of delivery mechanisms
is called a class of service (CoS). Currently, there are six CoS denitions:
NOTE
Class 5 was abandoned before completion. Class 5 was never included in any ANSI
standard.
Classes 1, 2, 3, 4, and 6 are referred to collectively as Class N services; the N stands for node. The F in Class F stands for fabric because Class F traffic can never leave the fabric. In other words, Class F traffic can never be accepted from or transmitted to a node and may be exchanged only between fabric infrastructure devices such as switches and bridges. FC devices are not required to support all six classes. Classes 1, 4, and 6 are not currently supported on any modern FC switches. Classes 2 and 3 are supported on all modern FC switches. Class 3 is currently the default service on all modern FC switches, and most FC-SANs operate in Class 3 mode. Class F support is mandatory on all FC switches.
Class 1 provides a dedicated circuit between two end nodes (conceptually similar to ATM
CES). Class 1 guarantees full bandwidth end-to-end. Class 2 provides reliable delivery
without requiring a circuit to be established. All delivered frames are acknowledged, and
all delivery failures are detected and indicated to the source node. Class 3 provides
unreliable delivery that is roughly equivalent to Ethernet Type 1 service. Class 4 provides
a virtual circuit between two end nodes. Class 4 is similar to Class 1 but guarantees only
fractional bandwidth end-to-end. Class 6 essentially provides multiple Class 1 circuits
between a single initiator and multiple targets. Only the initiator transmits data frames, and
targets transmit acknowledgements. Class F is essentially the same as Class 2 but is
reserved for fabric-control traffic.
Class 3 is currently the focus of this book. The following paragraphs describe Class 3 in terms applicable to all ULPs. For details of how Class 3 delivery mechanisms are used by FCP, see Chapter 8, "OSI Session, Presentation, and Application Layers." Class 3 implements the following delivery mechanisms:
Destination nodes can detect frames dropped in transit. This is accomplished via the SEQ_CNT field and the error detect time-out value (E_D_TOV). When a drop is detected, all subsequently received frames within that sequence (and possibly within that exchange) are discarded, and the ULP within the destination node is notified of the error. The source node is not notified. Source node notification is the responsibility of the ULP. For this reason, ULPs that do not implement their own delivery failure notification or delivery acknowledgement schemes should not be deployed in Class 3 networks. (FCP supports delivery failure detection via timeouts and Exchange status monitoring.) The frames of a sequence are buffered to be delivered to the ULP as a group. So, it is not possible for the ULP to receive only part of a sequence. The decision to retransmit just the affected sequence or the entire Exchange is made by the ULP within the initiator before originating the Exchange. The decision is conveyed to the target via the Abort Sequence Condition sub-field in the F_CTL field in the FC Header. It is called the Exchange Error Policy.
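The SEQ_CNT-based drop detection described above amounts to watching for a gap in a per-sequence counter. A minimal sketch (the receiver model is illustrative, not an HBA implementation):

```python
# Sketch: Class 3 drop detection via SEQ_CNT. Within a sequence,
# SEQ_CNT increments by one per frame; a gap means a drop, after
# which remaining frames of the sequence are discarded and the
# ULP in the destination node is notified (the source is not).

def receive_sequence(seq_cnts):
    """Return (frames_accepted, drop_detected) for one sequence."""
    expected = 0
    for seq_cnt in seq_cnts:
        if seq_cnt != expected:
            return expected, True  # discard the rest, notify the ULP
        expected += 1
    return expected, False
```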
Destination nodes can detect duplicate frames. However, the current specifications do not explicitly state how duplicate frames should be handled. Duplicates can result only from actions taken by a sequence initiator or from frame forwarding errors within the network. A timer, the resource allocation time-out value (R_A_TOV), is used to avoid transmission of duplicate frames after a node is unexpectedly reset. However, a node that has a software bug, virus, or other errant condition could transmit duplicate frames. It is also possible for frame-forwarding errors caused by
FC devices can detect corrupt frames via the CRC field. Upon detection of a corrupt frame, the frame is dropped. If the frame is dropped by the destination node, the ULP is notified within the destination node, but the source node is not notified. If the frame is dropped by a switch, no notification is sent to the source or destination node. Some FC switches employ cut-through switching techniques and are unable to detect corrupt frames. Thus, corrupt frames are forwarded to the destination node and subsequently dropped. All FC switches produced by Cisco Systems employ a store-and-forward architecture capable of detecting and dropping corrupt frames.
Retransmission is not supported. (Note that some other Classes of Service support
retransmission.) ULPs are expected to retransmit any data lost in transit. FCP
supports retransmission. Likewise, SCSI supports retransmission by reissuing failed
commands.
The specifications do not define methods for fragmentation or reassembly because the necessary header fields do not exist. An MTU mismatch results in frame drop. To avoid MTU mismatches, end nodes discover the MTU of intermediate network links via fabric login (FLOGI) during link initialization. A single MTU value is provided to end nodes during FLOGI, so all network links must use a common MTU size. End nodes also exchange MTU information during PLOGI (see Chapter 7, "OSI Transport Layer"). Based on this information, transmitters do not send frames that exceed the MTU of any intermediate network link or the destination node.
NOTE
Note that the ability of a destination node to reorder frames is present in every CoS because the Sequence Qualifier fields and SEQ_CNT field are contained in the general header format used by every CoS. However, the requirement for a recipient to reorder frames is established per CoS. This contrasts with the IP model, wherein each transport layer protocol uses a different header format. Thus, in the IP model, the choice of transport layer protocol determines the recipient's ability and requirement to reorder packets.
FC Link Aggregation
Currently, no standard exists for aggregation of multiple FC links into a port channel. Consequently, some FC switch vendors have developed proprietary methods. Link aggregation between FC switches produced by different vendors is possible, but functionality is limited by the dissimilar nature of the load-balancing algorithms. No FC switch vendors currently allow port channels between heterogeneous switches. Cisco Systems supports FC port channels in addition to automation of link aggregation. Automation of link aggregation is accomplished via Cisco's FC Port Channel Protocol (PCP). PCP is functionally similar to LACP and PAgP. PCP employs two sub-protocols: the bringup protocol and the autocreation protocol. The bringup protocol validates the configuration of the ports at each end of an ISL (for compatibility) and synchronizes Exchange status across each ISL to ensure symmetric data flow. The autocreation protocol aggregates compatible ISLs into a port channel. The full details of PCP have not been published, so further disclosure of PCP within this book is not possible. As with Ethernet, network administrators must be wary of several operational requirements. The following restrictions apply to FC port channels connecting two switches produced by Cisco Systems:
All links in a port channel must connect a single pair of devices. In other words, only point-to-point configurations are permitted.
All links in a port channel must operate at the same transmission rate.
All links in a non-trunking port channel must belong to the same VSAN.
If any link in a port channel is configured as non-trunking, all links in that port channel must be configured as non-trunking. Likewise, if any link in a port channel is configured as trunking, all links in that port channel must be configured as trunking.
All links in a trunking port channel must trunk the same set of VSANs.
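The restrictions above can be captured as a validation routine. A minimal sketch (the link attribute names and return strings are hypothetical, not any switch CLI output):

```python
# Sketch: validate candidate ISLs against the port-channel
# restrictions listed above (uniform rate, uniform trunking mode,
# consistent VSAN membership).

def validate_port_channel(links):
    rates = {l["rate"] for l in links}
    trunking = {l["trunking"] for l in links}
    if len(rates) != 1:
        return "links must operate at the same transmission rate"
    if len(trunking) != 1:
        return "links must be uniformly trunking or non-trunking"
    if trunking == {False} and len({l["vsan"] for l in links}) != 1:
        return "non-trunking links must belong to the same VSAN"
    if trunking == {True} and \
            len({tuple(sorted(l["vsans"])) for l in links}) != 1:
        return "trunking links must trunk the same set of VSANs"
    return "ok"

links = [{"rate": 4, "trunking": False, "vsan": 10},
         {"rate": 4, "trunking": False, "vsan": 10}]
```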
The first two restrictions also apply to other FC switch vendors. The three VSAN-related restrictions only apply to Cisco Systems because VSANs are currently supported only by Cisco Systems. Several additional restrictions that do not apply to Cisco Systems do apply to other FC switch vendors. For example, one FC switch vendor mandates that only contiguous ports can be aggregated, and distance limitations apply because of the possibility of out-of-order frame delivery. Similar to Ethernet, the maximum number of links that may be grouped into a single port channel and the maximum number of port channels that may be configured on a single switch are determined by product design. Cisco Systems supports 16 links per FC port channel and 128 FC port channels per switch. These numbers currently exceed the limits of all other FC switch vendors.
FC Link Initialization
When an FC device is powered on, it begins the basic FC link initialization procedure. Unlike Ethernet, the media type is irrelevant to basic FC link initialization procedures. Like Ethernet, FC links may be manually configured or dynamically configured via
auto-negotiation. 10GFC does not currently support auto-negotiation. Most HBAs and switch ports default to auto-negotiation mode. FC auto-negotiation is implemented in a peer-to-peer fashion. Following basic FC link initialization, one of several extended FC link initialization procedures occurs. The sequence of events that transpires is determined by the device types that are connected. The sequence of events differs for node-to-node, node-to-switch, switch-to-switch, and switch-to-bridge connections. Node-to-node connections are used for DAS configurations and are not discussed in this book.
The following basic FC link initialization procedure applies to all FC device types. Three state machines govern the basic FC link-initialization procedure: speed negotiation state machine (SNSM), loop port state machine (LPSM), and FC_Port state machine (FPSM). The SNSM executes first, followed by the LPSM, followed by the FPSM. This book does not discuss the LPSM. When a port (port A) is powered on, it starts its receiver transmitter time-out value (R_T_TOV) timer and begins transmitting OLS at its maximum supported transmission rate. If no receive signal is detected before R_T_TOV expiration, port A begins transmitting NOS at its maximum supported transmission rate and continues until another port is connected and powered on. When another port (port B) is connected and powered on, auto-negotiation of the transmission rate begins. The duplex mode is not auto-negotiated because switch-attached FC devices always operate in full-duplex mode. Port B begins transmitting OLS at its maximum supported transmission rate. Port A continues transmitting NOS at its maximum supported transmission rate. This continues for a specified period of time, then each port drops its transmission rate to the next lower supported rate and continues transmitting for the same period of time. This cycle repeats until a transmission rate match is found or all supported transmission rates (up to a maximum of four) have been attempted by each port.
During each transmission rate cycle, each port attempts to achieve bit-level synchronization and word alignment at each of its supported reception rates. Reception rates are cycled at least five times as fast as transmission rates so that five or more reception rates can be attempted during each transmission cycle. Each port selects its transmission rate based on the highest reception rate at which word alignment is achieved and continues transmission of OLS/NOS at the newly selected transmission rate. When both ports achieve word alignment at the new reception rate, auto-negotiation is complete. When a port is manually configured to operate at a single transmission rate, auto-negotiation remains enabled, but only the configured transmission rate is attempted. Thus, the peer port can achieve bit-level synchronization and word alignment at only one rate. If the configured transmission rate is not supported by the peer device, the network administrator must intervene. After auto-negotiation successfully completes, both ports begin listening for a Primitive Sequence.
Upon recognition of three consecutive OLS ordered sets without error, port A begins transmitting LR. Upon recognition of three consecutive LR ordered sets without error, port B begins transmitting LRR to acknowledge recognition of the LR ordered sets. Upon recognition of three consecutive LRR ordered sets without error, port A begins transmitting Idle ordered sets. Upon recognition of the first Idle ordered set, port B begins transmitting Idle ordered sets. At this point, both ports are able to begin normal communication.
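The rate-matching cycle described above, stripped of its timing details, reduces to finding the highest transmission rate both ports support. A minimal sketch (rates in Gbps are hypothetical examples; the real SNSM negotiates via OLS/NOS cycling rather than a direct comparison):

```python
# Sketch: the net result of FC speed negotiation — the highest
# transmission rate common to both ports, or failure if none.

def negotiate_rate(port_a_rates, port_b_rates):
    """Return the highest common rate, or None if no match exists."""
    common = set(port_a_rates) & set(port_b_rates)
    return max(common) if common else None

rate = negotiate_rate({1, 2, 4}, {2, 4})
```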
NOTE
Cisco Systems supports a feature called switch port analyzer (SPAN) on its Ethernet and FC switches. On FC switches, SPAN makes use of SPAN destination (SD) and SPAN trunk (ST) ports. These port types are currently proprietary to Cisco Systems. The SD port type and the SPAN feature are discussed in Chapter 14, "Storage Protocol Decoding and Analysis."
When an N_Port is attached to a switch port, the extended link initialization procedure
known as FLOGI is employed. FLOGI is mandatory for all N_Ports regardless of CoS, and
communication with other N_Ports is not permitted until FLOGI completes. FLOGI is
accomplished with a single-frame request followed by a single-frame response. In switched
FC environments, FLOGI accomplishes the following tasks:
Provides the operating characteristics of the entire network to the requesting N_Port
Assigns an FCID to the requesting N_Port
Initializes the BB_Credit mechanism for link-level flow control
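The request/response shape of FLOGI can be sketched in a few lines: the N_Port sends FLOGI with S_ID 0 (it has no FCID yet), and the switch's LS_ACC conveys the newly assigned FCID in the D_ID field. A minimal sketch (frame representation, PWWN, and FCID values are hypothetical):

```python
# Sketch: the FLOGI exchange — S_ID 0 in the request, assigned
# FCID returned via the D_ID field of the LS_ACC response.

def build_flogi(pwwn):
    return {"command": "FLOGI", "s_id": 0x000000, "port_name": pwwn}

def switch_accept(request, assigned_fcid):
    # FLOGI is sent before any FCID exists, so S_ID must be 0.
    assert request["s_id"] == 0x000000
    return {"command": "LS_ACC", "d_id": assigned_fcid}

request = build_flogi(pwwn=0x21000000C9ABCDEF)
response = switch_accept(request, assigned_fcid=0x0A0100)
```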
The ANSI T11 specifications do not explicitly state a required minimum or maximum number of Idles that must be transmitted before transmission of the FLOGI request. So, the amount of delay varies widely (between 200 microseconds and 1500 milliseconds) from one HBA model to the next. When the N_Port is ready to begin the FLOGI procedure, it transmits a FLOGI ELS frame with the S_ID field set to 0. Upon recognition of the FLOGI request, the switch port assumes the role of F_Port and responds with a FLOGI Link Services Accept (LS_ACC) ELS frame. The FLOGI LS_ACC ELS frame specifies the N_Port's newly assigned FCID via the D_ID field. Upon recognition of the FLOGI LS_ACC ELS frame, the FLOGI procedure is complete, and the N_Port is ready to communicate with other N_Ports. The FLOGI ELS and associated LS_ACC ELS use the same frame format, which is a standard FC frame containing link parameters in the data field. Figure 5-25 illustrates the data field format of an FLOGI/LS_ACC ELS frame.
Figure 5-25 Data Field Format of an FC FLOGI/LS_ACC ELS Frame
[Figure 5-25 depicts the fields of the FLOGI/LS_ACC payload: LS Command Code, Common Service Parameters, N_Port Name, Node Name/Fabric Name, class-specific Service Parameters, Services Availability, Login Extension Data Length, and Login Extension Data.]
LS Command Code: 4 bytes long. It contains the 1-byte FLOGI command code (0x04) followed by 3 bytes of zeros when transmitted by an N_Port. This field contains the 1-byte LS_ACC command code (0x02) followed by 3 bytes of zeros when transmitted by an F_Port.
N_Port Name: 8 bytes long. It contains the PWWN of the N_Port. This field is not used by the responding F_Port.
Node Name/Fabric Name: 8 bytes long. It contains the NWWN associated with the N_Port (FLOGI) or the switch (LS_ACC).
Class 1/6, 2, 3, and 4 Service Parameters: each is 16 bytes long. They contain class-specific parameters that affect network operation. Key parameters relevant to Class 3 include indication of support for Class 3, in-order delivery, priority/preemption, CS_CTL preference, DiffServ, and clock synchronization. Some parameters can be manually configured by the network administrator. If manually configured, only the values configured by the administrator will be advertised to the peer device.
Login Extension Data Length: 4 bytes long. It indicates the length of the Login Extension Data field expressed in 4-byte words.
Login Extension Data: 120 bytes long. It contains the vendor identity and other vendor-specific information.
If the operating characteristics of an N_Port change after the N_Port completes FLOGI, the N_Port can update the switch via the FDISC ELS command. The FDISC ELS and associated LS_ACC ELS use the exact same frame format as FLOGI. The meaning of each field is also identical. The LS Command Code field contains the FDISC command code (0x51). The FDISC ELS enables N_Ports to update the switch without affecting any sequences or exchanges that are currently open. For the new operating characteristics to take effect, the N_Port must log out of the fabric and perform FLOGI again. An N_Port may also use FDISC to request assignment of additional FC Address Identifiers.
When a switch port is attached to another switch port, the switch port mode initialization state machine (SPMISM) governs the extended link initialization procedure. The SPMISM cannot be invoked until the LPSM and FPSM determine that there is no FC-AL or N_Port attached. Because the delay between basic link initialization and FLOGI request transmission is unspecified, each switch vendor must decide how its switches will determine whether an FC-AL or N_Port is attached to a newly initialized link. All FC switches produced by Cisco Systems wait 700 ms after link initialization for a FLOGI request. If no FLOGI request is received within that time, the LPSM and FPSM relinquish control to the SPMISM.
All FC switches behave the same once the SPMISM takes control. An exchange link parameters (ELP) SW_ILS frame is transmitted by one of the connected switch ports (the requestor). Upon recognition of the ELP SW_ILS frame, the receiving switch port (the responder) transmits an SW_ACC SW_ILS frame. Upon recognition of the SW_ACC SW_ILS frame, the requestor transmits an ACK frame. The ELP SW_ILS and SW_ACC SW_ILS both use the same frame format, which is a standard FC frame containing link parameters in the data field. The data field of an ELP/SW_ACC SW_ILS frame is illustrated in Figure 5-26.
Figure 5-26 Data Field Format of an FC ELP/SW_ACC SW_ILS Frame
(Word 0: SW_ILS Command Code; Word 1: Revision, Flags, BB_SC_N; Word 2: R_A_TOV; Word 3: E_D_TOV; remaining words: port names, reserved fields, and flow-control parameters.)
NOTE
Each SW_ILS command that expects an SW_ACC response defines the format of the SW_ACC payload. Thus, there are many SW_ACC SW_ILS frame formats.
SW_ILS Command Code: 4 bytes long. This field contains the ELP command code (0x10000000) when transmitted by a requestor. This field contains the SW_ACC command code (0x02000000) when transmitted by a responder.
BB_SC_N: 1 byte long. It indicates the BB_SC interval. The value of this field is meaningful only if the ISL Flow Control Mode field indicates that the R_RDY mechanism is to be used.
E_D_TOV: 4 bytes long. It indicates the transmitter's required value for error-detection timeout. All devices within a physical SAN or VSAN must agree upon a common E_D_TOV. Some FC switches allow the network administrator to configure this value manually.
Flow Control Parameter Length: 2 bytes long. It indicates the length of the Flow Control Parameters field expressed in bytes.
ISL Flow Control Mode: 2 bytes long. It indicates whether the R_RDY mechanism or a vendor-specific mechanism is supported. On some FC switches, the flow-control mechanism is determined by the switch operating mode (native or interoperable).
Following ELP, the ISL is reset to activate the new operating parameters. The ELP requestor begins transmitting LR ordered sets. Upon recognition of three consecutive LR ordered sets without error, the ELP responder begins transmitting LRR to acknowledge recognition of the LR ordered sets. Upon recognition of three consecutive LRR ordered sets without error, the ELP requestor begins transmitting Idle ordered sets. Upon recognition of the first Idle ordered set, the ELP responder begins transmitting Idle ordered sets. At this point, the switches are ready to exchange information about the routing protocols that they support via the exchange switch capabilities (ESC) procedure. The ESC procedure is optional, but all modern FC switches perform ESC. The ELP requestor transmits an ESC SW_ILS frame with the S_ID and D_ID fields each set to 0xFFFFFD. The ESC payload contains a list of routing protocols supported by the transmitter. Upon recognition of the ESC SW_ILS frame, the receiver selects a single routing protocol and transmits an SW_ACC SW_ILS frame indicating its selection in the payload. The S_ID and D_ID fields of the SW_ACC SW_ILS frame are each set to 0xFFFFFD. The ESC SW_ILS frame format is a standard FC frame containing a protocol list in the data field. The data field of an ESC SW_ILS frame is illustrated in Figure 5-27.
Figure 5-27 Data Field Format of an FC ESC SW_ILS Frame
(Word 0: SW_ILS Command Code, Reserved, Payload Length; Words 1-2: Vendor ID String; Words 3 and following: Protocol Descriptors.)
SW_ILS Command Code: 1 byte long. It contains the first byte of the ESC command code (0x30). The first byte of the ESC command code is unique to the ESC command, so the remaining 3 bytes are truncated.
Vendor ID String: 8 bytes long. It contains the unique vendor identification string assigned by ANSI T10 to the manufacturer of the transmitting switch.
Payload Length: 2 bytes long. It indicates the total length of all payload fields expressed in bytes.
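The responder's selection step described above can be sketched as a simple matching routine. This is illustrative only: the protocol descriptor strings are hypothetical, and the selection policy shown here (first match in the order offered) is an assumption, since real switches apply their own preferences.

```python
def select_routing_protocol(offered, supported):
    """ESC responder logic: choose one routing protocol from the
    requestor's list that the local switch also supports. The chosen
    protocol is what would be returned in the SW_ACC payload."""
    for protocol in offered:
        if protocol in supported:
            return protocol
    return None  # no protocol in common

# Hypothetical descriptor strings for illustration.
choice = select_routing_protocol(["FSPF", "VendorX-Routing"], {"FSPF"})
print(choice)  # FSPF
```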
SW_ILS Command Code: 1 byte long. It contains the first byte of the SW_ACC command code (0x02). The first byte of the SW_ACC command code is unique to the SW_ACC command, so the remaining 3 bytes are truncated.
Vendor ID String: 8 bytes long. It contains the unique vendor identification string assigned by ANSI T10 to the manufacturer of the transmitting switch.
Following ESC, the switch ports optionally authenticate each other (see Chapter 12, Storage Network Security). The port-level authentication procedure is relatively new. Thus, few modern FC switches support port-level authentication. That said, all FC switches produced by Cisco Systems support port-level authentication. Upon successful authentication (if supported), the ISL becomes active. Next, the PSS process ensues, followed by domain address assignment. After all Domain_IDs have been assigned, the zone exchange and merge procedure begins. Next, the FSPF routing protocol converges. Finally, RSCNs are generated. To summarize:
1 ELP is exchanged.
2 The link is reset.
3 ESC is performed.
4 Port-level authentication is optionally performed.
5 The ISL becomes active.
6 The PSS process ensues.
7 Domain_IDs are assigned.
8 Zones are exchanged and merged.
9 FSPF converges.
10 RSCNs are generated.
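Because the ordering of these stages is fixed, it can be encoded as a simple table-driven progression. The sketch below is a minimal illustration of that ordering, not a real port state machine; the stage names are informal.

```python
# E_Port bring-up stages in the order described in the text.
ISL_BRINGUP = (
    "ELP exchange",
    "link reset",
    "ESC exchange",
    "port-level authentication (optional)",
    "ISL active",
    "principal switch selection",
    "Domain_ID assignment",
    "zone merge",
    "FSPF convergence",
    "RSCN generation",
)

def next_stage(current):
    """Return the stage that follows `current`, or None after the last."""
    index = ISL_BRINGUP.index(current)
    return ISL_BRINGUP[index + 1] if index + 1 < len(ISL_BRINGUP) else None

print(next_stage("ESC exchange"))  # port-level authentication (optional)
```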
When a switch port is attached to a bridge port, the switch-to-switch extended link
initialization procedure is followed, but ELP is exchanged between the switch port and the
bridge port. An equivalent SW_ILS, called exchange B_access parameters (EBP), is
exchanged between the bridge ports across the WAN. For details about the EBP SW_ILS,
see Chapter 8, OSI Session, Presentation, and Application Layers. Bridge ports are
transparent to all inter-switch operations after ELP. Following ELP, the link is reset. ESC is
then performed between the switch ports. Likewise, port-level authentication is optionally
performed between the switch ports. The resulting ISL is called a virtual ISL (VISL).
Figure 5-29 illustrates this topology.
Any SW_ILS command may be rejected by the responding port via the switch internal link service reject (SW_RJT). A common SW_RJT format is used for all SW_ILS. Figure 5-30 illustrates the data field format of an SW_RJT frame.
Figure 5-29 VISL Between FC Switches Connected via Bridge Devices
(ELP runs between each FC switch E_Port and the B_Port of its attached bridge device; EBP runs between the B_Ports across the non-FC network; the resulting logical connection between the two E_Ports is the VISL.)
(Word 0: SW_ILS Command Code; Word 1: Reserved, Reason Code, Reason Code Explanation, Vendor Specific.)
SW_ILS Command Code: 4 bytes long. This field contains the SW_RJT command code (0x01000000).
Vendor Specific: 1 byte long. When the Reason Code field is set to 0xFF, this field provides a vendor-specific reason code. When the Reason Code field is set to any value other than 0xFF, this field is ignored.
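The Vendor Specific rule above is easy to misread, so a small decoder helps. The reason-code descriptions below are limited to those named in this section; this is a sketch, not a complete FC-SW reject decoder.

```python
# Reason codes named in this section; FC-SW defines the full set.
SW_RJT_REASONS = {
    0x03: "Logical Error",
    0x05: "Logical Busy",
    0x07: "Protocol Error",
    0x0C: "Invalid Attachment",
    0xFF: "Vendor Specific",
}

def describe_sw_rjt(reason_code, vendor_specific):
    """The Vendor Specific byte is meaningful only when the
    Reason Code is 0xFF; otherwise it is ignored."""
    description = SW_RJT_REASONS.get(reason_code, "Unknown")
    if reason_code == 0xFF:
        return f"{description} (vendor code 0x{vendor_specific:02X})"
    return description

print(describe_sw_rjt(0x07, 0x42))  # Protocol Error
print(describe_sw_rjt(0xFF, 0x42))  # Vendor Specific (vendor code 0x42)
```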
Table 5-11 SW_RJT Reason Codes

Code   Description
0x01   Invalid SW_ILS Command Code
0x02   Invalid Revision Level
0x03   Logical Error
0x04   Invalid Payload Size
0x05   Logical Busy
0x07   Protocol Error
0x09   Unable To Perform Command Request
0x0B   Command Not Supported
0x0C   Invalid Attachment
0xFF   Vendor Specific

Table 5-12 SW_RJT Reason Code Explanations

Code   Description
0x00   No Additional Explanation
0x01   Class F Service Parameter Error
0x03   Class N Service Parameter Error
0x04   Unknown Flow Control Code
0x05   Invalid Flow Control Parameters
0x0D   Invalid Port_Name
0x0E   Invalid Switch_Name
0x0F   R_A_TOV Or E_D_TOV Mismatch
0x10   Invalid Domain_ID_List
0x19   Command Already In Progress
0x29   Insufficient Resources Available
0x2A   Domain_ID Not Available
0x2B   Invalid Domain_ID
0x2C   Request Not Supported
0x2D   Link Parameters Not Yet Established
0x2E   Requested Domain_IDs Not Available
0x2F   E_Port Is Isolated
0x31   Authorization Failed
0x32   Authentication Failed
0x33   Incompatible Security Attribute
0x34   Checks In Progress
0x35   Policy Summary Not Equal
0x36   FC-SP Zoning Summary Not Equal
0x41   Invalid Data Length
0x42   Unsupported Command
0x44   Not Authorized
0x45   Invalid Request
0x46   Fabric Changing
0x47   Update Not Staged
0x48   Invalid Zone Set Format
0x49   Invalid Data
0x4A   Unable To Merge
0x4B   Zone Set Size Not Supported
0x50
0x58
The preceding descriptions of the FC link initialization procedures are simplified for the sake of clarity. For more detail about Primitive Sequence usage, speed negotiation states, the FPSM, port types, the SPMISM, frame formats, or B_Port operation, readers are encouraged to consult the ANSI T11 FC-FS, FC-LS, and FC-SW specification series.
Summary
This chapter provides in-depth analysis of the physical layer and data-link layer technologies employed by the SPI, Ethernet, and FC. Many of the details provided are ancillary to design and troubleshooting efforts rather than daily network operation. That said, a thorough understanding of the details provided can enable network administrators to optimize their daily operations.
Though the SPI is waning in popularity, it is still important to understand that the capabilities and limitations of the SPI influenced the evolution of SCSI performance requirements that must now be met by modern storage networks built on Ethernet or FC. For this reason, the SPI is included in this chapter. Insight into the functional capabilities of Ethernet and FC relative to each other is provided to enable readers to properly employ each technology based on application requirements. Readers should remember that Ethernet is deployed in conjunction with TCP/IP to provide a complete solution. So, the limitations of Ethernet presented in this chapter should not be construed as barriers to deployment. Likewise, many of the capabilities of FC are not examined in this chapter. The subsequent chapters of Part II, OSI Layers, build upon this chapter to provide a complete picture of each solution.
Review Questions
1 What two components compose a SCSI logical unit?
2 Which SAM addressing construct facilitates communication with an object?
3 What is a standards body that defines an address format commonly called?
4 Which organization assigns organizationally unique identifiers (OUIs)?
5 If fragmentation occurs, where does it occur?
6 Is it possible for a network protocol to guarantee that frames or packets will never be dropped?
7 Why do most packet-switching technologies support bandwidth reservation schemes?
8 Does the SAM explicitly require in-order delivery of frames or packets composing a
initialization procedures?
23 What is the most common fiber-optic connector used by 2-Gbps FC devices?
24 What is the maximum operating range of 2-Gbps FC on 62.5 micron MMF?
25 How many 8B/10B control characters does 4-Gbps FC use?
26 What information does the NAA field of an FC WWN provide?
27 What do FC-attached SCSI initiators do following receipt of an RSCN frame?
28 Does the header of an FC frame provide functionality at OSI Layers other than the data-link layer?
29 Which FC CoS is used by default in most FC-SANs?
30 Does FC currently support automation of link aggregation?
31 What determines the sequence of events during extended FC link initialization?
Draw parallels between common data-link layer terms and network layer terms
Differentiate between the IP naming scheme and the SAM naming scheme
CHAPTER
Internet Protocol
IP is the most ubiquitous network layer protocol in the world. IP's robust support of data-link layer technologies, ability to service multiple transport layer protocols, and functional extensibility have contributed largely to its success. Leveraging IP to transport Small Computer System Interface (SCSI) was inevitable once SCSI was adapted to serial networking technologies.
IPv4 Overview
IP is an open protocol that facilitates network layer communication across internetworks in a packet-switched manner. One or more of the underlying data-link layer networks may operate in a circuit-switched manner, but IP operation at the network layer remains packet-switched. IP provides just the functionality required to deliver packets from a source device to a destination device. Even path determination is left to other network layer protocols called routing protocols (see Chapter 10, Routing and Switching Protocols). Thus, IP is a routed protocol. Likewise, IP relies on other network layer protocols for data confidentiality (see the IPsec section of Chapter 12, Storage Network Security) and control messaging (see the ICMP section of this chapter).
IPv4 is currently the most widely deployed version of IP. A newer version has been developed (IPv6), but it is not yet widely deployed. IP version numbers are somewhat misleading. The IP header contains a field that identifies the protocol version number. Valid values are 0 through 15. Version numbers 0 and 1 were once assigned to the first and second revisions of a protocol that combined transport layer and network layer functionality. With the third revision, the combined protocol was separated into TCP and IP. That revision of IP was assigned the next available version number (2). Two more revisions of IP were produced and assigned version numbers 3 and 4. When the next revision of IP was produced, it was decided that the version numbers should be reassigned. Version 4 was reassigned to the latest revision of IP, which was the fourth revision of IP as an independent protocol. That revision is now commonly called IPv4. Version number 0 was reserved for intuitive reasons. Version numbers 1 through 3 were left unassigned. Note the first three independent revisions of IP were unstable and never adopted. Because the original two revisions of IP were actually a combined protocol, those revisions were not considered during version-number reassignment. Thus, IPv4 is really the first version of IP ever deployed as a standard protocol.
IPv4 originally was adopted via RFC 760. IPv4 has since been revised, but the protocol
version number has not been incremented. Instead, the RFC numbering system is used to
identify the current revision of IPv4. RFC 791 is the current revision of IPv4. Subsequent
to IPv4 adoption, a completely different network layer protocol was developed by IETF and
assigned protocol version number 5. When development of the next generation of IP was
undertaken, the new protocol was assigned version number 6 (IPv6). Thus, IPv6 is really
the second version of IP ever deployed as a standard protocol. IPv6 was revised multiple
times before adoption and one time since adoption, but the protocol version number has not
been incremented. Like IPv4, the current revision of IPv6 is tracked via RFC numbers.
Originally adopted via RFC 1883, the current revision of IPv6 is RFC 2460.
Development of IPv6 was motivated primarily by the need to expand the address space of IPv4. To extend the life of IPv4 during development of IPv6, new techniques were developed to improve the efficiency of address consumption. Chief among these techniques are private addressing, Network Address Translation (NAT), and variable-length subnet masking (VLSM). These techniques have been so successful that they have significantly slowed adoption of IPv6. Because IPv6 is not yet used in storage networks, the remainder of this book focuses on IPv4 when discussing IP.
When discussing IP, it is customary to use the term interface rather than port. The purpose
is to distinguish between the network layer functionality (the interface) and the data-link
layer functionality (the port) in an IP-enabled device. For example, an Ethernet switch can
contain a logical IP interface that is used for management access via any Ethernet port. We
use customary IP terminology in this chapter when discussing switches or routers because
SAM terminology does not apply to networking devices. Networking devices do not
implement SCSI; they merely forward encapsulated SCSI packets. However, we use SAM
terminology in this chapter when discussing end nodes, to maintain consistency with
previous chapters. In the SAM context, an IP interface is a logical SAM port associated
with a physical SAM port (such as Ethernet). We discuss this topic in detail in the
addressing scheme section of this chapter.
The line between end nodes and networking devices is becoming fuzzy as hybrid devices
enter the storage market. Hybrid devices implement SCSI but are installed in networking
devices. For example, Cisco Systems produces a variety of storage virtualization modules
that are deployed in the MDS9000 family of switches. These virtualization modules should
be viewed as end nodes, whereas the MDS9000 switches should be viewed as networking
devices. Further discussion of hybrid devices is outside the scope of this book, but readers
should be aware of the distinction made herein to avoid confusion when planning the
deployment of hybrid devices.
Another difference in terminology between data-link layer technologies and network layer
technologies arises in the context of ow control and QoS. The term buffer is typically used
when discussing data-link layer technologies, whereas the term queue is typically used when
discussing network layer technologies. A buffer and a queue are essentially the same
thing. That said, network layer technologies implement queue management policies that
are far more sophisticated than data-link layer buffer management policies. For additional
architectural information about the TCP/IP suite, readers are encouraged to consult IETF
RFC 1180.
Data-Link Support
One of the most beneficial features of IP is its ability to operate on a very broad range of data-link layer technologies. The IETF is very diligent in adapting IP to new data-link layer technologies as they emerge. Of the many data-link layer technologies supported, the most commonly deployed are Ethernet, PPP, high-level data-link control (HDLC), frame relay, asynchronous transfer mode (ATM), and multiprotocol label switching (MPLS). Most Internet Small Computer System Interface (iSCSI) deployments currently employ Ethernet end-to-end. Because Fibre Channel over TCP/IP (FCIP) is a point-to-point technology, most current deployments employ PPP over time-division multiplexing (TDM) circuits for WAN connectivity. Though Internet Fibre Channel Protocol (iFCP) supports mesh topologies, most deployments are currently configured as point-to-point connections that employ PPP over TDM circuits for WAN connectivity. Thus, we discuss only Ethernet and PPP in this section.
Ethernet
As discussed in Chapter 3, Overview of Network Operating Principles, an Ethertype value is assigned to each protocol carried by Ethernet. Before the IEEE became the official authority for Ethertype assignments, Xerox Corporation fulfilled the role. Xerox assigned Ethertype 0x0800 to IP. When the IEEE took control, they listed many of the Ethertype values as assigned to Xerox. To this day, the IEEE listing of Ethertype assignments still shows 0x0800 assigned to Xerox. However, RFC 894 documents Ethertype 0x0800 as assigned to IP, and the public at large accepts RFC 894 as the final word on this issue. IANA maintains an unofficial listing of Ethertype assignments that properly shows 0x0800 assigned to IP. As discussed in Chapter 5, OSI Physical and Data-Link Layers, Address Resolution Protocol (ARP) is used to resolve IP addresses to Ethernet addresses. Xerox assigned Ethertype 0x0806 to ARP, but this is not documented via RFC 894, and the IEEE Ethertype listing shows Xerox as the assignee for this value. The unofficial IANA listing of Ethertype assignments properly shows 0x0806 assigned to ARP.
The Ethernet padding mechanism discussed in Chapter 5, OSI Physical and Data-Link Layers, is sometimes used for IP packets. The minimum size of an Ethernet frame is 64 bytes, but the 802.3-2002 header and trailer are only 18 bytes. The IP header (without options) is only 20 bytes, and there is no minimum length requirement for the payload of an IP packet. Some upper-layer protocols (ULPs) can generate IP packets with payloads less than 26 bytes. Examples include TCP during connection establishment and many ICMP messages. When this occurs, the minimum Ethernet frame length requirement is not met. So, Ethernet inserts padding when the IP packet is framed for transmission. The Ethernet padding is not part of the IP packet, so it does not affect any fields in the IP header.
PPP
In the early 1970s, IBM invented the Synchronous Data Link Control (SDLC) protocol to facilitate mainframe-to-peripheral communication. SDLC proved to be very effective, but it could not be used by open systems. So, IBM submitted SDLC to the ISO. In 1979, the ISO developed the HDLC protocol, which used SDLC's frame format, but differed from SDLC in operation. Like all protocols, HDLC has limitations. One of HDLC's limitations is lack of support for multiple ULPs. To address this and other shortcomings, the IETF developed PPP in 1989 based on HDLC. The original PPP frame format was derived from HDLC but was modified to support multiple ULPs. Each ULP is identified using an IANA-assigned 16-bit protocol number. The protocol number for IPv4 is 0x0021. PPP also enhances HDLC operationally in several ways. The most recent version of PPP is RFC 1661.
PPP is used on serial, point-to-point circuits that inherently provide in-order delivery. Modern storage networks that employ PPP typically do so on DS-1, DS-3, and Synchronous Optical Network (SONET) circuits. PPP consists of three components: a frame format definition, the Link Control Protocol (LCP), and a suite of Network Control Protocols (NCPs). The standard frame format can be used by all ULPs, or deviations from the standard frame format can be negotiated during connection establishment via LCP. LCP is also used to open and close connections, negotiate the MTU (can be symmetric or asymmetric), authenticate peer nodes, test link integrity, and detect configuration errors. Each ULP has its own NCP. An NCP is responsible for negotiating and configuring ULP operating parameters. The NCP for IP is called the IP Control Protocol (IPCP). IPCP was first defined in RFC 1134, which is the original PPP RFC. IPCP was later separated into its own RFC. The most recent version of IPCP is defined in RFC 1332. The PPP protocol number for IPCP is 0x8021. IPCP currently negotiates only four configuration parameters: header compression, IP address assignment, name-server assignment, and mobility. Header compression on PPP links is designed for low-speed circuits used for dialup connectivity. Header compression is rarely used on high-speed circuits such as DS-1 and above. IP address assignment, name-server assignment, and mobility are designed for end nodes that require remote access to a network. None of these options apply to FCIP or iFCP deployments.
The original PPP frame format is shown in Figure 6-1.
Figure 6-1 Original PPP Frame Format
(Flag, 1 byte; Address, 1 byte; Control, 1 byte; Protocol, 2 bytes; Information/Pad, variable; Frame Check Sequence, 2 or 4 bytes; Flag, 1 byte)
The original PPP frame format has since been modified as shown in Figure 6-2.
Figure 6-2 Modified PPP Frame Format
(Protocol, 1 or 2 bytes; Information/Pad, variable)
The new frame format allows PPP to be encapsulated easily within a wide variety of other data-link layer frames. This enables multi-protocol support in data-link layer technologies that do not natively support multiple ULPs. The new frame format also allows the leading byte of the protocol number to be omitted if the value is 0x00, which improves protocol efficiency. This improvement applies to IP. When PPP is used on DS-1, DS-3, and SONET circuits, HDLC framing is used to encapsulate the PPP frames. This is called PPP in HDLC-like Framing and is documented in RFC 1662. Some additional requirements apply to PPP when operating on SONET circuits, as documented in RFC 2615 and RFC 3255. Figure 6-3 illustrates PPP in HDLC-like Framing.
Figure 6-3 PPP in HDLC-like Framing
(Flag, 1 byte; Address, 1 byte; Control, 1 byte; Protocol, 1 or 2 bytes; Information/Pad, variable; Frame Check Sequence, 2 or 4 bytes; Flag, 1 byte)
Control: 1 byte long. It identifies the type of frame. For PPP in HDLC-like Framing, this field must contain the Unnumbered Information (UI) code 0x03.
Address: 1 byte long. It contains all ones, which is designated as the broadcast address. All PPP nodes must recognize and respond to the broadcast address.
Addressing Scheme
IP does not implement an equivalent to SAM device or port names. However, IP does
implement a naming mechanism that provides similar functionality to SAM port names under
certain circumstances. The primary objective of the Domain Name System (DNS) is to allow
a human-friendly name to be optionally assigned to each IP address. Doing so allows humans
to use the DNS name of a port or interface instead of the IP address assigned to the port or
interface. For example, DNS names usually are specied in web browsers rather than
IP addresses. Even though this is a major benefit, it is not the function that SAM port names provide.
Another benefit is that DNS names facilitate persistent identification of ports in the context of
dynamically assigned IP addresses. This is accomplished by statically assigning a name to each
port within the host operating system, then dynamically registering each port name in DNS
along with the IP address dynamically assigned to the port during the boot process. Each time
a new IP address is assigned, the port name is re-registered. This permits a single DNS name
to persistently identify a port, which is the function that SAM port names provide. However,
DNS names are assigned to IP addresses rather than to ports, and IP addresses are routinely
reassigned among many ports. Additionally, DNS name assignments are not guaranteed to
be permanent. For example, it is possible to change the DNS name of an IP address that is
assigned to a port and to reassign the old DNS name to a different IP address that is assigned
to a different port. After doing so, the old DNS name no longer identies the original port.
Thus, the SAM port-name objective is not met. The primary purpose of a SAM port name is
to positively and persistently identify a single port; therefore each name must be permanently
assigned to a single port. It is also possible for a port to be assigned multiple IP addresses.
Because each IP address is assigned its own DNS name, multiple DNS names can identify a
single port. All these factors illustrate why DNS names are not analogous to SAM port names.
Likewise, DNS names are not analogous to SAM device names. Each host is assigned a
host name in its operating system. The host name represents the host chassis and everything
contained within it. Host names can be extended to DNS by using the host name as the DNS
name of an IP address assigned to the host. That DNS name then represents the host chassis
and everything contained within it. As such, that DNS name represents a superset of a SAM
device name. Even when a DNS name represents just an IP address (not a host name) on a
single-port interface with only one IP address, the DNS name is not analogous to a SAM
device name. Because the primary purpose of a SAM device name is to positively and
persistently identify a single interface (network interface card [NIC] or host bus adapter
[HBA]), each name must be permanently assigned to a single interface. As outlined in the
previous paragraph, DNS names do not comply with this requirement.
IP implements the equivalent of SAM port identifiers via IP addresses, which are used to forward packets. In the context of IP storage (IPS) protocols, a node's IP address and data-link layer address both provide SAM port identifier functionality. An IP address facilitates end-to-end forwarding of packets, and a data-link layer address facilitates forwarding of frames between IP interfaces and ports across a single link (in the case of PPP) or multiple links (in the case of Ethernet).
The IPv4 address format is very flexible and well documented in many books, white papers, product manuals, and RFCs. So this section provides only a brief review of the IPv4 address format for readers who do not have a strong networking background. To understand the current IPv4 address format, first it is useful to understand the IPv4 address format used in early implementations. Figure 6-4 illustrates the IPv4 address format used in early implementations as defined in RFC 791.
Figure 6-4 Early IPv4 Address Format
(Network Number, 8, 16, or 24 bits; Rest, 24, 16, or 8 bits)
The Network Number field could be 1, 2, or 3 bytes long. It contained the address bits that routers used to forward packets. Three classes of networks were defined: A, B, and C. A class A network, identified by a single-byte network address, contained 16,777,216 addresses. A class B network was identified by a 2-byte network address. It contained 65,536 addresses. A class C network was identified by a 3-byte network address. It contained 256 addresses. The total length of an IPv4 address was always 4 bytes regardless of the class of network. In the first byte, a range of values was reserved for each class of network. This was known as a self-describing address format because the value of the network address described the class of the network. Self-describing addresses enabled IP routers to determine the correct number of bits (8, 16, or 24) to inspect when making forwarding decisions. This procedure was known as classful routing.
The Rest field contained the address bits of individual interfaces and ports on each network. The length of the Rest field was determined by the class of the network. A class A network used a 3-byte Rest field. A class B network used a 2-byte Rest field. A class C network used a 1-byte Rest field. In all network classes, the value 0 was reserved in the Rest field as the identifier of the network itself. Likewise, the value equal to all ones in the Rest field was reserved as the broadcast address of the network.
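The self-describing property can be expressed directly in code. The first-byte ranges below (0 to 127, 128 to 191, 192 to 223) are the standard class A, B, and C ranges; values of 224 and above fall outside the three classes the text describes and are returned as None here.

```python
def address_class(first_byte):
    """Classful interpretation of the first byte of an IPv4 address."""
    if first_byte <= 127:
        return "A"   # 1-byte network number
    if first_byte <= 191:
        return "B"   # 2-byte network number
    if first_byte <= 223:
        return "C"   # 3-byte network number
    return None      # outside classes A-C

def network_bits(first_byte):
    """Number of bits a classful router inspected when forwarding."""
    return {"A": 8, "B": 16, "C": 24}.get(address_class(first_byte))

print(address_class(10), network_bits(10))    # A 8
print(address_class(172), network_bits(172))  # B 16
print(address_class(200), network_bits(200))  # C 24
```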
This addressing scheme was called classful because the interpretation of addresses was determined by the network class. Network numbers were assigned by IANA to companies, government agencies, and other organizations as requested. As the assignment of network numbers continued, the Internet grew, and the limitations of classful addressing and routing were discovered. The primary concerns were the difficulty of scaling flat networks, the rate of address consumption, and degraded router performance. Organizations that had been assigned class A or B network numbers were struggling to scale their flat networks because of the effect of broadcast traffic. The finite supply of network numbers began to dwindle at an unforeseen pace, which raised concerns about future growth potential. Routing between organizations (a process called inter-domain routing) became problematic because route lookups and protocol convergence required more time to complete as routing tables grew. To resolve these issues, several changes were made to the addressing scheme, address allocation procedures, and routing processes.
One of the changes introduced was a method of segregating large networks into smaller networks called subnetworks or subnets. Subnetting limited the effect of broadcast traffic by creating multiple broadcast domains within each classful network. Before subnetting, each classful network represented a single IP broadcast domain. Class A and B networks experienced severe host and network performance degradation as the number of host attachments increased because of the effect of broadcast traffic.
Another challenge resulted from data-link layer limitations. The most popular LAN
technologies severely limited the number of hosts that could be attached to each segment.
To accommodate the number of IP addresses available in class A and B networks, multiple
LAN segments had to be bridged together. Disparate LAN technologies were often used,
which created bridging challenges. Subnetting mitigated the need for bridging by enabling
routers to connect the disparate LAN technologies.
Subnetting is accomplished via a masking technique. The subnet mask indicates the number of bits in the IPv4 address that make up the subnet address within each classful network. Subnet masking provides IP routers with a new way to determine the correct number of bits to inspect when making forwarding decisions. Initially, the subnet mask used within each classful network was fixed-length as described in RFC 950. In other words, a classful network could be subdivided into two subnets of equal size, or four subnets of equal size, or eight subnets of equal size, and so on. Figure 6-5 illustrates the subnetted classful IPv4 address format. Note that the total length of the address remained 4 bytes.
Figure 6-5 Subnetted Classful IPv4 Address Format (a Network Number field of 8 to 24 bits is followed by the Locally Assigned field, which is subdivided into a Subnet Number field and a Host Number field; the total length remains 32 bits)
Variable-length subnet masking (VLSM), defined in RFC 1009, later permitted extreme
granularity in the allocation of subnets within a network. VLSM resolves the
issues associated with scaling flat networks, but only partially resolves the issue of address
consumption. So, RFC 1174 and RFC 1466 changed the way network numbers are assigned.
Additionally, a method for translating addresses was defined in RFC 1631 (later updated by
RFC 3022), and specific network numbers were reserved for private use via RFC 1918.
To complement these changes, the concept of address masking was extended to the network
portion of the IPv4 address format via RFC 1338. This technique was originally called
supernetting, but was later renamed classless inter-domain routing (CIDR) in RFC 1519.
CIDR resolves the issues associated with degraded router performance by aggregating
multiple contiguous network routes into a summary route. CIDR-enabled routers provide
the same functionality as classful routers, but CIDR-enabled routers contain far fewer
routes in their routing tables. Eventually, CIDR replaced classful addressing and routing.
Today, IP addresses are called classless because the historically reserved network number
ranges in the first byte no longer have meaning. When an organization is assigned a network
number, it is also assigned a network prefix. A network prefix is conceptually similar to
a subnet mask, but it indicates the number of bits that make up the network number as
opposed to the number of bits that make up the network and subnet numbers. Prefix
granularity is achieved by supporting 1-bit increments. CIDR complements VLSM; an
organization may subnet its network if needed. Figure 6-6 illustrates the classless IPv4
address format. Note that the total length of the address is still 4 bytes.
Figure 6-6 Classless IPv4 Address Format (a variable-length Network Number field is followed by the Locally Assigned field, which may be subdivided into Subnet Number and Host Number fields; the total length remains 32 bits)
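The route aggregation that CIDR enables can be sketched with Python's standard ipaddress module. The four class C networks below are hypothetical examples chosen so that they are contiguous and collapse into a single summary route:

```python
import ipaddress

# Four contiguous class C networks (hypothetical) aggregate into a
# single /22 summary route, reducing routing table size by 3 entries.
routes = [ipaddress.ip_network(n) for n in
          ("192.168.4.0/24", "192.168.5.0/24",
           "192.168.6.0/24", "192.168.7.0/24")]
summary = list(ipaddress.collapse_addresses(routes))
print(summary[0])   # 192.168.4.0/22
```

This is precisely why CIDR-enabled routers carry far fewer routes than classful routers: one /22 entry forwards traffic for all four /24 networks.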
IPv4 addresses are expressed in dotted decimal notation such as 172.45.9.36. A network
prex also can be expressed in dotted decimal notation, but it is called a network mask or
netmask when expressed in this notation. The valid decimal values of a netmask are limited
to a specific set of numbers that includes 0, 128, 192, 224, 240, 248, 252, 254, and 255. This
results from the convention of masking network numbers in a bit-contiguous manner. In
other words, the bits that make up a network number are always the leftmost contiguous bits
of an IPv4 address. For example, the network number 172.45.8 is expressed as 172.45.8.0
netmask 255.255.248.0. All bit positions in the netmask that are set to 1 represent network
number bit positions in the IPv4 address. Thus, 172.45.9.0 netmask 255.255.255.0 indicates
the network number 172.45.9. Alternatively, 172.45.9.36 netmask 255.255.248.0
indicates IP address 172.45.9.36 within the 172.45.8 network. In the CIDR context, network
prefixes are expressed as /nn, where nn equals the number of leftmost contiguous bits in the
IPv4 address that compose the network number. For example, 172.45.8.0/21 is the network
prefix notation equivalent to 172.45.8.0 netmask 255.255.248.0. If subnetting is used within
a network, the netmask and network prefix must be increased by the assignee organization
to include the subnet bits within the Locally Assigned field of the IPv4 address. An extended
netmask is called a subnet mask. Likewise, an extended network prefix is called a subnet
prefix. To clarify the concepts introduced in this paragraph, Table 6-1 presents an example
of dotted decimal notation with equivalent dotted binary notation.
Table 6-1 Dotted Decimal Notation with Equivalent Dotted Binary Notation

                  Dotted Decimal    Dotted Binary
IPv4 Address      172.45.9.36       10101100.00101101.00001001.00100100
Netmask           255.255.248.0     11111111.11111111.11111000.00000000
Network Number    172.45.8.0        10101100.00101101.00001000.00000000
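The masking arithmetic behind Table 6-1 is a bitwise AND, which can be reproduced directly with Python's standard ipaddress module:

```python
import ipaddress

# The values from Table 6-1: an address, a /21 netmask, and the
# network number obtained by ANDing the two together.
address = int(ipaddress.ip_address("172.45.9.36"))
netmask = int(ipaddress.ip_address("255.255.248.0"))

network_number = ipaddress.ip_address(address & netmask)
print(network_number)    # 172.45.8.0

# A netmask's prefix length is its count of leftmost contiguous 1 bits.
prefix_length = bin(netmask).count("1")
print(prefix_length)     # 21
```

The AND clears the host bits (the third octet 9 AND 248 yields 8), which is exactly how a router extracts the network number before consulting its routing table.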
The preceding discussion of the IPv4 addressing scheme is highly simplified for the sake
of brevity. Comprehensive exploration of the IPv4 addressing scheme is outside the scope
of this book. For more information, readers are encouraged to consult IETF RFCs 791, 950,
1009, 1174, 1338, 1466, 1517, 1519, 1520, 1918, and 3022.
Each of the IPS protocols is responsible for implementing its own SAM name assignment and resolution
mechanisms (if required). Alternately, an external protocol may be leveraged for SAM
name assignment and resolution. Chapter 8, "OSI Session, Presentation, and Application
Layers," discusses the IPS protocols in detail.
As previously mentioned, DNS does not relate to SAM names. However, DNS is an integral
component of every IP network, so network administrators undertaking IPS protocol
deployment should have a basic understanding of DNS semantics and mechanics. Even
though it is possible for system administrators to create static name-to-address mappings in
the HOST table on each host, this is typically done only in special situations to accomplish
a particular goal. Most name resolution is accomplished dynamically via DNS. DNS
employs a hierarchical name space. Except for the lowest level, each level in the hierarchy
corresponds to an administrative domain or sub-domain. The lowest level corresponds to
individual nodes within a domain or sub-domain. The hierarchy appears as an inverted tree
when diagrammed. The root is represented by the "." symbol. The "." symbol also follows
the name of each level to signify the boundary between levels. For example, the DNS name
"www.cisco.com." indicates that the host port named "www" exists in the "cisco" domain,
which exists in the "com" domain, which exists in the root domain. Note that the "."
symbol does not precede "www", which indicates that "www" is a leaf (that is, an end
node) in the DNS tree. In practice, the root symbol is omitted because all top-level domains
(TLDs) inherently exist under the root, and TLD names are easily recognizable. The most
common TLD names are .com, .net, .org, .edu, .gov, and .mil, but others are defined. TLDs
are tightly restricted by IANA. The "." symbol is also omitted when referring to the name
of an individual level in the hierarchy. For example, "www" exists in the "cisco" domain.
Such name references are called unqualified names. A fully qualified domain name
(FQDN) includes the full path from leaf to root, such as "www.cisco.com."
DNS was designed to be extensible. DNS is used primarily to resolve names to IP
addresses, but DNS can be used for other purposes. Each datum associated with a name is
tagged to indicate the type of datum. New tags can be dened to extend the functionality
of DNS. Indeed, many new tags have been dened since the inception of DNS. The DNS
database is too large for the entire Internet to be served by a single DNS server. So, the
DNS database is designed to be distributed. Each organization is responsible for and has
authority over its own domain and sub-domains. This enables the overhead of creating and
deleting DNS records to be distributed among all DNS participants. Each organization's
DNS servers are authoritative for that organization's domain. As needed, each DNS server
retrieves and temporarily caches foreign records from the authoritative DNS servers of
foreign domains. This minimizes server memory requirements and network bandwidth
requirements. When a DNS client queries the DNS, the query is usually sent to the
topologically nearest DNS server. If a client queries a name that does not exist in the local
domain (that is, the domain over which the local server has authority), then the local server
queries the authoritative server of the domain directly above the local domain in the DNS
hierarchy (called the parent domain). The reply contains the IP address of the authoritative
server of the domain in which the queried name exists. The local DNS server then queries
the authoritative server of the domain in which the queried name exists. Upon receiving a
reply, the local server caches the record. The local server then sends a non-authoritative
reply to the DNS client. For more information about DNS, readers are encouraged to
consult IETF RFC 1034 and RFC 1035.
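The leaf-to-root structure described above can be made concrete with a small sketch. The helper below is purely illustrative (not part of any DNS library): it decomposes an FQDN into the chain of names a resolver conceptually walks, from the leaf up to the root:

```python
def dns_hierarchy(fqdn: str):
    """Return the chain of names from leaf to root for an FQDN.

    The trailing "." denotes the root; each "." separates one
    level of the hierarchy from the next.
    """
    labels = fqdn.rstrip(".").split(".")
    chain = [".".join(labels[i:]) + "." for i in range(len(labels))]
    return chain + ["."]

for name in dns_hierarchy("www.cisco.com."):
    print(name)
# www.cisco.com.
# cisco.com.
# com.
# .
```

Each entry in the chain corresponds to one administrative domain whose authoritative servers may be consulted during resolution.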
Within each network, the network number may be subnetted as desired. Subnetting is
accomplished by manually configuring the appropriate subnet prefix on each router
interface. Within each subnet, individual IP addresses can be statically or dynamically
assigned. It is customary to manually configure each router interface with a statically
assigned IP address. Although the same can be done for host ports, the preferred method is
to automate the process using dynamically assigned addresses. Each host port requires at
least three configuration parameters: an IP address, a subnet prefix, and a default gateway.
The IP address of the default gateway is used to forward packets to destination addresses
that are not connected to the local subnet. These three required parameters (and possibly
many other optional parameters) usually are assigned to end nodes via DHCP.
Based on the Bootstrap Protocol (BOOTP), DHCP was originally defined in 1993 via
RFC 1531. The most recent DHCP RFC is 2131, which is complemented by RFC 2132
(DHCP options). DHCP is a client-server protocol that employs a distributed database
of configuration information. End nodes can access the DHCP database during and
after the boot process. When booting, clients discover DHCP servers by transmitting a
DHCPDISCOVER message to the local IP broadcast address of 255.255.255.255. DHCP
clients do not have an IP address during the boot process, so they use 0.0.0.0 as their
source IP address.
Network Boundaries
An IP network can be logically or virtually bounded. Logical boundaries are delimited by
interfaces in networking devices (such as routers, multilayer switches, and firewalls) and
by ports in hosts. OSI Layer 3 control information can be transmitted between IP networks.
When a router generates control information, it uses one of its own IP addresses as the
source address in the header of the IP packets. When OSI Layer 3 control information is
forwarded from one IP network to another, the IP packets are forwarded like user data
packets. When user data packets are forwarded from one IP network to another, the source
and destination IP addresses in the IP header are not modied, but a new data-link layer
header and trailer are generated. If a network is subnetted, each subnet operates as an
independent network. Figure 6-7 illustrates the logical boundaries of IP networks.
Figure 6-7 Logical Boundaries of IP Networks (computers attach to Ethernet switches, and IP router interfaces delimit the network boundaries)
Virtual boundaries are delimited by VLANs. Many Ethernet switches support a
technique called Layer 3 switching that collapses the router functionality into the
Ethernet switch.
Figure 6-8 Virtual Boundaries of IP Networks (computers attach to a single Ethernet switch; VLAN 2 bounds IP Network 2, and an IP router forwards traffic between VLANs)
IP packets sent to the local broadcast address of 255.255.255.255 do not cross IP network
boundaries. However, routers can be manually configured to convert local broadcast
packets to unicast packets and forward the unicast packets. This is typically accomplished
on a per-protocol basis. The ULP identifier and destination IP address must be configured
on each router interface (or sub-interface) expected to receive and forward local broadcast
packets. This configuration promotes service scalability in large environments by enabling
organizations to centralize services that must otherwise be accessed via local broadcast
packets. For example, an organization may choose to forward all DHCP broadcasts from
every subnet to a centralized DHCP server.
In addition to the local broadcast address, IP packets can also be sent to a subnet broadcast
address. This is called a directed broadcast because IP routers forward such packets to
the destination subnet. No special configuration is required on any of the routers. Upon
receiving a directed broadcast, the router connected to the destination subnet converts the
packet to a local broadcast and then transmits the packet on the destination subnet. An
example of a directed broadcast address is 172.45.9.255 for the subnet 172.45.9.0/24. Local
broadcast packets can be forwarded to a directed broadcast address instead of a unicast
address. This further promotes service scalability in large environments. For example, an
organization that forwards all DHCP broadcasts from every subnet to a centralized DHCP
server might overload the server. By using a directed broadcast, multiple DHCP servers can
be connected to the destination subnet, and any available server can reply. DHCP and many
other services are designed to support this configuration.
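The directed broadcast address cited above (172.45.9.255 for 172.45.9.0/24) is simply the subnet's highest address, which Python's standard ipaddress module can compute:

```python
import ipaddress

# The directed broadcast address of a subnet is its highest address,
# that is, the host bits set to all 1s.
subnet = ipaddress.ip_network("172.45.9.0/24")
print(subnet.broadcast_address)   # 172.45.9.255
```

A router attached to 172.45.9.0/24 converts a packet addressed to 172.45.9.255 into a local broadcast on that subnet.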
Packet Formats
IP uses a header but does not use a trailer. The IP header format defined in RFC 791 is still
in use today, but some fields have been redefined. IP packets are word-oriented, and an IP
word is 4 bytes. Figure 6-9 illustrates the current IP packet format.
Figure 6-9 Current IP Packet Format (Word #0: Version, IHL, DiffServ, ECN, and Total Length; Word #1: Identification, Flags, and Fragment Offset; Word #2: Time to Live, Protocol, and Header Checksum; Word #3: Source Address; Word #4: Destination Address; Words #5 through #13: Options and Padding, if present; remaining words through #16383: Data)
The Version field is 4 bits long. It indicates the IP version number as previously
discussed in this chapter. By parsing this field first, the format of the header can be
determined.
The Internet Header Length (IHL) field is 4 bits long. It indicates the total length
of the IP header expressed in 4-byte words. Valid values are 5 through 15. Thus, the
minimum length of an IP header is 20 bytes, and the maximum length is 60 bytes. This
field is necessary because the IP header length is variable due to the Options field.
The Differentiated Services (DiffServ) field is 6 bits long. It indicates the level of
service that the packet should receive from each router. Each of the possible values
of this field can be mapped to a QoS policy. Each mapping is called a differentiated
services codepoint (DSCP). See Chapter 9, "Flow Control and Quality of Service," for
more information about QoS in IP networks.
The Explicit Congestion Notification (ECN) field is 2 bits long. It reactively indicates
to source nodes that congestion is being experienced. See Chapter 9, "Flow Control and
Quality of Service," for more information about flow control in IP networks.
The Total Length field is 16 bits long. It indicates the overall length of the packet
(header plus data) expressed in bytes. An indication of the total packet length is
required because the length of the Data field can vary. Because this field is 16 bits
long, the maximum length of an IP packet is 65,535 bytes.
The Identification field is 16 bits long. It contains a value assigned to each packet
by the source node. The value is unique within the context of each source address,
destination address, and protocol combination. The value is used to associate
fragments of a packet to aid reassembly at the destination node.
The Flags field is 3 bits long. It contains a reserved bit, the don't fragment (DF) bit, and
the more fragments (MF) bit. The DF bit indicates whether a packet may be fragmented
by intermediate devices such as routers and firewalls. A value of 0 permits fragmentation,
and a value of 1 requires that the packet be forwarded without fragmentation. The MF
bit indicates whether a packet contains the final fragment. A value of 0 indicates either
that the original packet is unfragmented, or that the original packet is fragmented and this
packet contains the last fragment. A value of 1 indicates that the original packet is
fragmented, and this packet does not contain the last fragment.
The Fragment Offset field is 13 bits long. It indicates the offset of the Data field in each
fragment from the beginning of the Data field in the original packet. Because this field
is only 13 bits long, the offset is expressed in 8-byte units to accommodate the maximum
IP packet length. Thus, packets are fragmented on 8-byte boundaries. The minimum
fragment length is 8 bytes except for the last fragment, which has no minimum length
requirement.
The Time To Live (TTL) field is 8 bits long. The original intent of the TTL field was to
measure the lifespan of each packet in 1-second increments. However, implementation of
the TTL field as a clock proved to be impractical, so the TTL field is now used to
count the number of routers the packet may pass through (called hops) before the packet
must be discarded. The value of this field is set by the source node, and each router
decrements the value by 1 before forwarding the packet. By limiting the maximum
number of hops, infinite forwarding of packets is avoided in the presence of routing loops.
The Protocol field is 8 bits long. It contains the number of the network layer protocol
or ULP to which the data should be delivered. IANA assigns the IP protocol numbers.
Some common network layer protocols are ICMP (protocol 1), Enhanced Interior
Gateway Routing Protocol (EIGRP) (protocol 88), and Open Shortest Path First (OSPF)
(protocol 89). The most common ULPs are TCP (protocol 6) and UDP (protocol 17).
The Header Checksum field is 16 bits long. It contains a checksum that is calculated
on all header fields. The value of the Checksum field is 0 for the purpose of calculating
the checksum. The checksum must be recalculated by each router because the TTL
field is modified by each router. Likewise, NAT devices must recalculate the checksum
because the Source or Destination Address fields are modified.
The Source Address field is 32 bits long. It contains the IP address of the source node.
The Destination Address field is 32 bits long. It contains the IP address of the
destination node.
The Options field, if present, contains one or more options. Options enable
negotiation of security parameters, recording of timestamps generated by each router
along a given path, specification of routes by source nodes, and so forth. Options vary
in length, and the minimum length of a single option is 1 byte. The length of this field
is variable, with no minimum length and a maximum length of 40 bytes.
The Padding field is used to pad the header to the nearest 4-byte boundary. The length
of this field is variable, with no minimum length and a maximum length of
3 bytes. This field is required only if the Options field is used and the Options field
does not end on a 4-byte boundary. If padding is used, the value of this field is set to 0.
The Data field, if present, may contain another network layer protocol (such as
ICMP or OSPF) or a ULP (such as TCP or UDP). The length of this field is variable,
with no minimum length and a maximum length of 65,515 bytes.
The preceding field descriptions are simplified for the sake of clarity. For more information
about the IPv4 packet format, readers are encouraged to consult IETF RFCs 791, 815, 1122,
1191, 1812, 2474, 2644, 3168, and 3260.
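The header layout and the Header Checksum behavior described above can be sketched in Python with the standard struct module. The addresses and payload length below are hypothetical examples, and the checksum routine is the standard one's-complement algorithm from RFC 791, computed with the Checksum field zeroed as the field description states:

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words; the Header
    Checksum field must be zero while the checksum is computed."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total > 0xFFFF:                    # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_ipv4_header(src: bytes, dst: bytes, payload_len: int,
                      ttl: int = 64, protocol: int = 6) -> bytes:
    """Pack a minimal 20-byte header: no options, TCP payload by default."""
    version_ihl = (4 << 4) | 5               # Version 4, IHL = 5 words
    header = struct.pack("!BBHHHBBH4s4s",
                         version_ihl, 0,     # DiffServ/ECN zeroed
                         20 + payload_len,   # Total Length
                         0, 0,               # Identification, Flags/Offset
                         ttl, protocol,
                         0,                  # Header Checksum placeholder
                         src, dst)
    checksum = ipv4_checksum(header)
    return header[:10] + struct.pack("!H", checksum) + header[12:]

hdr = build_ipv4_header(bytes([172, 45, 9, 36]), bytes([172, 45, 8, 1]), 100)
# A receiver sums the entire header, checksum included; a valid
# header yields zero after the final complement.
print(ipv4_checksum(hdr) == 0)   # True
```

Because each router decrements TTL, it must zero the Checksum field, recompute the sum, and rewrite the field before forwarding, exactly as the Header Checksum description requires.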
Delivery Mechanisms
IP supports only one set of delivery mechanisms that provide unacknowledged,
connectionless service. Transport layer protocols complement IP to provide other delivery
services. IP implements the following delivery mechanisms:
Because IP was not designed to provide reliable delivery, ICMP was created to provide
a means of notifying source nodes of delivery failure. Under certain circumstances, IP
must notify source nodes via ICMP when a packet is dropped. Examples include the
PMTU discovery process, a TTL expiration, and a packet reassembly timeout. However,
IP is not required to send notification to source nodes under all drop conditions.
Examples include a queue overrun and an IP Header Checksum error. Additionally,
IP does not detect packet drops resulting from external causes. Examples include an
Ethernet CRC error and a malformed PPP frame resulting from SONET path failover.
The IP process within a source node must notify the appropriate network layer
protocol or ULP upon receipt of an ICMP error message that indicates a drop has
occurred. In the absence of source node notification, detection of dropped packets is
the responsibility of the network layer protocol or ULP that generated the packets.
Additionally, the subsequent recovery behavior is determined by the network layer
protocol or ULP that generated the packets.
The fields in the IP header can facilitate detection of duplicate packets, but IP makes no
effort to detect duplicates for two reasons. First, IP is not responsible for detection of
duplicate packets. Second, the Identification field is not used by source nodes in a manner
that guarantees that each duplicate is assigned the same identification number as the
original packet. So, if a duplicate packet is received, IP delivers the packet to the
appropriate network layer protocol or ULP in the normal manner. Detection of duplicates
is the responsibility of the network layer protocol or ULP that generated the packets.
IP devices can detect corrupt IP headers via the Header Checksum field, but IP devices
cannot detect corruption in the Data field. Upon detection of a corrupt header, the
packet is dropped.
IP does not support retransmission. Each network layer protocol and ULP is expected
to define its own retransmission mechanism.
Bandwidth is not guaranteed by default, but QoS mechanisms are defined that enable
bandwidth guarantees to be implemented. See Chapter 9, "Flow Control and Quality
of Service," for more information about QoS. Monitoring and trending of bandwidth
utilization on shared links is required to ensure optimal network operation.
Oversubscription on shared links must be carefully calculated to avoid bandwidth
starvation during peak periods.
Consistent latency is not guaranteed by default, but QoS mechanisms are defined that
enable jitter to be minimized. See Chapter 9, "Flow Control and Quality of Service,"
for more information about QoS.
In-order delivery is not guaranteed. IP does not support packet reordering. Each
network layer protocol and ULP is expected to define its own out-of-order packet
detection and reordering mechanism.
ICMP
As previously stated, ICMP complements IP. ICMP is an integral part of IP, yet ICMP uses
the services of IP for packet delivery. Figure 6-10 illustrates the architectural relationship
of ICMP to IP.
Figure 6-10 Architectural Relationship of ICMP to IP
(ICMP resides alongside IP at the network layer of the OSI reference model, above the data-link and physical layers and below the transport, session, presentation, and application layers.)
ICMP can be used for many purposes, including error notification, congestion notification,
route redirection, route verification, address discovery, and so on. Many ICMP message
types are defined to accomplish these functions. A common packet format is defined for all
ICMP message types. Figure 6-11 illustrates the ICMP packet format.
Figure 6-11 ICMP Packet Format
(Words #0 through #4 carry the IP header; Word #5 carries the Type, Code, and Checksum fields; Words #6 through #16383 carry the Type/Code Specific field.)
The Type field is 1 byte long. It indicates the type of ICMP message.
The Code field is 1 byte long. It indicates the specific ICMP message within each type.
The Checksum field is 2 bytes long. It contains a checksum that is calculated on all
ICMP fields including the Type/Code Specific field. The value of the Checksum field
is zero for the purpose of calculating the checksum. If the total length of the ICMP
packet is odd, one byte of padding is added to the Type/Code Specific field for the
purpose of calculating the checksum. If padding is used, the value of the pad byte is
set to zero.
The Type/Code Specific field is variable in length. It contains additional fields that
are defined specifically for each ICMP message.
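The checksum rules just described, including the zero pad byte for odd-length packets, can be sketched in Python with the standard struct module. The identifier, sequence number, and data below are hypothetical example values:

```python
import struct

def icmp_checksum(packet: bytes) -> int:
    """Checksum over the whole ICMP packet with the Checksum field
    zeroed; a zero pad byte is appended when the length is odd."""
    if len(packet) % 2:
        packet += b"\x00"
    total = 0
    for i in range(0, len(packet), 2):
        total += (packet[i] << 8) | packet[i + 1]
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Echo Request: Type 8, Code 0, then an identifier, a sequence
# number, and odd-length data to exercise the padding rule.
payload = struct.pack("!HH", 0x1234, 1) + b"ping!"
packet = struct.pack("!BBH", 8, 0, 0) + payload   # Checksum field = 0
packet = packet[:2] + struct.pack("!H", icmp_checksum(packet)) + packet[4:]
print(icmp_checksum(packet))   # 0 for a correctly checksummed packet
```

The pad byte affects only the checksum computation; it is not transmitted as part of the packet.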
Currently, 22 types of ICMP message are defined for IPv4. Of these, 15 are in widespread
use today. Table 6-2 lists the currently defined ICMP message types for IPv4 and the codes
associated with each message type.
Table 6-2 ICMP Message Types and Codes for IPv4

Type    Type Description                       Code  Code Description
0       Echo Reply                             0
1–2     Unassigned
3       Destination Unreachable                0     Network Unreachable
                                               1     Host Unreachable
                                               2     Protocol Unreachable
                                               3     Port Unreachable
                                               (additional codes defined)
4       Source Quench                          0
5       Redirect                               0     Network Redirect
                                               1     Host Redirect
8       Echo Request                           0
9       Router/Mobile Agent Advertisement      0     Normal Router
                                               16    Mobility Agent
10      Router/Mobile Agent Solicitation       0
11      Time Exceeded
12      Parameter Problem
13      Timestamp Request
14      Timestamp Reply
15      Information Request
16      Information Reply
17      Address Mask Request
18      Address Mask Reply
19–29   Unassigned
30      Traceroute
31      Datagram Conversion Error; Not Used
32–36   Unassigned
37      Domain Name Request
38      Domain Name Reply
39      SKIP
40      Photuris
41–255  Unassigned
Comprehensive exploration of all ICMP messages and their payloads is outside the scope
of this book. For more information about ICMP, readers are encouraged to consult IETF
RFCs 792, 950, 1122, 1256, 1812, and 3344.
Fibre Channel
As previously stated, Fibre Channel (FC) does not inherently provide any OSI Layer 3
functionality. However, the FC-BB specification series enables the use of other network
technologies to connect geographically dispersed Fibre Channel storage area networks
(FC-SANs). This is commonly called FC-SAN extension. Such extensions are often
deployed for disaster recovery applications and often leverage IP networks.
Summary
The Internet has a long history, which has culminated in a very robust protocol suite. With
the advent of IPS protocols, storage network administrators now can leverage that robust
protocol suite. An in-depth analysis of IP and ICMP is provided in this chapter to enable
storage network administrators to understand the network layer functionality that
underlies all IPS protocols. FC does not inherently provide any network layer functionality,
so this chapter does not discuss FC. Chapter 7, "OSI Transport Layer," builds upon this
chapter by exploring the transport layer functionality of TCP and UDP. Chapter 7 also
explores the transport layer functionality of FC.
Review Questions
1 Is IP a routed protocol or a routing protocol?
2 What IP term is equivalent to buffer?
3 What is the Ethertype of IP?
4 Does Ethernet padding affect an IP packet?
5 What are the three components of PPP?
6 Which of the IPCP negotiated options affect FCIP and iFCP?
header?
Explain the major differences between User Datagram Protocol (UDP) and
Transmission Control Protocol (TCP).
CHAPTER 7: OSI Transport Layer

TCP/IP Suite
As Chapter 6, "OSI Network Layer," stated, IP does not provide reliable delivery. IP
delivers packets on a best-effort basis and provides notification of non-delivery in some
situations. This is because not all applications require reliable delivery. Indeed, the nature
of some applications suits them to unreliable delivery mechanisms. For example, a network
management station (NMS) that must poll thousands of devices using the Simple Network
Management Protocol (SNMP) could be heavily burdened by the overhead associated with
reliable delivery. Additionally, the effect of dropped SNMP packets is negligible as long
as the proportion of drops is low. So, SNMP uses unreliable delivery to promote NMS
scalability. However, most applications require reliable delivery. So, the TCP/IP suite
places the burden of reliable delivery on the transport layer and provides a choice of
transport layer protocols. For unreliable delivery, UDP is used. For reliable delivery, TCP
is used. This provides flexibility for applications to choose the transport layer protocol that
best meets their delivery requirements. For additional architectural information about the
TCP/IP suite, readers are encouraged to consult IETF RFC 1180.
UDP
When packet loss has little or no negative impact on an application, UDP can be used.
Broadcast traffic is usually transmitted using UDP encapsulation (for example,
DHCPDISCOVER packets defined in IETF RFC 2131). Some unicast traffic also uses
UDP encapsulation. Examples of UDP-based unicast traffic include SNMP commands
and DNS queries.
TIP
The meaning of the term port varies depending on the context. Readers should not confuse
the meaning of port in the context of UDP and TCP with the meaning of port in other
contexts (such as switching hardware or the SAM addressing scheme).
IANA assigns WKPs. WKPs derive their name from their reserved status. Once IANA assigns
a port number in the WKP range to a session layer protocol, that port number is reserved for
use by only that session layer protocol. Assigned WKPs are used on servers to implement
popular services for clients. By reserving port numbers for popular services, clients always
know which port number to contact when requesting a service. For example, a server-based
implementation of DNS listens for incoming client queries on UDP port 53. All unassigned
WKPs are reserved by IANA and may not be used. IANA does not assign port numbers outside
the WKP range. However, IANA maintains a public register of the server-based port numbers
used by session layer protocols that have not been assigned a WKP. These are called Registered
Ports. The third category is Dynamic Ports. It is used by clients when communicating with
servers. Each session layer protocol within a client dynamically selects a port number in the
Dynamic Port range upon initiating communication with a server. This facilitates server-to-client
communication for the duration of the session. When the session ends, the Dynamic Port
number is released and becomes available for reuse within the client.
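The three port categories can be sketched as a small classifier. The numeric boundaries below (0 through 1023 well-known, 1024 through 49151 registered, 49152 through 65535 dynamic) come from IANA's numbering convention; the text above names the categories but not the ranges:

```python
def port_category(port: int) -> str:
    """Classify a TCP/UDP port number into the three categories
    described above, per IANA's numbering convention."""
    if not 0 <= port <= 65535:
        raise ValueError("port numbers are 16 bits wide: 0-65535")
    if port <= 1023:
        return "well-known"    # assigned or reserved by IANA
    if port <= 49151:
        return "registered"    # registered with IANA but not assigned
    return "dynamic"           # selected by clients per session

print(port_category(53))      # well-known (a DNS server listens here)
print(port_category(55160))   # dynamic (a typical client-side port)
```

A server listens on its well-known or registered port, while the client's dynamically selected port lets replies find their way back to the correct session.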
UDP does not provide data segmentation, reassembly, or reordering for ULPs. So, ULPs
that use UDP might suffer performance degradation due to IP fragmentation of large UDP
packets. To avoid fragmentation, ULPs that use UDP are required to assess the maximum
data segment size that can be transmitted. To facilitate this, UDP transparently provides
session layer protocols access to the IP service interface. The session layer protocol in
the source host queries UDP, which in turn queries IP to determine the local MTU. The
session layer protocol then generates packets equal to or less than the local MTU, minus
the overhead bytes of the IP and UDP headers. Alternately, path maximum transmission
unit (PMTU) discovery can be used. The session layer protocol in the destination host must
reorder and reassemble received segments using its own mechanism.
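The sizing arithmetic described above amounts to subtracting the IP and UDP header overhead from the local MTU. A minimal sketch, assuming a 20-byte IP header (no options) and the 8-byte UDP header:

```python
IP_HEADER_MIN = 20  # bytes: IHL = 5 words, no IP options
UDP_HEADER = 8      # bytes: Source Port, Destination Port, Length, Checksum

def max_udp_payload(local_mtu: int) -> int:
    """Largest session layer datum that fits in one UDP packet
    without triggering IP fragmentation on the local link."""
    return local_mtu - IP_HEADER_MIN - UDP_HEADER

print(max_udp_payload(1500))  # 1472 on a standard Ethernet link
```

If IP options are present, or if a smaller PMTU is discovered, the usable payload shrinks accordingly.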
Figure 7-1 UDP Packet Format (Word #0: Source Port and Destination Port; Word #1: Length and Checksum; Words #2 through #16378: Data)
The Source Port field is 16 bits long. It indicates the port number of the session layer
protocol within the transmitting host. The receiving host copies this value to the
Destination Port field in reply packets.
The Destination Port field is 16 bits long. It indicates the port number of the session layer
protocol within the receiving host. This enables the receiving host to properly route
incoming packets to the appropriate session layer protocol. The receiving host copies
this value to the Source Port field in reply packets.
The Length field is 16 bits long. It indicates the overall length of the packet (header plus
data) expressed in bytes. Because this field is 16 bits long, the theoretical maximum
length of a UDP packet is 65,535 bytes. The actual maximum length of a UDP packet is
65,515 bytes (the maximum size of the Data field in an IP packet).
Figure 7-2 IP Pseudo-Header Format (Word #0: Source IP Address; Word #1: Destination IP Address; Word #2: one byte of zero Padding, the Protocol number, and the UDP Length)
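The pseudo-header shown in Figure 7-2 is prepended to the UDP packet only for checksum computation; it is never transmitted. A sketch of the standard RFC 768 procedure, using hypothetical addresses and ports (the final `or 0xFFFF` reflects the RFC 768 rule that a computed zero is transmitted as all 1s, because a transmitted zero means "no checksum"):

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Fold 16-bit words into a one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_packet: bytes) -> int:
    """Checksum over the pseudo-header (source address, destination
    address, zero pad byte, protocol 17, UDP length) plus the UDP
    header and data, with the Checksum field zeroed."""
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_packet))
    checksum = ~ones_complement_sum(pseudo + udp_packet) & 0xFFFF
    return checksum or 0xFFFF    # a computed zero is sent as all 1s

# Hypothetical DNS query from client port 55160, Checksum field zeroed.
data = b"example-query"
header = struct.pack("!HHHH", 55160, 53, 8 + len(data), 0)
csum = udp_checksum(bytes([172, 45, 9, 36]), bytes([172, 45, 8, 1]),
                    header + data)
```

Including the IP addresses in the computation lets the receiver detect packets that were corrupted in a way that misdelivered them, even though those addresses live in the IP header rather than the UDP header.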
For more information about the UDP packet format, readers are encouraged to consult
IETF RFCs 768 and 1122. A new variation of UDP (called UDP-Lite) is documented in
RFC 3828. UDP-Lite is optimized for streaming media such as voice and video. As such,
UDP-Lite is not applicable to modern storage networks. UDP-Lite is mentioned herein only
to prevent confusion.
UDP does not detect packets dropped in transit. When the UDP process within a
destination host drops a packet because of checksum failure, UDP does not report the
drop to the ULP or source host. When the UDP process within a destination host drops
a packet because the destination port number is not active, IP may report the drop
to the source host by transmitting an ICMP Type 3 Code 3 message. However,
notification is not required. In the absence of notification, ULPs are expected to detect
drops via their own mechanisms. Likewise, recovery behavior is determined by the
ULP that generated the packets.
UDP does not detect duplicate packets. If a duplicate packet is received, UDP delivers
the packet to the appropriate ULP. ULPs are expected to detect duplicates via their
own mechanisms.
TCP/IP Suite
UDP can detect corrupt packets via the Checksum field. Upon detection of a corrupt
packet, the packet is dropped.
UDP does not provide acknowledgement of successful packet delivery. Each ULP is
expected to define its own acknowledgement mechanism (if required).
UDP does not provide retransmission. Each ULP is expected to define its own
retransmission mechanism (if required).
Guaranteeing fixed or minimal latency is not a transport layer function. UDP relies on
QoS policies implemented by IP for latency guarantees.
UDP does not guarantee in-order delivery. UDP does not provide packet reordering.
Each ULP is expected to define its own out-of-order packet detection and reordering
mechanism (if required).
Guaranteeing reserved bandwidth is not a transport layer function. UDP relies on QoS
policies implemented by IP for bandwidth guarantees.
For more information about UDP delivery mechanisms, readers are encouraged to consult
IETF RFCs 768, 1122, and 1180.
TCP
When packet loss affects an application negatively, TCP is used. Most unicast traffic uses
TCP encapsulation.
conduct multiple simultaneous sessions with several iSCSI initiators (hosts). The storage
array's iSCSI socket is reused over time as new sessions are established.
Clients may also reuse sockets. For example, a single-homed host may select TCP port
55,160 to open a TCP connection for file transfer via FTP. Upon completing the file transfer,
the client may terminate the TCP connection and reuse TCP port 55,160 for a new TCP
connection. Reuse of Dynamic Ports is necessary because the range of Dynamic Ports is
finite and would otherwise eventually be exhausted. A potential problem arises from reusing
sockets. If a client uses a given socket to connect to a given server socket more than once,
the first incarnation of the connection cannot be distinguished from the second incarnation.
Reincarnated TCP connections can also result from certain types of software crashes. TCP
implements multiple mechanisms that work together to resolve the issue of reincarnated
TCP connections. Note that each TCP host can open multiple simultaneous connections
with a single peer host by selecting a unique Dynamic Port number for each connection.
Figure 7-3 TCP Packet Format
[Figure: Source Port and Destination Port (Word 0), Sequence Number (Word 1), Acknowledgment Number (Word 2), Data Offset, Reserved, Control Bits, and Window (Word 3), Checksum and Urgent Pointer (Word 4), Options and Padding (Words 5 through 13), and Data (through Word 16,378). The Control Bits occupy bits 8 through 15 of Word 3: CWR, ECE, URG, ACK, PSH, RST, SYN, and FIN.]
Source Port: 16 bits long. It indicates the port number of the session layer protocol
within the transmitting host. The receiving host copies this value to the Destination
Port field in reply packets.
Destination Port: 16 bits long. It indicates the port number of the session layer
protocol within the receiving host. This enables the receiving host to properly route
incoming packets to the appropriate session layer protocol. The receiving host copies
this value to the Source Port field in reply packets.
Sequence Number: 32 bits long. It contains the sequence number of the first byte
in the Data field. TCP consecutively assigns a sequence number to each data byte
that is transmitted. Sequence numbers are unique only within the context of a single
connection. Over the life of a connection, this field represents an increasing counter
of all data bytes transmitted. The Sequence Number field facilitates detection of
dropped and duplicate packets, and reordering of packets. The purpose and use of the
TCP Sequence Number field should not be confused with the FC Sequence_Identifier
(SEQ_ID) field.
Data Offset (also known as the TCP Header Length field): 4 bits long. It indicates
the total length of the TCP header expressed in 4-byte words. This field is necessary
because the TCP header length is variable due to the Options field. Valid values are 5
through 15. Thus, the minimum length of a TCP header is 20 bytes, and the maximum
length is 60 bytes.
ECN Echo (ECE) bit: when set to 1, it informs the transmitting host that the
receiving host received notification of congestion from an intermediate router. This bit
works in conjunction with the CWR bit. See Chapter 9, "Flow Control and Quality of
Service," for more information about flow control in IP networks.
Urgent (URG) bit: when set to 1, it indicates the presence of urgent data in the Data
field. Urgent data may fill part or all of the Data field. The definition of urgent is left
to ULPs. In other words, the ULP in the source host determines whether to mark data
urgent. The ULP in the destination host determines what action to take upon receipt
of urgent data. This bit works in conjunction with the Urgent Pointer field.
Push (PSH) bit: when set to 1, it instructs TCP to act immediately. In the source
host, this bit forces TCP to immediately transmit data received from all ULPs. In the
destination host, this bit forces TCP to immediately process packets and forward
the data to the appropriate ULP. When a ULP requests termination of an open
connection, the push function is implied. Likewise, when a packet is received with the
FIN bit set to 1, the push function is implied.
Reset (RST) bit: when set to 1, it indicates a request to reset the connection. This bit
is used to recover from communication errors such as invalid connection requests,
invalid protocol parameters, and so on. A connection reset results in abrupt termination
of the connection without preservation of state or in-flight data.
Urgent Pointer: 16 bits long. It contains a number that represents an offset from the
current Sequence Number. The offset marks the last byte of urgent data in the Data
field. This field is meaningful only when the URG bit is set to 1.
Window: 16 bits long. It indicates the current size (expressed in bytes) of the
transmitting host's receive buffer. The size of the receive buffer fluctuates as packets
are received and processed, so the value of the Window field also fluctuates. For this
reason, the TCP flow-control mechanism is called a sliding window. This field
works in conjunction with the Acknowledgement Number field.
The preceding descriptions of the TCP header fields are simplified for the sake of clarity.
For more information about the TCP packet format, readers are encouraged to consult IETF
RFCs 793, 1122, 1323, and 3168.
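The header fields described above can be unpacked directly from a raw segment. The following is a minimal sketch (not from the book) using Python's struct module; the sample port numbers and ISN are hypothetical, and the flag masks reflect the RFC 793/3168 bit assignments.

```python
import struct

def parse_tcp_header(segment: bytes):
    """Unpack the fixed 20-byte portion of a TCP header (RFC 793)."""
    if len(segment) < 20:
        raise ValueError("TCP header is at least 20 bytes")
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = (off_flags >> 12) & 0xF   # header length in 4-byte words (5..15)
    flags = off_flags & 0x00FF              # CWR, ECE, URG, ACK, PSH, RST, SYN, FIN
    return {
        "source_port": src, "destination_port": dst,
        "sequence": seq, "acknowledgment": ack,
        "header_length": data_offset * 4,   # 20..60 bytes
        "cwr": bool(flags & 0x80), "ece": bool(flags & 0x40),
        "urg": bool(flags & 0x20), "ack_flag": bool(flags & 0x10),
        "psh": bool(flags & 0x08), "rst": bool(flags & 0x04),
        "syn": bool(flags & 0x02), "fin": bool(flags & 0x01),
        "window": window, "checksum": checksum, "urgent_pointer": urgent,
        "options": segment[20:data_offset * 4],
    }

# A hypothetical SYN segment: no options (Data Offset 5), ISN 1600.
raw = struct.pack("!HHIIHHHH", 55160, 3260, 1600, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(raw)
```

The Data Offset value of 5 indicates a 20-byte header, so the Options slice is empty in this example.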
TCP Options
A TCP option can consist of a single byte or multiple bytes. All multi-byte options use
a common format. Figure 7-4 illustrates the format of TCP multi-byte options.
Figure 7-4 TCP Multi-Byte Option Format
[Figure: Kind (1 byte) and Length (1 byte) fields in Word 0, followed by Option Data (through Word 9).]
Length: 8 bits long. It indicates the total length of the option (including the Kind and
Length fields) expressed in bytes.
Table 7-1 lists the currently defined TCP options that are in widespread use.
Table 7-1
TCP Options

Kind  Length in Bytes  Description
0     1                End of Option List
1     1                No-Operation (No-Op)
2     4                Maximum Segment Size (MSS)
3     3                Window Scale (WSopt)
4     2                SACK-Permitted
5     Variable         SACK
8     10               Timestamps (TSopt)
19    18               MD5 Signature
Of the options listed in Table 7-1, only the MSS, WSopt, SACK-Permitted, SACK, and
TSopt are relevant to modern storage networks. So, only these options are discussed in
detail in this section.
Figure 7-5 MSS Option Format
[Figure: Kind (1 byte), Length (1 byte), and Maximum Segment Size (2 bytes) fields in Word 0.]
Kind: set to 2.
Length: set to 4.
Maximum Segment Size: 16 bits long. It indicates the largest segment size that the
sender of this option is willing to accept. The value is expressed in bytes.
Window Scale
One drawback of all proactive flow control mechanisms is that they limit throughput based
on the amount of available receive buffers. As previously stated, a host may not transmit
when the receiving host is out of buffers. To understand the effect of this requirement on
throughput, it helps to consider a host that has enough buffer memory to receive only one
packet. The transmitting host must wait for an indication that each transmitted packet has
been processed before transmitting an additional packet. This requires one round-trip per
transmitted packet. While the transmitting host is waiting for an indication that the receive
buffer is available, it cannot transmit additional packets, and the available bandwidth is
unused. So, the amount of buffer memory required to sustain maximum throughput can be
calculated as:
Bandwidth * Round-trip time (RTT)
where bandwidth is expressed in bytes per second, and RTT is expressed in seconds. This
is known as the bandwidth-delay product. Although this simple equation does not account
for the protocol overhead of lower-layer protocols, it does provide a reasonably accurate
estimate of the TCP memory requirement to maximize throughput.
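The bandwidth-delay product calculation above is a one-liner. The following is a small sketch (not from the book) that applies it; the 1 Gb/s link and 50 ms RTT are illustrative values.

```python
def bandwidth_delay_product(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Buffer space (bytes) needed to keep the pipe full: bandwidth * RTT."""
    # Bandwidth is quoted in bits per second, so divide by 8 to get bytes.
    return (bandwidth_bps / 8) * rtt_seconds

# A 1 Gb/s path with a 50 ms round-trip time needs roughly 6.25 MB of buffer.
bdp = bandwidth_delay_product(1_000_000_000, 0.050)
```

Note that this result far exceeds the 65,535-byte maximum that the unscaled 16-bit Window field can advertise, which motivates the Window Scale option discussed next.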
As TCP/IP matured and became widely deployed, the maximum window size that could be
advertised via the Window field proved to be inadequate in some environments. So-called
long fat networks (LFNs), which combine very long distance with high bandwidth, became
common as the Internet and corporate intranets grew. LFNs often exceed the original design
parameters of TCP, so a method of increasing the maximum window size is needed. The
Window Scale option (WSopt) was developed to resolve this issue. WSopt works by
shifting the bit positions in the Window field to omit the least significant bit(s). To derive
the peer host's correct window size, each host applies a multiplier to the window size
advertisement in each packet received from the peer host. Figure 7-6 illustrates the WSopt
format.
Figure 7-6 WSopt Format
[Figure: Kind (1 byte), Length (1 byte), and Shift Count (1 byte) fields in Word 0.]
Kind: set to 3.
Length: set to 3.
Shift Count: 8 bits long. It indicates how many bit positions the Window field is
shifted. This is called the scale factor.
Both hosts must send WSopt for window scaling to be enabled on a connection. Sending
this option indicates that the sender can both send and receive scaled Window fields. Note
that WSopt may be sent only during connection establishment, so window scaling cannot
be enabled during the life of an open connection.
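The scaling arithmetic is a simple left shift. The following is a minimal sketch (not from the book) of how a receiver derives the effective window; the 60,000-byte window and shift count of 7 are illustrative, and the cap of 14 on the shift count comes from RFC 1323.

```python
def effective_window(advertised_window: int, shift_count: int) -> int:
    """Scale a 16-bit Window field by the WSopt shift count (RFC 1323)."""
    if not 0 <= shift_count <= 14:
        # RFC 1323 limits the scale factor to 14 to keep windows below 2^30.
        raise ValueError("shift count must be 0..14")
    return advertised_window << shift_count

# An advertised window of 60,000 bytes with a shift count of 7
# yields an effective window of 7,680,000 bytes.
win = effective_window(60_000, 7)
```

With a shift count of 7, the least significant 7 bits of the true window size are omitted from the advertisement, so the effective window is always a multiple of 128 bytes.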
Selective Acknowledgement
TCP's cumulative acknowledgement mechanism has limitations that affect performance
negatively. As a result, an optional selective acknowledgement mechanism was developed.
The principal drawback of TCP's cumulative acknowledgement mechanism is that only
contiguous data bytes can be acknowledged. If multiple packets containing non-contiguous
data bytes are received, the non-contiguous data is buffered until the lowest gap is filled
by subsequently received packets. However, the receiving host can acknowledge only the
highest sequence number of the received contiguous bytes. So, the transmitting host must
wait for the retransmit timer to expire before retransmitting the unacknowledged bytes.
At that time, the transmitting host retransmits all unacknowledged bytes for which the
retransmit timer has expired. This often results in unnecessary retransmission of some bytes
that were successfully received at the destination. Also, when multiple packets are dropped,
this procedure can require multiple round-trips to fill all the gaps in the receiving host's
buffer. To resolve these deficiencies, TCP's optional SACK mechanism enables a receiving
host to acknowledge receipt of non-contiguous data bytes. This enables the transmitting
host to retransmit multiple packets at once, with each containing non-contiguous data needed
to fill the gaps in the receiving host's buffer. By precluding the wait for the retransmit timer
to expire, retransmitting only the missing data, and eliminating multiple round-trips from
the recovery procedure, throughput is maximized.
SACK is implemented via two TCP options. The first option is called SACK-Permitted and
may be sent only during connection establishment. The SACK-Permitted option informs
the peer host that SACK is supported by the transmitting host. To enable SACK, both hosts
must include this option in the initial packets transmitted during connection establishment.
Figure 7-7 illustrates the format of the SACK-Permitted option.
Figure 7-7 SACK-Permitted Option Format
[Figure: Kind (1 byte) and Length (1 byte) fields.]
Kind: set to 4.
Length: set to 2.
After the connection is established, the second option may be used. The second option is
called the SACK option. It contains information about the data that has been received.
Figure 7-8 illustrates the format of the SACK option.
Figure 7-8 SACK Option Format
[Figure: Kind (1 byte) and Length (1 byte) fields in Word 0, followed by up to four SACK blocks in Words 1 through 8; each block contains a 32-bit Left Edge and a 32-bit Right Edge.]
Kind: set to 5.
Left Edge of Block X: 32 bits long. It contains the sequence number of the first
byte of data in this block.
Right Edge of Block X: 32 bits long. It contains the sequence number of the first
byte of missing data that follows this block.
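The SACK option layout described above can be encoded and decoded in a few lines. The following is a hedged sketch (not from the book) using Python's struct module; the sequence-number ranges in the example are hypothetical.

```python
import struct

def build_sack_option(blocks):
    """Encode a SACK option (Kind 5) from (left_edge, right_edge) pairs."""
    if not 1 <= len(blocks) <= 4:
        # At most four 8-byte blocks fit in the 40-byte TCP options space.
        raise ValueError("1 to 4 SACK blocks")
    length = 2 + 8 * len(blocks)            # Kind + Length + 8 bytes per block
    body = b"".join(struct.pack("!II", left, right) for left, right in blocks)
    return struct.pack("!BB", 5, length) + body

def parse_sack_option(option: bytes):
    """Decode a SACK option back into (left_edge, right_edge) pairs."""
    kind, length = option[0], option[1]
    assert kind == 5
    n = (length - 2) // 8
    return [struct.unpack("!II", option[2 + 8 * i: 10 + 8 * i]) for i in range(n)]

# Two received-but-non-contiguous ranges: bytes 1000-1999 and 3000-3499.
opt = build_sack_option([(1000, 2000), (3000, 3500)])
```

Each right edge is the sequence number of the first byte *not* covered by the block, matching the field definitions above.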
Timestamps
The Timestamps option (TSopt) provides two functions. TSopt augments TCP's traditional
duplicate packet detection mechanism and improves TCP's RTT estimation in LFN
environments. TSopt may be sent during connection establishment to inform the peer host
that timestamps are supported. If TSopt is received during connection establishment,
timestamps may be sent in subsequent packets on that connection. Both hosts must support
TSopt for timestamps to be used on a connection.
As previously stated, a TCP connection may be reincarnated. Reliable detection of
duplicate packets can be challenging in the presence of reincarnated TCP connections.
TSopt provides a way to detect duplicates from previous incarnations of a TCP connection.
This topic is discussed fully in the following section on duplicate detection.
TCP implements a retransmission timer so that unacknowledged data can be retransmitted
within a useful timeframe. When the timer expires, a retransmission time-out (RTO) occurs,
and the unacknowledged data is retransmitted. The length of the retransmission timer is
derived from the mean RTT. If the RTT is estimated too high, retransmissions are delayed.
If the RTT is estimated too low, unnecessary retransmissions occur. The traditional method
of gauging the RTT is to time one packet per window of transmitted data. This method
works well for small windows (that is, for short distance and low bandwidth), but severely
inaccurate RTT estimates can result from this method in LFN environments. To accurately
measure the RTT, TSopt may be sent in every packet. This enables TCP to make real-time
adjustments to the retransmission timer based on changing conditions within the network
(such as fluctuating levels of congestion, route changes, and so on). Figure 7-9 illustrates
the TSopt format.
Figure 7-9 TSopt Format
[Figure: Kind (1 byte) and Length (1 byte) fields in Word 0, followed by the 32-bit Timestamp Value and the 32-bit Timestamp Echo Reply fields.]
Kind: set to 8.
Length: set to 10.
Timestamp Value (TSval): 32 bits long. It records the time at which the packet is
transmitted. This value is expressed in a unit of time determined by the transmitting
host. The unit of time is usually seconds or milliseconds.
Timestamp Echo Reply (TSecr): 32 bits long. It is used to return the TSval field
to the peer host. Each host copies the most recently received TSval field into the TSecr
field when transmitting a packet. Copying the peer host's TSval field eliminates the
need for clock synchronization between the hosts.
The preceding descriptions of the TCP options are simplified for the sake of clarity. For
more information about the TCP option formats, readers are encouraged to consult IETF
RFCs 793, 1122, 1323, 2018, 2385, and 2883.
TCP can detect packets dropped in transit via the Sequence Number field. A packet
dropped in transit is not acknowledged to the transmitting host. Likewise, a packet
dropped by the receiving host because of checksum failure or other error is not
acknowledged to the transmitting host. Unacknowledged packets are eventually
retransmitted. TCP is connection-oriented, so packets cannot be dropped because of
an inactive destination port number except for the first packet of a new connection (the
connection request packet). In this case, TCP must drop the packet, must reply to the
source host with the RST bit set to 1, and also may transmit an ICMP Type 3 Code 3
message to the source host.
TCP can detect duplicate packets via the Sequence Number field. If a duplicate packet
is received, TCP drops the packet. Recall that TCP is permitted to aggregate data any
way TCP chooses. So, a retransmitted packet may contain the data of the dropped
packet and additional untransmitted data. When such a duplicate packet is received,
TCP discards the duplicate data and forwards the additional data to the appropriate
ULP. Because the Sequence Number field is finite, it is likely that each sequence number
will be used during a long-lived or high-bandwidth connection. When this happens,
the Sequence Number field cycles back to 0. This phenomenon is known as wrapping.
It occurs in many computing environments that involve finite counters. The time
required to wrap the Sequence Number field is denoted as Twrap and varies depending
on the bandwidth available to the TCP connection. When the Sequence Number field
wraps, TCP's traditional method of duplicate packet detection breaks, because valid
packets can be mistakenly identified as duplicate packets. Problems can also occur
with the packet reordering process. So, RFC 793 defines a maximum segment lifetime
(MSL) to ensure that old packets are dropped in the network and not delivered to
hosts. The MSL mechanism adequately protected TCP connections against lingering
duplicates in the early days of TCP/IP networks. However, many modern LANs
provide sufficient bandwidth to wrap the Sequence Number field in just a few seconds.
Reducing the value of the MSL would solve this problem, but would create other
problems. For example, if the MSL is lower than the one-way delay between a pair of
communicating hosts, all TCP packets are dropped. Because LFN environments
combine high one-way delay with high bandwidth, the MSL cannot be lowered
sufficiently to prevent Twrap from expiring before the MSL. Furthermore, the MSL is
enforced via the TTL field in the IP header, and TCP's default TTL value is 60. For
reasons of efficiency and performance, most IP networks (including the Internet) are
designed to provide any-to-any connectivity with fewer than 60 router hops. So, the
MSL never expires in most IP networks. To overcome these deficiencies, RFC 1323
defines a new method of protecting TCP connections against wrapping. The method
is called protection against wrapped sequence numbers (PAWS). PAWS uses TSopt to
logically extend the Sequence Number field with additional high-order bits so that the
Sequence Number field can wrap many times within a single cycle of the TSopt field.
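The time to wrap the sequence space is easy to estimate. The following is a small sketch (not from the book) computing Twrap from the link bandwidth; the 10 Gb/s figure is an illustrative modern-LAN rate.

```python
def time_to_wrap_seconds(bandwidth_bps: float) -> float:
    """Seconds for a sender to cycle the 32-bit TCP sequence space once."""
    # 2^32 sequence numbers, one per byte; bandwidth is in bits per second.
    return (2 ** 32) / (bandwidth_bps / 8)

# At 10 Gb/s the sequence space wraps in under 4 seconds,
# far shorter than any practical MSL.
t = time_to_wrap_seconds(10_000_000_000)
```

This illustrates why the MSL alone cannot protect fast connections against wrapped sequence numbers, and why PAWS is needed.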
TCP can detect corrupt packets via the Checksum field. Upon detection of a corrupt
packet, the packet is dropped.
TCP acknowledges receipt of each byte of ULP data. Acknowledgement does not
indicate that the data has been processed; that must be inferred by combining the
values of the Acknowledgement Number and Window fields. When data is being
transmitted in both directions on a connection, acknowledgment is accomplished
without additional overhead by setting the ACK bit to one and updating the
Acknowledgement Number field in each packet transmitted. When data is being
transmitted in only one direction on a connection, packets containing no data must
be transmitted in the reverse direction for the purpose of acknowledgment.
TCP provides proactive end-to-end flow control via the Window field. TCP implements
a separate window for each connection. Each time a host acknowledges receipt of data
on a connection, it also advertises its current window size (that is, its receive buffer
size) for that connection. A transmitting host may not transmit more data than the
receiving host is able to buffer. After a host has transmitted enough data to fill the
receiving host's receive buffer, the transmitting host cannot transmit additional data
until it receives a TCP packet indicating that additional buffer space has been allocated
or in-flight packets have been received and processed. Thus, the interpretation of the
Acknowledgement Number and Window fields is somewhat intricate. A transmitting
host may transmit packets when the receiving host's window size is zero as long as
the packets do not contain data. In this scenario, the receiving host must accept and
process the packets. See Chapter 9, "Flow Control and Quality of Service," for more
information about flow control. Note that in the IP model, the choice of transport layer
protocol determines whether end-to-end flow control is used. This contrasts with the FC
model, in which the class of service (CoS) determines whether end-to-end flow control
is used.
Guaranteeing reserved bandwidth is not a transport layer function. TCP relies on QoS
policies implemented by IP for bandwidth guarantees.
Guaranteeing fixed or minimal latency is not a transport layer function. TCP relies on
QoS policies implemented by IP for latency guarantees. The PSH bit can be used to
lower the processing latency within the source and destination hosts, but the PSH bit
has no effect on network behavior.
TCP guarantees in-order delivery by reordering packets that are received out of order.
The receiving host uses the Sequence Number field to determine the correct order
of packets. Each byte of ULP data is passed to the appropriate ULP in consecutive
sequence. If one or more packets are received out of order, the ULP data contained in
those packets is buffered by TCP until the missing data arrives.
Comprehensive exploration of all the TCP delivery mechanisms is outside the scope of this
book. For more information about TCP delivery mechanisms, readers are encouraged to
consult IETF RFCs 793, 896, 1122, 1180, 1191, 1323, 2018, 2309, 2525, 2581, 2873, 2883,
2914, 2923, 2988, 3042, 3168, 3390, 3517, and 3782.
Number field contains the initiating host's ISN incremented by one. Upon receipt of the
SYN/ACK segment, the initiating host considers the connection established. The initiating
host then transmits a reply with the SYN bit set to 0 and the ACK bit set to 1. This is called
an ACK segment. (Note that ACK segments occur throughout the life of a connection,
whereas the SYN segment and the SYN/ACK segment occur only during connection
establishment.) The Acknowledgement Number field contains the responding host's ISN
incremented by one. Despite the absence of data, the Sequence Number field contains the
initiating host's ISN incremented by one. The ISN must be incremented by one because the
sole purpose of the ISN is to synchronize the data byte counter in the responding host's
TCB with the data byte counter in the initiating host's TCB. So, the ISN may be used only
in the first packet transmitted in each direction. Upon receipt of the ACK segment, the
responding host considers the connection established. At this point, both hosts may transmit
ULP data. The first byte of ULP data sent by each host is identified by incrementing the ISN
by one. Figure 7-10 illustrates TCP's three-way handshake.
Figure 7-10 TCP Three-Way Handshake
[Figure: The initiating host moves from Closed to SYN Sent to Established; the responding host moves from Listening to SYN Received to Established. The SYN segment carries SN 1600, AN 0 (SYN 1, ACK 0); the SYN/ACK segment carries SN 550, AN 1601 (SYN 1, ACK 1); the ACK segment carries SN 1601, AN 551 (SYN 0, ACK 1); the first Data segment also carries SN 1601, AN 551.]
Fibre Channel
As previously stated, the FC network model includes some transport layer functionality.
End-to-end connections are established via the PLOGI ELS command. Communication
parameters may be updated during an active connection via the discover N_Port parameters
(PDISC) ELS command. Segmentation and reassembly services are also supported. The
following discussion considers only CoS 3 connections.
FC Operational Overview
PLOGI is mandatory for all N_Ports, and data transfer between N_Ports is not permitted
until PLOGI completes. PLOGI may be performed explicitly or implicitly. Only explicit
PLOGI is described in this book for the sake of clarity. PLOGI establishes an end-to-end
connection between the participating N_Ports. PLOGI is accomplished with a single-frame
request followed by a single-frame response. In switched FC environments, PLOGI
accomplishes the following tasks:
Provides the port worldwide name (PWWN) and node worldwide name (NWWN) of
each N_Port to the peer N_Port
Like a TCP host, an FC node may open and maintain multiple PLOGI connections
simultaneously. Unlike a TCP host, only a single PLOGI connection may be open between
a pair of nodes at any point in time. This requires all ULPs to share the connection. However,
ULPs may not share an Exchange. Similar to TCP connections, PLOGI connections are
long-lived. However, keep-alives are not necessary because of the registered state change
notification (RSCN) process discussed in Chapter 3, "Overview of Network Operating
Principles." Another difference between TCP and FC is the manner in which ULP data is
segmented. In the FC model, ULPs specify which PDU is to be transmitted in each sequence.
If a PDU exceeds the maximum payload of an FC frame, the PDU is segmented, and each
segment is mapped to an FC frame. FC segments a discrete chunk of ULP data into each
sequence, so FC can reassemble the segments (frames of a sequence) into a discrete chunk
for a ULP (such as FCP). So, receiving nodes reassemble related segments into a PDU before
passing received data to a ULP. The segments of each PDU are tracked via the SEQ_CNT
field in the FC header. In addition to the SEQ_CNT field, the Sequence Qualifier fields are
required to track all data transmitted on a connection. Together, these fields provide equivalent
data tracking functionality to the IP Source Address, IP Destination Address, TCP Source
Port, TCP Destination Port, and TCP Sequence Number fields.
Two methods are available for segmentation and reassembly of ULP data: SEQ_CNT
and relative offset. For each method, two modes are defined. The SEQ_CNT method
may use a per-sequence counter (called normal mode) or a per-Exchange counter (called
continuously increasing mode). The relative offset method may consider the value of the
SEQ_CNT field in the FC header (called continuously increasing mode) or not (called
random mode). If the SEQ_CNT method is used, frames received within a single sequence
or a consecutive series of sequences are concatenated using the SEQ_CNT field in the
FC header. If the relative offset method is used, frames received within a single sequence
are concatenated using the Parameter field in the FC header. The method is specified per
sequence by the sequence initiator via the Relative Offset Present bit of the F_CTL field in
the FC header.
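The segmentation and reassembly just described can be sketched abstractly. The following (not from the book) splits a ULP PDU into frame payloads tagged with both a normal-mode SEQ_CNT and a relative offset; the 2048-byte maximum payload and the dictionary field names are illustrative assumptions, not FC frame encoding.

```python
def segment_pdu(pdu: bytes, max_payload: int = 2048):
    """Split a ULP PDU into per-frame payloads for one FC sequence.
    SEQ_CNT here models normal mode (counter restarts at 0 per sequence);
    the relative offset models the value carried in the Parameter field."""
    frames = []
    for seq_cnt, offset in enumerate(range(0, len(pdu), max_payload)):
        frames.append({
            "SEQ_CNT": seq_cnt,
            "relative_offset": offset,
            "payload": pdu[offset:offset + max_payload],
        })
    return frames

def reassemble(frames) -> bytes:
    """Concatenate frame payloads in SEQ_CNT order to recover the PDU."""
    return b"".join(f["payload"] for f in sorted(frames, key=lambda f: f["SEQ_CNT"]))

# A hypothetical 5000-byte PDU segments into three frames (2048 + 2048 + 904).
pdu = bytes(5000)
frames = segment_pdu(pdu)
```

Either key works for reassembly, mirroring the choice between the SEQ_CNT and relative offset methods negotiated at PLOGI.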
Note that the value of the Relative Offset Present bit must be the same in each frame
within each sequence within an Exchange. When the Relative Offset Present bit is set
to 1, relative offset is used. When the Relative Offset Present bit is set to 0, SEQ_CNT
is used. FCP always uses the relative offset method. To use a particular method, both
N_Ports must support that method. The supported methods are negotiated during PLOGI.
The SEQ_CNT method must be supported by all N_Ports. The relative offset method
is optional. The SEQ_CNT bit in the Common Service Parameters field in the PLOGI
ELS indicates support for SEQ_CNT continuously increasing mode. The Continuously
Increasing Relative Offset bit, the Random Relative Offset bit, and the Relative Offset By
Info Category bit in the Common Service Parameters field in the PLOGI ELS indicate
support for each mode of the relative offset method, and for applicability per information
category.
Like TCP, FC supports multiple ULPs. So, FC requires a mechanism equivalent to TCP
port numbers. This is provided via the Type field in the FC header, which identifies the
ULP contained within each frame. Unlike TCP, FC uses the same Type code on the host
(initiator) and storage (target) to identify the ULP. Because no more than one PLOGI
connection can be open between a pair of nodes at any point in time, there is no need for
unique connection identifiers between each pair of nodes. This contrasts with the TCP
model, in which multiple TCP connections may be open simultaneously between a pair of
nodes. Thus, there is no direct analog for the concept of TCP sockets in the FC model. The
S_ID and D_ID fields in the FC header are sufficient to uniquely identify each PLOGI
connection. If multiple ULPs are communicating between a pair of nodes, they must share
the connection. This, too, contrasts with the TCP model, in which ULPs are not permitted to
share TCP connections.
Multiple information categories are defined in FC. The most commonly used are Solicited
Data, Unsolicited Data, Solicited Control, and Unsolicited Control. (Chapter 8, "OSI
Session, Presentation, and Application Layers," discusses Solicited Data and Unsolicited
Data.) FC permits the transfer of multiple information categories within a single sequence.
The ability to mix information categories within a single sequence is negotiated during
PLOGI. This is accomplished via the Categories Per Sequence sub-field of the Class 3
Service Parameters field of the PLOGI ELS.
The ability to support multiple simultaneous sequences and Exchanges can improve
throughput significantly. However, each sequence within each Exchange consumes
resources in the end nodes. Because every end node has limited resources, every end node
has an upper limit to the number of simultaneous sequences and Exchanges it can support.
During PLOGI, each node informs the peer node of its own limitations. The maximum
number of concurrent sequences that are supported across all classes of service is indicated
via the Total Concurrent Sequences sub-field of the Common Service Parameters field
of the PLOGI ELS. The maximum number of concurrent sequences that are supported
within Class 3 is indicated via the Concurrent Sequences sub-field of the Class 3 Service
Parameters field of the PLOGI ELS. The maximum number of concurrent sequences that
are supported within each Exchange is indicated via the Open Sequences Per Exchange
sub-field of the Class 3 Service Parameters field of the PLOGI ELS.
FC Frame Formats
The PLOGI ELS and associated LS_ACC ELS use the exact same frame format as
fabric login (FLOGI), which is discussed in Chapter 5, "OSI Physical and Data-Link
Layers," and diagrammed in Figure 5-25. Figure 5-25 is duplicated in this section as
Figure 7-11 for the sake of convenience. Some fields have common meaning and
applicability for FLOGI and PLOGI. Other fields have unique meaning and applicability
for PLOGI. This section highlights the meaning of each field for PLOGI.
Figure 7-11 Data Field Format of an FC PLOGI/LS_ACC ELS Frame
[Figure: LS Command Code (Word 0), N_Port Name (Words 5 and 6), Services Availability (Words 29 and 30), and the remaining login parameter fields spanning Words 1 through 63.]
LS Command Code: 4 bytes long. It contains the 1-byte PLOGI command code
(0x03) followed by 3 bytes of zeros when transmitted by an initiating N_Port. This
field contains the 1-byte LS_ACC command code (0x02) followed by 3 bytes of zeros
when transmitted by a responding N_Port.
Login Extension Data: 120 bytes long. It contains the vendor identity and other
vendor-specific information. This field is valid only if the Common Service
Parameters field indicates that the PLOGI payload is 256 bytes long.
Node Name/Fabric Name: 8 bytes long. It contains the NWWN associated with the
N_Port.
the Clock Synchronization service is supported by the switch. This field is valid
only if the Common Service Parameters field indicates that the PLOGI payload is
256 bytes long.
The PDISC ELS and associated LS_ACC ELS use the exact same frame format as
PLOGI. The meaning of each field is also identical. The LS Command Code field
contains the PDISC command code (0x50). The PDISC ELS enables N_Ports to
exchange operating characteristics during an active connection without affecting
any sequences or exchanges that are currently open. This is useful when operating
characteristics change for any reason. For the new operating characteristics to take
effect, the connection must be terminated and re-established. Thus, PDISC is merely
a notification mechanism.
FC Delivery Mechanisms
As previously stated, the FC model provides multiple sets of delivery mechanisms via
multiple Classes of Service. All Classes of Service are implemented primarily at the
data-link layer. This contrasts with the TCP/IP model, in which the choice of transport layer
protocol determines which set of delivery mechanisms is supported. The majority of
modern FC-SANs are configured to use Class 3. Class 3 delivery mechanisms are discussed
in Chapter 5, "OSI Physical and Data-Link Layers," so this section discusses only the
additional delivery mechanisms provided by PLOGI.
When a missing frame is detected, the behavior of the receiving node is determined by
the Exchange Error Policy in effect for the Exchange in which the error is detected. The
Exchange originator specifies the Exchange Error Policy on a per-Exchange basis via the
Abort Sequence Condition bits of the F_CTL field of the FC header of the first frame of
each new Exchange. The Exchange originator may specify any one of the Exchange Error
Policies that are supported by the target node. Initiators discover which Exchange Error
Policies are supported by each target node via the Error Policy Supported bits of the Class 3
Service Parameters field of the PLOGI ELS. See Chapter 5, "OSI Physical and Data-Link
Layers," for a description of each Exchange Error Policy. Target nodes may support all
three policies (Process, Discard Sequence, Discard Exchange) or only the two Discard
policies. Regardless of the Exchange Error Policy in effect, detection of a missing frame is
always reported to the ULP.
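The receiver-side behavior described above can be sketched as a small dispatch on the policy in effect. This is an illustrative model only; the data structures and function names are invented, and only the three policy names come from the text.

```python
# Hypothetical sketch of a receiver's reaction to a missing frame under each
# Exchange Error Policy. Regardless of policy, the loss is reported to the ULP.

POLICIES = ("Process", "Discard Sequence", "Discard Exchange")

def handle_missing_frame(policy, exchange):
    """Return which sequences the receiver keeps; the ULP is always notified."""
    if policy not in POLICIES:
        raise ValueError(f"unsupported policy: {policy}")
    if policy == "Process":
        # Deliver everything that arrived; the gap is reported but nothing is discarded.
        kept = list(exchange["sequences"])
    elif policy == "Discard Sequence":
        # Drop only the sequence containing the missing frame.
        kept = [s for s in exchange["sequences"] if not s["missing_frame"]]
    else:  # Discard Exchange: drop the entire Exchange.
        kept = []
    return {"kept_sequences": kept, "ulp_notified": True}

exchange = {"sequences": [
    {"seq_id": 1, "missing_frame": False},
    {"seq_id": 2, "missing_frame": True},
]}
print(handle_missing_frame("Discard Sequence", exchange)["kept_sequences"])
```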
As discussed in Chapter 5, "OSI Physical and Data-Link Layers," the CS_CTL/Priority
field of the FC header can be interpreted as CS_CTL or Priority. The CS_CTL
interpretation enables the use of DiffServ QoS and frame preference. The Priority/
Preemption interpretation enables the use of priority QoS and Class 1 or 6 connection
preemption. Each node discovers the fabric's ability to support DiffServ QoS/frame
preference and Priority QoS/connection preemption during FLOGI via the DiffServ QoS
bit, Preference bit, and Priority/Preemption bit, respectively. All three of these bits are
contained in the Class 3 Service Parameters field of the FLOGI ELS. Likewise, during
FLOGI, each node informs the fabric of its own support for each of these features. For each
feature supported by both the fabric and the node, the node is permitted to negotiate use of
the feature with each peer node during PLOGI. This is accomplished via the same three bits
of the Class 3 Service Parameters field of the PLOGI ELS.
Each node discovers the PMTU during FLOGI. However, a node might not be able to
generate and accept frames as large as the PMTU. So, each node informs each peer node of
its own MTU via the Receive Data Field Size bits of the Class 3 Service Parameters field
of the PLOGI ELS. The lower of the PMTU and node MTU is used for all subsequent
communication between each pair of nodes. Because PLOGI is required before any end-to-end communication occurs, fragmentation is precluded.
Comprehensive exploration of all aspects of the delivery mechanisms facilitated by PLOGI
is outside the scope of this book. For more information about the end-to-end delivery
mechanisms employed by FC, readers are encouraged to consult the ANSI T11 FC-FS and
FC-LS specification series.
FC Connection Initialization
Class 3 service does not provide guaranteed delivery, so end-to-end buffering and
acknowledgements are not required. This simplifies the connection establishment
procedure. Following FLOGI, each N_Port performs PLOGI with the FC switch. This is
required to facilitate Fibre Channel name server (FCNS) registration and subsequent FCNS
queries as discussed in Chapter 3, "Overview of Network Operating Principles." Following
PLOGI with the switch, each target N_Port waits for PLOGI requests from initiator
N_Ports. Following PLOGI with the switch, each initiator N_Port performs PLOGI with
each target N_Port discovered via the FCNS. For this reason, proper FCNS zoning must be
implemented to avoid accidental target access, which could result in a breach of data
security policy or data corruption (see Chapter 12, "Storage Network Security").
An initiator N_Port begins the PLOGI procedure by transmitting a PLOGI ELS frame to
the FC address identifier (FCID) of a target N_Port. The FCID assigned to each target
N_Port is discovered via the FCNS. Upon recognition of the PLOGI request, the target
N_Port responds to the initiator N_Port by transmitting a PLOGI LS_ACC ELS. Upon
recognition of the PLOGI LS_ACC, the PLOGI procedure is complete, and the N_Ports
may exchange ULP data.
Note that FC does not implement a connection control block in the same manner as TCP.
As previously mentioned, TCP creates a TCB per connection to maintain state for each
connection. In FC, state is maintained per exchange rather than per connection. After
PLOGI completes, the initiator creates an exchange status block (ESB), assigns an
OX_ID to the first Exchange, and associates the OX_ID with the ESB. The ESB tracks
the state of sequences within the Exchange. The initiator then creates a sequence status
block (SSB), assigns a SEQ_ID to the first sequence, associates the SEQ_ID with the
SSB, and associates the SSB with the ESB. The SSB tracks the state of frames within
the sequence. The first frame of the first sequence of the first Exchange may then be
transmitted by the initiator. Upon receipt of the first frame, the target N_Port creates
an ESB and an SSB. The target N_Port then assigns an RX_ID to the Exchange and
associates the RX_ID with the ESB. The value of the RX_ID is often the same as the
value of the OX_ID. The target N_Port then associates the SEQ_ID received in the frame
with the SSB and associates the SSB with the ESB. The target N_Port may then process
the frame and pass the ULP data to the appropriate FC-4 protocol. This entire procedure
is repeated for each new Exchange.
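The ESB/SSB bookkeeping above can be modeled with two small data structures. This is a hypothetical sketch; the class and attribute names are invented for illustration and do not come from any specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SequenceStatusBlock:          # tracks the state of frames within one sequence
    seq_id: int
    frames_received: int = 0

@dataclass
class ExchangeStatusBlock:          # tracks the state of sequences within one Exchange
    ox_id: int
    rx_id: Optional[int] = None     # assigned by the target upon the first frame
    sequences: Dict[int, SequenceStatusBlock] = field(default_factory=dict)

# Initiator side: create the ESB, assign an OX_ID, then create the first SSB
# and associate it with the ESB.
initiator_esb = ExchangeStatusBlock(ox_id=0x0001)
initiator_esb.sequences[0] = SequenceStatusBlock(seq_id=0)

# Target side, upon receipt of the first frame: mirror the ESB/SSB and assign
# an RX_ID, whose value is often the same as the OX_ID.
target_esb = ExchangeStatusBlock(ox_id=initiator_esb.ox_id,
                                 rx_id=initiator_esb.ox_id)
target_esb.sequences[0] = SequenceStatusBlock(seq_id=0, frames_received=1)

print(target_esb.rx_id == target_esb.ox_id)   # True
```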
Summary
A major difference between the TCP/IP model and the FC model is the manner in which
end-to-end delivery mechanisms are implemented. With TCP/IP, the transport layer
protocol determines which end-to-end delivery mechanisms are provided. By contrast,
FC does not implement distinct transport layer protocols. Instead, the Class of Service
determines which end-to-end delivery mechanisms are provided. One feature common
to both the TCP/IP and FC models is the ability to negotiate which end-to-end delivery
mechanisms will be used between each pair of communicating nodes. For TCP, this is
accomplished via the three-way handshake procedure. For UDP, no such negotiation
occurs. For FC, this is accomplished via the PLOGI procedure. Whereas the TCP/IP suite
offers multiple transport layer protocols, most data is transmitted using TCP and UDP.
TCP is connection-oriented and guarantees delivery, while UDP is connectionless and
does not guarantee delivery. The nature of the source application determines whether
TCP or UDP is appropriate. Similarly, FC offers multiple Classes of Service. However,
the vast majority of modern FC-SANs are configured to use only Class 3 service
regardless of the nature of the source application. Class 3 service blends the features of
TCP and UDP. Like TCP, Class 3 service is connection-oriented. Like UDP, Class 3
service does not guarantee delivery.
Review Questions
1 Is reliable delivery required by all applications?
2 Does UDP provide segmentation/reassembly services?
3 What is the purpose of the Destination Port field in the UDP packet header?
4 Does UDP provide notication of dropped packets?
5 Is the segmentation process controlled by TCP or the ULP?
6 Why is the TCP flow-control mechanism called a sliding window?
7 What is the formula for calculating the bandwidth-delay product?
Upon completing this chapter, you will be able to:
- Relate the addressing schemes of Internet Small Computer System Interface (iSCSI), Fibre Channel Protocol (FCP), and Fibre Channel over TCP/IP (FCIP) to the SCSI architecture model (SAM) addressing scheme
- Discuss the iSCSI name and address assignment and resolution procedures
- Differentiate the FCIP functional models
- List the session establishment procedures of iSCSI, FCP, and FCIP
- Explicate the data transfer optimizations of iSCSI, FCP, and FCIP
- Decode the packet and frame formats of iSCSI, FCP, and FCIP
- Explain the delivery mechanisms supported by iSCSI, FCP, and FCIP
CHAPTER 8
iSCSI node name for each SCSI device. If the optional device-specific string is not used,
only one SCSI device can exist within the naming authority's namespace (as identified by
the reverse domain name itself). Therefore, the optional device-specific string is a
practical requirement for real-world deployments. iSCSI node names of type IQN are
variable in length up to a maximum of 223 characters. Examples include
iqn.1987-05.com.cisco:host1
iqn.1987-05.com.cisco.apac.singapore:ccm-host1
iqn.1987-05.com.cisco.erp:dr-host8-vpar1
The EUI type provides globally unique iSCSI node names assuming that the Extension
Identifier sub-field within the EUI-64 string is not locally administered (see Chapter 5,
"OSI Physical and Data-Link Layers"). The EUI format has two components: a type
designator followed by a dot and a device-specific string. The type designator is "eui". The
device-specific string is a valid IEEE EUI-64 string. Because the length of an EUI-64 string
is eight bytes, and EUI-64 strings are expressed in hexadecimal, the length of an iSCSI node
name of type EUI is fixed at 20 characters. For example
eui.02004567A425678D
The NAA type is based on the ANSI T11 NAA scheme. As explained in Chapter 5, the
ANSI T11 NAA scheme supports many formats. Some formats provide global uniqueness;
others do not. In iSCSI, the NAA format has two components: a type designator followed
by a dot and a device-specific string. The type designator is "naa". The device-specific string
is a valid ANSI T11 NAA string. Because the length of ANSI T11 NAA strings can be either
8 or 16 bytes, iSCSI node names of type NAA are variable in length. ANSI T11 NAA
strings are expressed in hexadecimal, so the length of an iSCSI node name of type NAA is
either 20 or 36 characters. Examples include
naa.52004567BA64678D
naa.62004567BA64678D0123456789ABCDEF
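The three name types above follow distinct, easily checked shapes. The following is a rough classifier based on the published formats; it is a sketch, not a full RFC 3722 string-profile validator, and the function name is invented.

```python
import re

def classify_iscsi_name(name: str) -> str:
    """Classify an iSCSI node name as IQN, EUI, or NAA by its shape."""
    if name.startswith("iqn.") and len(name) <= 223:
        return "IQN"
    if re.fullmatch(r"eui\.[0-9A-Fa-f]{16}", name):
        return "EUI"   # 8-byte EUI-64 -> 16 hex digits -> 20 characters total
    if re.fullmatch(r"naa\.[0-9A-Fa-f]{16}|naa\.[0-9A-Fa-f]{32}", name):
        return "NAA"   # 8- or 16-byte NAA string -> 20 or 36 characters total
    return "unknown"

print(classify_iscsi_name("iqn.1987-05.com.cisco:host1"))            # IQN
print(classify_iscsi_name("eui.02004567A425678D"))                   # EUI
print(classify_iscsi_name("naa.62004567BA64678D0123456789ABCDEF"))   # NAA
```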
iSCSI allows the use of aliases for iSCSI node names. The purpose of an alias is to provide
an easily recognizable, meaningful tag that can be displayed in tools, utilities, and other
user interfaces. Although IQNs are text-based and somewhat intuitive, the device-specific
portion might need to be very long and even cryptic to adequately ensure a scalable
nomenclature in large iSCSI environments. Likewise, EUI and NAA formatted node names
can be very difficult for humans to interpret. Aliases solve this problem by associating a
human-friendly tag with each iSCSI node name. Aliases are used only by humans. Aliases
must not be used for authentication or authorization. Aliases are variable in length up to a
maximum of 255 characters.
SCSI port names and identifiers are handled somewhat differently in iSCSI as compared to
other SAM Transport Protocols. iSCSI uses a single string as both the port name and port
identifier. The string is globally unique, so the string positively identifies each port
within the context of iSCSI. This complies with the defined SAM functionality for port
names. However, the string does not contain any resolvable address to facilitate packet
forwarding. Thus, iSCSI requires a mapping of its port identifiers to other port types that
can facilitate packet forwarding. For this purpose, iSCSI employs the concept of a network
portal. Within an iSCSI device, an Ethernet (or other) interface configured with an IP
address is called a network portal. Network portals facilitate packet forwarding, while
iSCSI port identifiers serve as session endpoint identifiers. Within an initiator device, each
network portal is identified by its IP address. Within a target device, each network portal is
identified by its IP address and listening TCP port (its socket). Network portals that share
compatible operating characteristics may form a portal group. Within a target device, each
portal group is assigned a target portal group tag (TPGT).
SCSI ports are implemented differently for iSCSI initiators versus iSCSI targets. Upon
resolution of a target iSCSI node name to one or more sockets, an iSCSI initiator logs into
the target. After login completes, a SCSI port is dynamically created within the initiator. In
response, the iSCSI port name and identifier are created by concatenating the initiator's
iSCSI node name, the letter "i" (indicating that this port is contained within an initiator
device), and the initiator session identifier (ISID). The ISID is 6 bytes long and is expressed
in hexadecimal. These three fields are comma-separated. For example
iqn.1987-05.com.cisco:host1,i,0x00023d000002
The order of these events might seem counterintuitive, but recall that iSCSI port identifiers
do not facilitate packet forwarding; network portals do. So, iSCSI port identifiers do not
need to exist before the iSCSI login. Essentially, iSCSI login signals the need for a SCSI
port, which is subsequently created and then used by the SCSI Application Layer (SAL).
Recall from Chapter 5 that SAM port names must never change. One might think that iSCSI
breaks this rule, because initiator port names seem to change regularly. iSCSI creates and
destroys port names regularly, but never changes port names. iSCSI generates each new
port name in response to the creation of a new SCSI port. Upon termination of an iSCSI
session, the associated initiator port is destroyed. Because the iSCSI port name does not
change during the lifetime of its associated SCSI port, iSCSI complies with the SAM
persistence requirement for port names.
In a target device, SCSI ports are also created dynamically in response to login requests.
The target port name and identifier are created by concatenating the target's iSCSI node
name, the letter "t" (indicating that this port is contained within a target device), and the
TPGT. The target node infers the appropriate TPGT from the IP address at which the login
request is received. The TPGT is 2 bytes long and is expressed in hexadecimal. These three
fields are comma-separated. For example
iqn.1987-05.com.cisco:array1,t,0x4097
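The two constructions above are simple string concatenations. A minimal sketch (function names are invented; the string layouts follow the examples in the text):

```python
# Build iSCSI port identifiers: node name, a role letter, and the ISID
# (initiator, 6 bytes) or TPGT (target, 2 bytes), comma-separated.

def initiator_port_id(node_name: str, isid: int) -> str:
    return f"{node_name},i,0x{isid:012x}"     # ISID rendered as 6 bytes of hex

def target_port_id(node_name: str, tpgt: int) -> str:
    return f"{node_name},t,0x{tpgt:04x}"      # TPGT rendered as 2 bytes of hex

print(initiator_port_id("iqn.1987-05.com.cisco:host1", 0x00023d000002))
print(target_port_id("iqn.1987-05.com.cisco:array1", 0x4097))
```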
All network portals within a target portal group share the same iSCSI port identifier and
represent the same SCSI port. Because iSCSI target port names and identifiers are based
on TPGTs, and because a network portal may operate independently (not part of a portal
group), a TPGT must be assigned to each network portal that is not part of a portal group.
Thus, a network portal that operates independently forms a portal group of one.
Some background information helps us fully understand ISID and TPGT usage. According
to the SAM, the relationship between a SCSI initiator port and a SCSI target port is known
as the initiator-target nexus (I_T nexus). The I_T nexus concept underpins all session-oriented constructs in all modern storage networking technologies. At any point in time,
only one I_T nexus can exist between a pair of SCSI ports. According to the SAM, an I_T
nexus is identified by the conjunction of the initiator port identifier and the target port
identifier. The SAM I_T nexus is equivalent to the iSCSI session. Thus, only one iSCSI
session can exist between an iSCSI initiator port identifier and an iSCSI target port
identifier at any point in time.
iSCSI initiators adhere to this rule by incorporating the ISID into the iSCSI port identifier.
Each new session is assigned a new ISID, which becomes part of the iSCSI port identifier of
the newly created SCSI port, which becomes part of the I_T nexus. Each ISID is unique
within the context of an initiator-target-TPGT triplet. If an initiator has an active session
with a given target device and establishes another session with the same target device via a
different target portal group, the initiator may reuse any active ISID. In this case, the new
I_T nexus is formed between a unique pair of iSCSI port identifiers because the target port
identifier includes the TPGT. Likewise, any active ISID may be reused for a new session
with a new target device. Multiple iSCSI sessions may exist simultaneously between an
initiator device and a target device as long as each session terminates on a different iSCSI
port identifier (representing a different SCSI port) on at least one end of the session.
Initiators accomplish this by connecting to a different target portal group or assigning a new
ISID. Note that RFC 3720 encourages the reuse of ISIDs in an effort to promote initiator
SCSI port persistence for the benefit of applications, and to facilitate target recognition of
initiator SCSI ports in multipath environments.
NOTE
RFC 3720 officially defines the I_T nexus identifier as the concatenation of the iSCSI
initiator port identifier and the iSCSI target port identifier (initiator node name + "i" + ISID
+ target node name + "t" + TPGT). This complies with the SAM definition of I_T nexus
identifier. However, RFC 3720 also defines a session identifier (SSID) that can be used to
reference an iSCSI session. The SSID is defined as the concatenation of the ISID and the
TPGT. Because the SSID is ambiguous, it has meaning only in the context of a given
initiator-target pair.
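The contrast in the note above can be made concrete: two initiators may legally use the same ISID toward the same target portal group, so their SSIDs collide while their full I_T nexus identifiers do not. The tuple and string layouts below are illustrative, not normative.

```python
# Hypothetical sketch: full I_T nexus identifier vs. the shorter SSID.

def i_t_nexus_id(initiator_node, isid, target_node, tpgt):
    # Initiator port identifier plus target port identifier: globally unambiguous.
    return (f"{initiator_node},i,0x{isid:012x}", f"{target_node},t,0x{tpgt:04x}")

def ssid(isid, tpgt):
    # ISID + TPGT only: meaningful only for a known initiator-target pair.
    return (isid, tpgt)

n1 = i_t_nexus_id("iqn.1987-05.com.cisco:host1", 0x2,
                  "iqn.1987-05.com.cisco:array1", 0x4097)
n2 = i_t_nexus_id("iqn.1987-05.com.cisco:host2", 0x2,
                  "iqn.1987-05.com.cisco:array1", 0x4097)

# Same SSID, different I_T nexus identifiers.
print(ssid(0x2, 0x4097) == ssid(0x2, 0x4097), n1 != n2)   # True True
```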
iSCSI may be implemented as multiple hardware and software components within a single
network entity. As such, coordination of ISID generation in an RFC-compliant manner
across all involved components can be challenging. For this reason, RFC 3720 requires a
single component to be responsible for the coordination of all ISID generation activity. To
facilitate this rule, the ISID format is flexible. It supports a namespace hierarchy that
enables coordinated delegation of ISID generation authority to various independent
components within the initiator entity. Figure 8-1 illustrates the general ISID format.
Figure 8-1 General ISID Format
[Figure: two-word (6-byte) ISID bit layout comprising the T, A, B, C, and D fields described below.]
T: This is 2 bits long and indicates the format of the ISID. Though not explicitly
stated in RFC 3720, T presumably stands for "type."
A: This is 6 bits long and may be concatenated with the B field. Otherwise, the A
field is reserved.
B: This is 16 bits long and may be concatenated with the A field or the C field.
Otherwise, the B field is reserved.
C: This is 8 bits long and may be concatenated with the B field or the D field.
Otherwise, the C field is reserved.
D: This is 16 bits long and may be concatenated with the C field or used as an
independent field. Otherwise, the D field is reserved.
Table 8-1 summarizes the possible values of T and the associated field descriptions.

Table 8-1  ISID Format

T    Format    Field Descriptions
00b  OUI       A & B form a 22-bit field that contains the OUI of the vendor of the component that generates the ISID. The I/G and U/L bits are omitted. C & D form a 24-bit qualifier field that contains a value generated by the component.
01b  EN        A is reserved. B & C form a 24-bit field that contains the IANA enterprise number (EN) of the vendor of the component that generates the ISID. D is a 16-bit qualifier field that contains a value generated by the component.
10b  Random    A is reserved. B & C form a 24-bit field that contains a random value generated by the component. D is a 16-bit qualifier field that contains a value generated by the component.
11b  Reserved  Reserved.
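The bit arithmetic behind Table 8-1 is easiest to see in code. Below is a sketch of packing and unpacking a 6-byte ISID in the T=00b (OUI) format, where A & B jointly hold a 22-bit OUI and C & D hold a 24-bit qualifier; the function names and sample values are hypothetical.

```python
# ISID bit layout: T (2 bits), A (6), B (16), C (8), D (16) -- 48 bits total.

def pack_isid_oui(oui22: int, qualifier24: int) -> bytes:
    """Pack an ISID in the T=00b format: 22-bit OUI + 24-bit qualifier."""
    assert oui22 < (1 << 22) and qualifier24 < (1 << 24)
    value = (0b00 << 46) | (oui22 << 24) | qualifier24
    return value.to_bytes(6, "big")

def unpack_isid(isid: bytes):
    """Return (T, OUI, qualifier) for a T=00b ISID."""
    v = int.from_bytes(isid, "big")
    t = v >> 46                          # top 2 bits
    oui = (v >> 24) & ((1 << 22) - 1)    # next 22 bits (A & B)
    qualifier = v & ((1 << 24) - 1)      # low 24 bits (C & D)
    return t, oui, qualifier

isid = pack_isid_oui(0x00025B, 0x000002)   # hypothetical OUI and qualifier
print(unpack_isid(isid))                   # (0, 603, 2)
```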
The target also assigns a session identifier to each new session. This is known as the target
session identifying handle (TSIH). During login for a new session, the initiator uses a
TSIH value of zero. The target generates the TSIH value during login and sends the new
value to the initiator in the final login response. In all subsequent packets, the assigned
TSIH is used by the initiator to enable the target to associate received packets with the
correct session. The TSIH is two bytes long, but the format of the TSIH is not defined in
RFC 3720. Each target determines its own TSIH format. For more information about iSCSI
device names, port names, port identifiers, and session identifiers, readers are encouraged
to consult IETF RFCs 3720, 3721, 3722, and 3980, and ANSI T10 SAM-3.
Although the operational parameter negotiation stage is optional according to RFC 3720, it
is a practical requirement for real-world deployments. Each initiator and target device must
support the same operational parameters to communicate successfully. It is possible for the
default settings of every iSCSI device to match, but it is not probable. So, negotiable
parameters must be configured manually or autonegotiated. Manually setting all negotiable
parameters on every iSCSI device can be operationally burdensome. Thus, the operational
parameter negotiation stage is implemented by all iSCSI devices currently on the market.
Support for unsolicited writes, the maximum burst length, and various other parameters are
negotiated during this stage.
Following the login phase, the iSCSI session transitions to the full feature phase. During the
full feature phase of a normal session, initiators can issue iSCSI commands as well as send
SCSI commands and data. Additionally, certain iSCSI operational parameters can be renegotiated during the full feature phase. When all SCSI operations are complete, a normal
iSCSI session can be gracefully terminated via the iSCSI Logout command. If a normal
session is terminated unexpectedly, procedures are defined to clean up the session before
reinstating the session. Session cleanup prevents processing of commands and responses
that might have been delayed in transit, thus avoiding data corruption. Procedures are also
defined to re-establish a session that has been terminated unexpectedly, so SCSI processing
can continue from the point of abnormal termination. After all normal sessions have been
terminated gracefully, the discovery session (if extant) can be terminated gracefully via the
iSCSI Logout command. For more information about iSCSI session types, phases, and
stages, readers are encouraged to consult IETF RFC 3720. Figure 8-2 illustrates the flow of
iSCSI sessions, phases, and stages.
Figure 8-2 Flow of iSCSI Sessions, Phases, and Stages
[Figure: for each session, the Login Phase comprises an optional Security Parameter Negotiation Stage and an optional Operational Parameter Negotiation Stage, followed by the mandatory Full Feature Phase.]
Next, the target sends the requested data to the initiator. Finally, the target sends a SCSI
status indicator to the initiator. The read command specifies the starting block address and
the number of contiguous blocks to transfer. If the data being retrieved by the application
client is fragmented on the storage medium (that is, stored in non-contiguous blocks), then
multiple read commands must be issued (one per set of contiguous blocks). For each set of
contiguous blocks, the initiator may issue more than one read command if the total data in
the set of contiguous blocks exceeds the initiator's available receive buffer resources. This
eliminates the need for a flow-control mechanism for read commands. A target always
knows it can send the entire requested data set because an initiator never requests more data
than it is prepared to receive. When multiple commands are issued to satisfy a single
application client request, the commands may be linked together as a single SCSI task.
Such commands are called SCSI linked commands.
The basic procedure for a SCSI write operation involves four steps. First, the initiator sends a
SCSI write command to the target. Next, the target sends an indication that it is ready to
receive the data. Next, the initiator sends the data. Finally, the target sends a SCSI status
indicator to the initiator. The write command specifies the starting block address and the
number of contiguous blocks that will be transferred by this command. If the data being stored
by the application client exceeds the largest contiguous set of available blocks on the medium,
multiple write commands must be issued (one per set of contiguous blocks). The commands
may be linked as a single SCSI task. A key difference between read and write operations is
the initiator's knowledge of available receive buffer space. When writing, the initiator does
not know how much buffer space is currently available in the target to receive the data. So, the
target must inform the initiator when the target is ready to receive data (that is, when receive
buffers are available). The target must also indicate how much data to transfer. In other words,
a flow-control mechanism is required for write operations. The SAM delegates responsibility
for this flow-control mechanism to each SCSI Transport Protocol.
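The write flow-control handshake above can be modeled as a simple loop: the target solicits data by announcing how much buffer it can accept, and the initiator sends exactly the solicited amount. This toy simulation is illustrative only; the names and the fixed grant size are invented.

```python
# Toy model of target-driven write flow control: each "ready to transfer"
# indication grants up to target_buffer bytes, and the initiator complies.

def write_with_flow_control(data: bytes, target_buffer: int):
    transfers = []
    offset = 0
    while offset < len(data):
        # Target: announces readiness and grants up to target_buffer bytes.
        grant = min(target_buffer, len(data) - offset)
        # Initiator: sends exactly the solicited amount.
        transfers.append(data[offset:offset + grant])
        offset += grant
    return transfers   # the target would then return SCSI status

chunks = write_with_flow_control(b"x" * 10000, target_buffer=4096)
print([len(c) for c in chunks])   # [4096, 4096, 1808]
```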
In iSCSI parlance, the data transfer steps are called phases (not to be confused with the
phases of an iSCSI session). iSCSI enables optimization of data transfer through phase-collapse.
Targets may include SCSI status as part of the final data PDU for read commands. This does
not eliminate any round-trips across the network, but it does reduce the total number of
PDUs required to complete the read operation. Likewise, initiators may include data with
write command PDUs. This can be done in two ways. Data may be included as part of the write
command PDU. This is known as immediate data. Alternately, data may be sent in one or
more data PDUs immediately following a write command PDU without waiting for the
target to indicate its readiness to receive data. This is known as unsolicited data. In both
cases, one round-trip across the network is eliminated, which reduces the total time to
completion for the write operation. In the case of immediate data, one data PDU is also
eliminated. The initiator must negotiate support for immediate data and unsolicited data
during login. Each feature is negotiated separately. If the target supports phase-collapse for
write commands, the target informs the initiator (during login) how much data may be
sent using each feature. Both features may be supported simultaneously. Collectively,
immediate data and unsolicited data are called first burst data. First burst data may be sent
only once per write command (as the first sequence of PDUs). For more information about
iSCSI phase-collapse, readers are encouraged to consult IETF RFC 3720.
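The first-burst budgeting described above can be sketched as a small planning function. The parameter names are modeled loosely on RFC 3720's ImmediateData, InitialR2T, and FirstBurstLength login keys, but the function itself and its simplifications (a single PDU-size cap for immediate data) are hypothetical.

```python
# Split a write into immediate, unsolicited, and solicited portions under
# hypothetical login-negotiated limits. First burst data is sent only once.

def plan_first_burst(write_len, immediate_ok, unsolicited_ok,
                     first_burst_limit, pdu_limit):
    budget = min(write_len, first_burst_limit)        # total first-burst budget
    immediate = min(budget, pdu_limit) if immediate_ok else 0   # rides in the command PDU
    unsolicited = budget - immediate if unsolicited_ok else 0   # separate data PDUs, no R2T
    solicited = write_len - immediate - unsolicited   # waits for target readiness
    return immediate, unsolicited, solicited

print(plan_first_burst(100_000, True, True,
                       first_burst_limit=65_536, pdu_limit=8_192))
# (8192, 57344, 34464)
```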
NOTE
In the generic sense, immediate data is actually a subset of unsolicited data. Unsolicited
data generically refers to any data sent to the target without first receiving an indication
from the target that the target is ready for the data transfer. By that generic definition,
immediate data qualifies as unsolicited data. However, the term unsolicited data has
specific meaning in the context of iSCSI. Note that data sent in response to an indication
of receiver readiness is called solicited data.
[Figure: general iSCSI PDU structure. Words #0–11 carry the mandatory BHS, followed by the optional AHS, optional Header-Digest, optional Data, and optional Data-Digest segments of variable length.]
BHS: This is 48 bytes long. It is the only mandatory field. The BHS field indicates
the type of PDU and contains most of the control information used by iSCSI.
The remainder of this section focuses on the BHS because the two defined AHSs are less
commonly used. Details of the BHS are provided for each of the primary iSCSI PDU types.
Figure 8-4 illustrates the general format of the iSCSI BHS. All fields marked with "." are
reserved.
Figure 8-4 iSCSI BHS Format
[Figure: 12-word (48-byte) BHS layout. Word #0: reserved bit, I bit, Opcode, and Opcode-specific sub-fields; Word #1: TotalAHSLength and DataSegmentLength; Words #2–3: LUN/Opcode-specific sub-fields; Word #4: ITT; Words #5–11: Opcode-specific sub-fields.]
Reserved: This is 1 bit.
Opcode: This is 6 bits long. The Opcode field contains an operation code that
indicates the type of PDU. Opcodes are defined as initiator opcodes (transmitted only
by initiators) and target opcodes (transmitted only by targets). RFC 3720 defines
18 opcodes (see Table 8-2).
F: This is 1 bit. F stands for final PDU. When this bit is set to 1, the PDU is the final
(or only) PDU in a sequence of PDUs. When this bit is set to 0, the PDU is followed by
one or more PDUs in the same sequence. The F bit is redefined by some PDU types.
I: This is 1 bit. I stands for immediate delivery. When an initiator sends an iSCSI
command or SCSI command that should be processed immediately, the I bit is set
to 1. When this bit is set to 1, the command is called an immediate command. This
should not be confused with immediate data (phase-collapse). When this bit is set
to 0, the command is called a non-immediate command.
Opcode-specific Sub-fields: These are 23 bits long. The format and use of all
Opcode-specific sub-fields are determined by the value in the Opcode field.
LUN field and Opcode-specific Sub-fields: These are 64 bits (8 bytes) long. They
contain the destination LUN if the Opcode field contains a value that is relevant to a
specific LUN (such as a SCSI command). When used as a LUN field, the format
complies with the SAM LUN format. When used as Opcode-specific sub-fields, the
format and use of the sub-fields are opcode-specific.
Initiator Task Tag (ITT): This is 32 bits long. It contains a tag assigned by the
initiator. An ITT is assigned to each iSCSI task. Likewise, an ITT is assigned to each
SCSI task. A SCSI task can represent a single SCSI command or multiple linked
commands. Each SCSI command can have many SCSI activities associated with it. A
SCSI task encompasses all activities associated with a SCSI command or multiple
linked commands. Likewise, an ITT that represents a SCSI task also encompasses all
associated activities of the SCSI command(s). An ITT value is unique only within the
context of the current session. The iSCSI ITT is similar in function to the FC fully
qualified exchange identifier (FQXID).
Opcode-specific Sub-fields: These are 224 bits (28 bytes) long. The format and use
of the sub-fields are opcode-specific.
Table 8-2 summarizes the iSCSI opcodes that are currently defined in RFC 3720. All
opcodes excluded from Table 8-2 are reserved.
Table 8-2  iSCSI Opcodes

Type       Value        Description
Initiator  0x00         NOP-out
Initiator  0x01         SCSI command
Initiator  0x02         SCSI task management function request
Initiator  0x03         Login request
Initiator  0x04         Text request
Initiator  0x05         SCSI data-out
Initiator  0x06         Logout request
Initiator  0x10         SNACK request
Initiator  0x1C-0x1E    Vendor-specific codes
Target     0x20         NOP-in
Target     0x21         SCSI response
Target     0x22         SCSI task management function response
Target     0x23         Login response
Target     0x24         Text response
Target     0x25         SCSI data-in
Target     0x26         Logout response
Target     0x31         Ready to transfer (R2T)
Target     0x32         Asynchronous message
Target     0x3C-0x3E    Vendor-specific codes
Target     0x3F         Reject
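The fixed BHS fields described above can be decoded with ordinary byte and bit arithmetic, assuming the layout given in the text (byte 0 carries the reserved bit, the I bit, and the 6-bit opcode; byte 4 carries TotalAHSLength; bytes 5 to 7 carry DataSegmentLength; word 4 carries the ITT). The parser below is an illustrative sketch, not a complete PDU decoder, and covers only a subset of the Table 8-2 opcodes.

```python
import struct

OPCODES = {0x00: "NOP-out", 0x01: "SCSI command", 0x03: "Login request",
           0x20: "NOP-in", 0x21: "SCSI response", 0x23: "Login response"}

def parse_bhs(bhs: bytes):
    """Decode the fixed fields of a 48-byte iSCSI BHS."""
    assert len(bhs) == 48
    i_bit = bool(bhs[0] & 0x40)             # bit 1 of byte 0 (network bit order)
    opcode = bhs[0] & 0x3F                  # low 6 bits of byte 0
    f_bit = bool(bhs[1] & 0x80)             # bit 0 of byte 1
    total_ahs_len = bhs[4]                  # in 4-byte words
    data_seg_len = int.from_bytes(bhs[5:8], "big")
    itt = struct.unpack(">I", bhs[16:20])[0]
    return {"immediate": i_bit, "opcode": OPCODES.get(opcode, hex(opcode)),
            "final": f_bit, "ahs_words": total_ahs_len,
            "data_len": data_seg_len, "itt": itt}

# Build a synthetic BHS: an immediate SCSI command, final PDU, 512 data bytes.
bhs = bytearray(48)
bhs[0] = 0x40 | 0x01
bhs[1] = 0x80
bhs[5:8] = (512).to_bytes(3, "big")
bhs[16:20] = (0xABCD).to_bytes(4, "big")
print(parse_bhs(bytes(bhs)))
```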
The first login request of a new session is called the leading login request. The first TCP
connection of a new session is called the leading connection. Figure 8-5 illustrates the
iSCSI BHS of a Login Request PDU. Login parameters are encapsulated in the Data
segment (not shown) as text key-value pairs. All fields marked with "." are reserved.
Figure 8-5 iSCSI BHS of a Login Request PDU
[Figure: Word #0: reserved bit, I bit, Opcode, T bit, C bit, two reserved bits, CSG, NSG, Version-Max, and Version-Min; Word #1: TotalAHSLength and DataSegmentLength; Words #2–3: ISID and TSIH; Word #4: ITT; Word #5: CID and a reserved field; Word #6: CmdSN; Word #7: ExpStatSN or Reserved; Words #8–11: Reserved.]
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
C: This is 1 bit. It indicates whether the set of text keys in this PDU is complete.
C stands for continue. When the set of text keys is too large for a single PDU, the C
bit is set to 1, and another PDU follows containing more text keys. When all text keys
have been transmitted, the C bit is set to 0. When the C bit is set to 1, the T bit must
be set to 0.
NSG: This is 2 bits long. It indicates the next stage of the login procedure.
TSIH: This is 16 bits long. For a new session, the initiator uses the value 0. Upon
successful completion of the login procedure, the target provides the TSIH value to
the initiator in the final Login Response PDU. For a new connection within an existing
session, the value previously assigned to the session by the target must be provided by
the initiator in the first and all subsequent Login Request PDUs.
CSG: This is 2 bits long. It indicates the current stage of the login procedure. The
value 0 indicates the security parameter negotiation stage. The value 1 indicates the
operational parameter negotiation stage. The value 2 is reserved. The value 3 indicates
the full feature phase. These values are also used by the NSG field.
Version-Max: This is 8 bits long. It indicates the highest supported version of the
iSCSI protocol. Only one version of the iSCSI protocol is currently defined. The
current version is 0x00.
Each Login Request PDU or sequence of Login Request PDUs precipitates a Login
Response PDU or sequence of Login Response PDUs. Figure 8-6 illustrates the iSCSI BHS
of a Login Response PDU. Login parameters are encapsulated in the Data segment (not
shown) as text key-value pairs. All fields marked with "." are reserved.
Figure 8-6  iSCSI Login Response BHS Format
Word #0: Reserved (2 bits), Opcode | T bit, C bit, Reserved (2 bits), CSG, NSG | Version-Max | Version-Active
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: ISID (6 bytes) | TSIH (2 bytes)
Word #4: ITT
Word #5: Reserved
Word #6: StatSN
Word #7: ExpCmdSN
Word #8: MaxCmdSN
Word #9: Status-Class | Status-Detail | Reserved
Words #10-11: Reserved
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
C: This is 1 bit.
A value of 2 indicates that the target has detected an initiator error. The login procedure should be
aborted, and a new login phase should be initiated if the initiator still requires access
to the target. A value of 3 indicates that the target has experienced an internal error.
The initiator may retry the request without aborting the login procedure. All other
values are currently undefined but not explicitly reserved.
Following login, the initiator may send SCSI commands to the target. Figure 8-7 illustrates
the iSCSI BHS of a SCSI Command PDU. All fields marked with "." are reserved.
Figure 8-7  iSCSI SCSI Command BHS Format
Word #0: Reserved (1 bit), I bit, Opcode | F bit, R bit, W bit, Reserved (2 bits), ATTR | Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: LUN
Word #4: ITT
Word #5: Expected Data Transfer Length
Word #6: CmdSN
Word #7: ExpStatSN
Words #8-11: SCSI CDB
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
I: This is 1 bit.
Opcode: This is 6 bits long. It is set to 0x01.
F: This is 1 bit.
R: This is 1 bit. It indicates a read command when set to 1. For bidirectional
commands, both the R and W bits are set to 1.
ATTR: This is 3 bits long. It indicates the SCSI Task Attribute. A value of 0 indicates
an untagged task. A value of 1 indicates a simple task. A value of 2 indicates an
ordered task. A value of 3 indicates a Head Of Queue task. A value of 4 indicates an
Auto Contingent Allegiance (ACA) task. All other values are reserved. For more
information about SCSI Task Attributes, see the ANSI T10 SAM-3 specification.
CmdSN: This is 32 bits long. It contains the current value of the CmdSN counter.
The CmdSN counter is incremented by 1 immediately following transmission of a
new non-immediate command. Thus, the counter represents the number of the next
non-immediate command to be sent. The only exception is when an immediate
command is transmitted. For an immediate command, the CmdSN field contains the
current value of the CmdSN counter, but the counter is not incremented after
transmission of the immediate command. Thus, the next non-immediate command to
be transmitted carries the same CmdSN as the preceding immediate command. The
CmdSN counter is incremented by 1 immediately following transmission of the first
non-immediate command to follow an immediate command. A retransmitted SCSI
Command PDU carries the same CmdSN as the original PDU. Note that a
retransmitted SCSI Command PDU also carries the same ITT as the original PDU.
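The CmdSN rule above can be sketched as a small counter class; CmdSNCounter is an invented name for illustration, not part of any iSCSI stack.

```python
class CmdSNCounter:
    """Sketch of RFC 3720 CmdSN numbering: non-immediate commands consume
    a sequence number; immediate commands carry the current value without
    advancing it."""
    def __init__(self, start=1):
        self.cmdsn = start

    def next_pdu_cmdsn(self, immediate):
        value = self.cmdsn                 # every command carries the current counter value
        if not immediate:
            self.cmdsn = (self.cmdsn + 1) % (2 ** 32)  # 32-bit wraparound
        return value

c = CmdSNCounter(start=10)
assert c.next_pdu_cmdsn(immediate=False) == 10
assert c.next_pdu_cmdsn(immediate=True) == 11   # immediate: carries 11, counter unchanged
assert c.next_pdu_cmdsn(immediate=False) == 11  # next non-immediate repeats 11
assert c.next_pdu_cmdsn(immediate=False) == 12
```

Note how the non-immediate command following an immediate command carries the same CmdSN, exactly as described in the text.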
SCSI CDB: This is 128 bits (16 bytes) long. Multiple SCSI CDB formats are defined
by ANSI. SCSI CDBs are variable in length up to a maximum of 260 bytes, but the
most common CDB formats are 16 bytes long or less. Thus, the most common CDB
formats can fit into this field. When a CDB shorter than 16 bytes is sent, this field is
padded with zeros. When a CDB longer than 16 bytes is sent, the BHS must be
followed by an Extended CDB AHS containing the remainder of the CDB. All CDBs
longer than 16 bytes must end on a 4-byte word boundary, so the Extended CDB AHS
does not require padding.
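The CDB placement rules above can be sketched as follows; split_cdb is an invented helper name, and the byte strings are placeholder CDB contents.

```python
def split_cdb(cdb: bytes):
    """Sketch: place a CDB into the 16-byte BHS field, spilling any
    remainder into an Extended CDB AHS (per the rules described above)."""
    if len(cdb) <= 16:
        return cdb.ljust(16, b'\x00'), None   # short CDBs are zero-padded in the BHS
    # CDBs longer than 16 bytes end on a 4-byte word boundary,
    # so the Extended CDB AHS portion needs no padding.
    assert len(cdb) % 4 == 0
    return cdb[:16], cdb[16:]

bhs_field, ahs = split_cdb(bytes(10))   # 10-byte CDB: padded, no AHS needed
assert len(bhs_field) == 16 and ahs is None
bhs_field, ahs = split_cdb(bytes(32))   # 32-byte CDB: 16 in BHS + 16 in Extended CDB AHS
assert len(ahs) == 16
```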
The final result of each SCSI command is a SCSI status indicator delivered in a SCSI
Response PDU. The SCSI Response PDU also conveys iSCSI status for protocol
operations. Figure 8-8 illustrates the iSCSI BHS of a SCSI Response PDU. All fields
marked with "." are reserved.
Figure 8-8  iSCSI SCSI Response BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, Reserved (2 bits), o bit, u bit, O bit, U bit, Reserved (1 bit) | Response | Status
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: Reserved
Word #4: ITT
Word #5: SNACK Tag or Reserved
Word #6: StatSN
Word #7: ExpCmdSN
Word #8: MaxCmdSN
Word #9: ExpDataSN or Reserved
Word #10: Bidirectional Read Residual Count or Reserved
Word #11: Residual Count or Reserved
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
Reserved: This is 1 bit.
Response: This is 8 bits long and contains a code that indicates the presence or
absence of iSCSI protocol errors. The iSCSI response code is to the SCSI service
delivery subsystem what the SCSI status code is to the SAL. An iSCSI response code of
0x00 is known as Command Completed at Target. It indicates the target has completed
processing the command from the iSCSI perspective. This iSCSI response code is
roughly equivalent to a SCSI service response of LINKED COMMAND COMPLETE
or TASK COMPLETE. This iSCSI response code is also roughly equivalent to an
FCP_RSP_LEN_VALID bit set to zero. An iSCSI response code of 0x00 conveys
iSCSI success but does not imply SCSI success. An iSCSI response code of 0x01 is
known as Target Failure. It indicates failure to process the command. This iSCSI
response code is roughly equivalent to a SCSI service response of SERVICE
DELIVERY OR TARGET FAILURE. This iSCSI response code is also roughly
equivalent to an FCP_RSP_LEN_VALID bit set to 1. iSCSI response codes 0x80-0xFF
are vendor-specific. All other iSCSI response codes are reserved.
Note
The SCSI service response is passed to the SAL from the SCSI
service delivery subsystem within the initiator. The SCSI service
response indicates success or failure for delivery operations.
Whereas the iSCSI response code provides status between peer
layers in the OSI Reference Model, the SCSI service response
provides inter-layer status between provider and subscriber.
Status: This is 8 bits long. This field contains a status code that provides more
detail about the final status of the SCSI command and the state of the logical unit that
executed the command. This field is valid only if the Response field is set to 0x00.
Even if the Response field is set to 0x00, the target might not have processed the
command successfully. If the status code indicates failure to process the command
successfully, error information (called SCSI sense data) is included in the Data
segment. All iSCSI devices must support SCSI autosense. iSCSI does not define status
codes. Instead, iSCSI uses the status codes defined by the SAM. Currently, 10 SCSI
status codes are defined in the SAM-3 specification (see Table 8-3). All other values
are reserved.
Residual Count or Reserved: This is 32 bits long. When either the O bit or the
U bit is set to 1, this field indicates the residual read byte count for a read command,
the residual write byte count for a write command, or the residual write byte count for
a bidirectional command. When neither the O bit nor the U bit is set to 1, this field is
reserved. This field is valid only if the Response field is set to 0x00.
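The residual reporting above can be sketched as simple arithmetic, assuming the standard overflow/underflow semantics of the O and U bits (O set when more data was available than expected, U set when less); the function name is invented for illustration.

```python
def residual(expected_len, actual_len):
    """Sketch of Residual Count signaling: O bit for overflow, U bit for
    underflow, relative to the expected transfer length."""
    if actual_len > expected_len:
        return {'O': 1, 'U': 0, 'residual': actual_len - expected_len}
    if actual_len < expected_len:
        return {'O': 0, 'U': 1, 'residual': expected_len - actual_len}
    return {'O': 0, 'U': 0, 'residual': 0}   # neither bit set: field is reserved

assert residual(4096, 4096) == {'O': 0, 'U': 0, 'residual': 0}
assert residual(4096, 1024) == {'O': 0, 'U': 1, 'residual': 3072}  # short read: underflow
```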
Table 8-3 summarizes the SCSI status codes that are currently defined in the SAM-3
specification. All SCSI status codes excluded from Table 8-3 are reserved.
Table 8-3  SCSI Status Codes

Status Code  Status Name                 Associated Service Response
0x00         GOOD                        TASK COMPLETE
0x02         CHECK CONDITION             TASK COMPLETE
0x04         CONDITION MET               TASK COMPLETE
0x08         BUSY                        TASK COMPLETE
0x10         INTERMEDIATE                LINKED COMMAND COMPLETE
0x14         INTERMEDIATE-CONDITION MET  LINKED COMMAND COMPLETE
0x18         RESERVATION CONFLICT        TASK COMPLETE
0x28         TASK SET FULL               TASK COMPLETE
0x30         ACA ACTIVE                  TASK COMPLETE
0x40         TASK ABORTED                TASK COMPLETE
Outbound data is delivered in Data-Out PDUs. Figure 8-9 illustrates the iSCSI BHS of a
Data-Out PDU. All fields marked with "." are reserved. Each Data-Out PDU must include
a Data segment.
Figure 8-9  iSCSI Data-Out BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: LUN or Reserved
Word #4: ITT
Word #5: TTT or 0xFFFFFFFF
Word #6: Reserved
Word #7: ExpStatSN
Word #8: Reserved
Word #9: DataSN
Word #10: Buffer Offset
Word #11: Reserved
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
PDU. For Data-Out PDUs containing first burst data, this field contains the value
0xFFFFFFFF.
Buffer Offset: This is 32 bits long. This field indicates the position of the first byte
of data delivered by this PDU relative to the first byte of all the data transferred by the
associated SCSI command. This field enables the target to reassemble the data properly.
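The Buffer Offset mechanism above can be sketched as a toy reassembly routine: each payload is written at its stated offset, so arrival order does not matter. The function name and sample payloads are invented for illustration.

```python
def reassemble(pdus, total_len):
    """Sketch: a target reassembling Data-Out payloads using each PDU's
    Buffer Offset field, regardless of arrival order."""
    buf = bytearray(total_len)
    for offset, data in pdus:
        buf[offset:offset + len(data)] = data   # place payload at its stated offset
    return bytes(buf)

# Two 4-byte segments delivered out of order still land correctly:
out = reassemble([(4, b'WXYZ'), (0, b'ABCD')], total_len=8)
assert out == b'ABCDWXYZ'
```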
Inbound data is delivered in Data-In PDUs. Figure 8-10 illustrates the iSCSI BHS of a Data-In
PDU. All fields marked with "." are reserved. Each Data-In PDU must include a Data segment.
Figure 8-10  iSCSI Data-In BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, A bit, Reserved (3 bits), O bit, U bit, S bit | Reserved | Status or Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: LUN or Reserved
Word #4: ITT
Word #5: TTT or 0xFFFFFFFF
Word #6: StatSN or Reserved
Word #7: ExpCmdSN
Word #8: MaxCmdSN
Word #9: DataSN
Word #10: Buffer Offset
Word #11: Residual Count
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
A: This is 1 bit. The A stands for Acknowledge. The target sets this bit to 1 to
request positive, cumulative acknowledgment of all Data-In PDUs transmitted before
the current Data-In PDU. This bit may be used only if the session supports an
ErrorRecoveryLevel greater than 0 (see the iSCSI Login Parameters section of this
chapter).
S: This is 1 bit. The S stands for status. When this bit is set to 1, status is included in
the PDU.
O and U bits: These are used in the same manner as previously described. These
bits are present to support phase-collapse for read commands. For bidirectional
commands, the target must send status in a separate SCSI Response PDU. iSCSI (like
FCP) does not support status phase-collapse for write commands. These bits are valid
only when the S bit is set to 1.
Status or Reserved: This is 8 bits long. When the S field is set to 1, this field
contains the SCSI status code for the command. Phase-collapse is supported only
when the iSCSI response code is 0x00. Thus, a Response field is not required because
the response code is implied. Furthermore, phase-collapse is supported only when
the SCSI status is 0x00, 0x04, 0x10, or 0x14 (GOOD, CONDITION MET,
INTERMEDIATE, or INTERMEDIATE-CONDITION MET, respectively). When
the S field is set to zero, this field is reserved.
DataSegmentLength: This is 24 bits long.
LUN or Reserved: This is 64 bits (8 bytes) long and contains the LUN if the A
field is set to 1. The initiator copies the value of this field into a similar field in the
acknowledgment PDU. If the A field is set to 0, this field is reserved.
StatSN or Reserved: This is 32 bits long. It contains the StatSN if the S field is set
to 1. Otherwise, this field is reserved.
The target signals its readiness to receive write data via the R2T PDU. The target also uses
the R2T PDU to request retransmission of missing Data-Out PDUs. In both cases, the PDU
format is the same, but an R2T PDU sent to request retransmission is called a Recovery
R2T PDU. Figure 8-11 illustrates the iSCSI BHS of an R2T PDU. All fields marked with "."
are reserved.
Figure 8-11  iSCSI R2T BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: LUN
Word #4: ITT
Word #5: TTT
Word #6: StatSN
Word #7: ExpCmdSN
Word #8: MaxCmdSN
Word #9: R2TSN
Word #10: Buffer Offset
Word #11: Desired Data Transfer Length
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
StatSN: This is 32 bits long. It contains the StatSN that will be assigned to this
command upon completion. This is the same as the ExpStatSN from the initiator's
perspective.
Buffer Offset: This is 32 bits long. It indicates the position of the first byte of data
requested by this PDU relative to the first byte of all the data transferred by the SCSI
command.
Desired Data Transfer Length: This is 32 bits long. This field indicates how much
data should be transferred in response to this R2T PDU. This field is expressed in
bytes. The value of this field cannot be 0 and cannot exceed the negotiated value of
MaxBurstLength (see the iSCSI Login Parameters section of this chapter).
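One way a target could carve a write into R2T solicitations under the constraints above can be sketched as follows; the function name and the specific byte counts are illustrative assumptions, not prescribed by the standard.

```python
def solicit(write_len, first_burst_done, max_burst_length):
    """Sketch: (Buffer Offset, Desired Data Transfer Length) pairs a target
    might issue via successive R2T PDUs. Each requested length is nonzero
    and never exceeds MaxBurstLength."""
    requests, offset = [], first_burst_done   # data already received unsolicited
    while offset < write_len:
        length = min(max_burst_length, write_len - offset)
        requests.append((offset, length))
        offset += length
    return requests

# A 600,000-byte write after a 65,536-byte first burst, MaxBurstLength 262,144:
assert solicit(write_len=600_000, first_burst_done=65_536,
               max_burst_length=262_144) == [
    (65_536, 262_144), (327_680, 262_144), (589_824, 10_176)]
```

A real target need not solicit the remaining data contiguously or one burst at a time; this simply shows the MaxBurstLength ceiling in action.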
iSCSI supports PDU retransmission and PDU delivery acknowledgment on demand via the
SNACK Request PDU. Each SNACK Request PDU specifies a contiguous set of missing
single-type PDUs. Each set is called a run. Figure 8-12 illustrates the iSCSI BHS of a
SNACK Request PDU. All fields marked with "." are reserved.
Figure 8-12  iSCSI SNACK Request BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, Reserved (3 bits), Type | Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: LUN or Reserved
Word #4: ITT or 0xFFFFFFFF
Word #5: TTT or SNACK Tag or 0xFFFFFFFF
Word #6: Reserved
Word #7: ExpStatSN
Words #8-9: Reserved
Word #10: BegRun or ExpDataSN
Word #11: RunLength
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
beginning at DataSN 0. If some Data-In PDUs have been acknowledged, the first
retransmitted Data-In PDU is assigned the first unacknowledged DataSN.
RunLength: This is 32 bits long. For Data/R2T and Status PDUs, this field
specifies the number of PDUs to retransmit. This field may be set to 0 to indicate that
all PDUs with a sequence number equal to or greater than BegRun must be
retransmitted. For DataACK and R-Data PDUs, this field must be set to 0.
Table 8-4 summarizes the SNACK Request PDU types that are currently defined in
RFC 3720. All PDU types excluded from Table 8-4 are reserved.
Table 8-4  SNACK Request PDU Types

Type  Name      Function
0     Data/R2T  Requests retransmission of one or more Data-In or R2T PDUs
1     Status    Requests retransmission of one or more Response PDUs
2     DataACK   Positively acknowledges cumulative receipt of Data-In PDUs
3     R-Data    Requests retransmission of all unacknowledged Data-In PDUs for a command
iSCSI initiators manage SCSI and iSCSI tasks via the TMF Request PDU. Figure 8-13
illustrates the iSCSI BHS of a TMF Request PDU. All fields marked with "." are reserved.
Figure 8-13  iSCSI TMF Request BHS Format
Word #0: Reserved (1 bit), I bit, Opcode | F bit, Function | Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: LUN or Reserved
Word #4: ITT
Word #5: RTT or 0xFFFFFFFF
Word #6: CmdSN
Word #7: ExpStatSN
Word #8: RefCmdSN or Reserved
Word #9: ExpDataSN or Reserved
Words #10-11: Reserved
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
ITT: This is 32 bits long. It contains the ITT assigned to this TMF command. This
field does not contain the ITT of the task upon which the TMF command acts.
I: This is 1 bit.
Opcode: This is 6 bits long. It is set to 0x02.
F bit: This is always set to 1.
Function: This is 7 bits long. It contains the TMF Request code of the function
to be performed. iSCSI currently supports six of the TMFs defined in the SAM-2
specification and one TMF defined in RFC 3720 (see Table 8-5). All other TMF
Request codes are reserved.
TotalAHSLength: This is 8 bits long. It is always set to 0.
DataSegmentLength: This is 24 bits long. It is always set to 0.
LUN or Reserved: This is 64 bits (8 bytes) long. It contains a LUN if the TMF
is ABORT TASK, ABORT TASK SET, CLEAR ACA, CLEAR TASK SET, or
LOGICAL UNIT RESET. Otherwise, this field is reserved.
CmdSN: This is 32 bits long. It contains the CmdSN of the TMF command. TMF
commands are numbered the same way SCSI read and write commands are
numbered. This field does not contain the CmdSN of the task upon which the TMF
command acts.
Table 8-5 summarizes the TMF Request codes that are currently supported by iSCSI. All
TMF Request codes excluded from Table 8-5 are reserved.
Table 8-5  TMF Request Codes

Code  TMF Name            Description
1     ABORT TASK          Aborts the task identified by the Referenced Task Tag
2     ABORT TASK SET      Aborts all tasks issued by this initiator on the LUN
3     CLEAR ACA           Clears the ACA condition on the LUN
4     CLEAR TASK SET      Aborts all tasks in the task set on the LUN
5     LOGICAL UNIT RESET  Resets the logical unit
6     TARGET WARM RESET   Resets the target port
7     TARGET COLD RESET   Resets the target port and terminates all its TCP connections
8     TASK REASSIGN       Reassigns connection allegiance for a task to this connection
Each TMF Request PDU precipitates one TMF Response PDU. Figure 8-14 illustrates the
iSCSI BHS of a TMF Response PDU. All fields marked with "." are reserved.
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
Reserved: This is the I bit redefined as Reserved.
Figure 8-14  iSCSI TMF Response BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, Reserved | Response | Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: Reserved
Word #4: ITT
Word #5: Reserved
Word #6: StatSN
Word #7: ExpCmdSN
Word #8: MaxCmdSN
Words #9-11: Reserved
Table 8-6 summarizes the TMF Response codes that are currently supported by iSCSI. All
TMF Response codes excluded from Table 8-6 are reserved.
Table 8-6  TMF Response Codes

Code  Description
0     Function Complete
4     Task Allegiance Reassignment Not Supported
6     Function Authorization Failed
255   Function Rejected
The Reject PDU signals an error condition and rejects the PDU that caused the error. The
Data segment (not shown in Figure 8-15) must contain the header of the PDU that caused
the error. If a Reject PDU causes a task to terminate, a SCSI Response PDU with status
CHECK CONDITION must be sent. Figure 8-15 illustrates the iSCSI BHS of a Reject
PDU. All fields marked with "." are reserved.
Figure 8-15  iSCSI Reject BHS Format
Word #0: Reserved (2 bits), Opcode | F bit, Reserved | Reason | Reserved
Word #1: TotalAHSLength | DataSegmentLength
Words #2-3: Reserved
Word #4: 0xFFFFFFFF
Word #5: Reserved
Word #6: StatSN
Word #7: ExpCmdSN
Word #8: MaxCmdSN
Word #9: DataSN/R2TSN or Reserved
Words #10-11: Reserved
A brief description of each field follows. The description of each field is abbreviated unless
a field is used in a PDU-specific manner:
Reserved: This is 1 bit.
Table 8-7 summarizes the Reject Reason codes that are currently supported by iSCSI. All
Reject Reason codes excluded from Table 8-7 are reserved.
Table 8-7  Reject Reason Codes

Reason Code  Reason Name
0x02         Data-Digest Error
0x03         SNACK Reject
0x04         Protocol Error
0x05         Command Not Supported
0x06         Immediate Command Reject
0x07         Task In Progress
0x08         Invalid DataACK
0x09         Invalid PDU Field
0x0a         Long Operation Reject
0x0b         Negotiation Reset
0x0c         Waiting For Logout
The preceding discussion of iSCSI PDU formats is simplified for the sake of clarity.
Comprehensive exploration of all the iSCSI PDUs and their variations is outside the scope
of this book. For more information, readers are encouraged to consult IETF RFC 3720 and
the ANSI T10 SAM-2, SAM-3, SPC-2, and SPC-3 specifications.
session. Some text keys require a response (negotiation), and others do not (declaration).
Currently, RFC 3720 defines 22 operational text keys. RFC 3720 also defines a protocol
extension mechanism that enables the use of public and private text keys that are not defined
in RFC 3720. This section describes the standard operational text keys and the extension
mechanism. The format of all text key-value pairs is:
<key name>=<list of values>
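On the wire, RFC 3720 carries each key=value pair in the Data segment terminated by a NUL byte. A minimal encode/decode sketch follows; the function names and the sample iSCSI qualified name are invented for illustration.

```python
def encode_text_keys(pairs):
    """Sketch of the Data-segment encoding for login/text parameters:
    each key=value pair is NUL-terminated."""
    return b''.join(f'{k}={v}'.encode('ascii') + b'\x00' for k, v in pairs)

def decode_text_keys(data):
    """Inverse sketch: split on NUL bytes and rebuild the key-value pairs."""
    return dict(item.split('=', 1)
                for item in data.decode('ascii').split('\x00') if item)

blob = encode_text_keys([('InitiatorName', 'iqn.1998-01.com.example:host1'),
                         ('SessionType', 'Normal')])
assert decode_text_keys(blob)['SessionType'] == 'Normal'
```

A real implementation would also honor the C bit described earlier, splitting a large set of keys across multiple PDUs.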
The SessionType key declares the type of iSCSI session. Only initiators send this key. This
key must be sent only during the Login Phase on the leading connection. The valid values
are Normal and Discovery. The default value is Normal. The scope is session-wide.
The HeaderDigest and DataDigest keys negotiate the use of the Header-Digest segment and
the Data-Digest segment, respectively. Initiators and targets send these keys. These keys
may be sent only during the Login Phase. Values that must be supported include CRC32C
and None. Other public and private algorithms may be supported. The default value is None
for both keys. The chosen digest must be used in every PDU sent during the Full Feature
Phase. The scope is connection-specic.
As discussed in Chapter 3, the SendTargets key is used by initiators to discover targets
during a Discovery session. This key may also be sent by initiators during a Normal session
to discover changed or additional paths to a known target. Sending this key during a Normal
session is fruitful only if the target configuration changes after the Login Phase. This is
because, during a Discovery session, a target network entity must return all target names,
sockets, and TPGTs for all targets that the requesting initiator is permitted to access.
Additionally, path changes occurring during the Login Phase of a Normal session are
handled via redirection. This key may be sent only during the Full Feature Phase. The scope
is session-wide.
The TargetName key declares the iSCSI device name of one or more target devices within
the responding network entity. This key may be sent by targets only in response to a
SendTargets command. This key may be sent by initiators only during the Login Phase of
a Normal session, and the key must be included in the leading Login Request PDU for each
connection. The scope is session-wide.
The TargetAddress key declares the network addresses, TCP ports, and TPGTs of the target
device to the initiator device. An address may be given in the form of a DNS host name, an IPv4
address, or an IPv6 address. The TCP port may be omitted if the default port of 3260 is used.
Only targets send this key. This key is usually sent in response to a SendTargets command,
but it may be sent in a Login Response PDU to redirect an initiator. Therefore, this key may
be sent during any phase. The scope is session-wide.
The InitiatorName key declares the iSCSI device name of the initiator device within the
initiating network entity. This key identies the initiator device to the target device so that
access controls can be implemented. Only initiators send this key. This key may be sent
only during the Login Phase, and the key must be included in the leading Login Request
PDU for each connection. The scope is session-wide.
284
The InitiatorAlias key declares the optional human-friendly name of the initiator device to
the target for display in relevant user interfaces. Only initiators send this key. This key is
usually sent in a Login Request PDU for a Normal session, but it may be sent during the
Full Feature Phase as well. The scope is session-wide.
The TargetAlias key declares the optional human-friendly name of the target device to the
initiator for display in relevant user interfaces. Only targets send this key. This key usually
is sent in a Login Response PDU for a Normal session, but it may be sent during the Full
Feature Phase as well. The scope is session-wide.
The TargetPortalGroupTag key declares the TPGT of the target port to the initiator port.
Only targets send this key. This key must be sent in the rst Login Response PDU of a
Normal session unless the rst Login Response PDU redirects the initiator to another
TargetAddress. The range of valid values is 0 to 65,535. The scope is session-wide.
The ImmediateData and InitialR2T keys negotiate support for immediate data and
unsolicited data, respectively. Immediate data may not be sent unless both devices support
immediate data. Unsolicited data may not be sent unless both devices support unsolicited
data. Initiators and targets send these keys. These keys may be sent only during Normal
sessions and must be sent during the Login Phase on the leading connection. The default
settings support immediate data but not unsolicited data. The scope is session-wide for
both keys.
The MaxOutstandingR2T key negotiates the maximum number of R2T PDUs that may be
outstanding simultaneously for a single task. This key does not include the implicit R2T
PDU associated with unsolicited data. Each R2T PDU is considered outstanding until the
last Data-Out PDU is transferred (initiator's perspective) or received (target's perspective).
A sequence timeout can also terminate the lifespan of an R2T PDU. Initiators and targets
send this key. This key may be sent only during Normal sessions and must be sent during
the Login Phase on the leading connection. The range of valid values is 1 to 65,535. The
default value is one. The scope is session-wide.
The MaxRecvDataSegmentLength key declares the maximum amount of data that a
receiver (initiator or target) can receive in a single iSCSI PDU. Initiators and targets send
this key. This key may be sent during any phase of any session type and is usually sent
during the Login Phase on the leading connection. This key is expressed in bytes. The range
of valid values is 512 to 16,777,215. The default value is 8,192. The scope is connection-specific.
The MaxBurstLength key negotiates the maximum amount of data that a receiver (initiator
or target) can receive in a single iSCSI sequence. This value may exceed the value of
MaxRecvDataSegmentLength, which means that more than one PDU may be sent in
response to an R2T Request PDU. This contrasts with the FC model. For write commands, this
key applies only to solicited data. Initiators and targets send this key. This key may be sent
only during Normal sessions and must be sent during the Login Phase on the leading
connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215.
The default value is 262,144. The scope is session-wide.
The FirstBurstLength key negotiates the maximum amount of data that a target can receive
in a single iSCSI sequence of unsolicited data (including immediate data). Thus, the value
of this key minus the amount of immediate data received with the SCSI command PDU
yields the amount of unsolicited data that the target can receive in the same sequence. If
neither immediate data nor unsolicited data is supported within the session, this key is
invalid. The value of this key cannot exceed the target's MaxBurstLength. Initiators and
targets send this key. This key may be sent only during Normal sessions and must be sent
during the Login Phase on the leading connection. This key is expressed in bytes. The range
of valid values is 512 to 16,777,215. The default value is 65,536. The scope is session-wide.
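The FirstBurstLength arithmetic described above is simple subtraction; a sketch follows, with an invented function name and example values.

```python
def max_unsolicited_data_out(first_burst_length, immediate_data_len):
    """Sketch of the arithmetic described above: unsolicited Data-Out bytes
    allowed in the first burst equal FirstBurstLength minus the immediate
    data already carried by the SCSI Command PDU."""
    return first_burst_length - immediate_data_len

# With the default FirstBurstLength (65,536) and 8 KiB of immediate data:
assert max_unsolicited_data_out(65_536, 8_192) == 57_344
```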
The MaxConnections key negotiates the maximum number of TCP connections supported
by a session. Initiators and targets send this key. Discovery sessions are restricted to one
TCP connection, so this key may be sent only during Normal sessions and must be sent
during the Login Phase on the leading connection. The range of valid values is 1 to 65,535.
The default value is 1. The scope is session-wide.
The DefaultTime2Wait key negotiates the amount of time that must pass before attempting
to log out a failed connection. Task reassignment may not occur until after the failed
connection is logged out. Initiators and targets send this key. This key may be sent only
during Normal sessions and must be sent during the Login Phase on the leading connection.
This key is expressed in seconds. The range of valid values is 0 to 3600. The default value
is 2. A value of 0 indicates that logout may be attempted immediately upon detection of a
failed connection. The scope is session-wide.
The DefaultTime2Retain key negotiates the amount of time that task state information must
be retained for active tasks after DefaultTime2Wait expires. When a connection fails, this
key determines how much time is available to complete task reassignment. If the failed
connection is the last (or only) connection in a session, this key also represents the session
timeout value. Initiators and targets send this key. This key may be sent only during Normal
sessions and must be sent during the Login Phase on the leading connection. This key is
expressed in seconds. The range of valid values is 0 to 3600. The default value is 20. A value
of 0 indicates that task state information is discarded immediately upon detection of a failed
connection. The scope is session-wide.
The DataPDUInOrder key negotiates in-order transmission of data PDUs within a sequence.
Because TCP guarantees in-order delivery, the only way for PDUs of a given sequence to arrive
out of order is to be transmitted out of order. Initiators and targets send this key. This key
may be sent only during Normal sessions and must be sent during the Login Phase on
the leading connection. The default value requires in-order transmission. The scope is
session-wide.
The DataSequenceInOrder key negotiates in-order transmission of data PDU sequences
within a command. For sessions that support in-order transmission of sequences and
retransmission of missing data PDUs (ErrorRecoveryLevel greater than zero), the
MaxOutstandingR2T key must be set to 1. This is because requests for retransmission may
be sent only for the lowest outstanding R2TSN, and all PDUs already received for a higher
When a PDU is dropped due to digest error, the iSCSI protocol must be able to detect
the beginning of the PDU that follows the dropped PDU. Because iSCSI PDUs are
variable in length, iSCSI recipients depend on the BHS to determine the total length
of a PDU. The BHS of the dropped PDU cannot always be trusted (for example, if
dropped due to CRC failure), so an alternate method of determining the total length
of the dropped PDU is required. Additionally, when a TCP packet containing an iSCSI
header is dropped and retransmitted, the received TCP packets of the affected iSCSI
PDU and the iSCSI PDUs that follow cannot be optimally buffered. An alternate
method of determining the total length of the affected PDU resolves this issue.
To avoid SCSI task abortion and re-issuance in the presence of digest errors, the iSCSI
protocol must support PDU retransmission. An iSCSI device may retransmit dropped
PDUs (optimal) or abort each task affected by a digest error (suboptimal).
Additionally, problems can occur in a routed IP network that cause a TCP connection or
an iSCSI session to fail. Currently, this does not occur frequently in iSCSI environments
because most iSCSI deployments are single-subnet environments. However, iSCSI is
designed in such a way that it supports operation in routed IP networks. Specifically,
iSCSI supports connection and session recovery to prevent IP network problems from
affecting the SAL. This enables iSCSI users to realize the full potential of TCP/IP.
RFC 3720 defines several delivery mechanisms to meet all these requirements.
RFC 3720 mandates the minimum recovery class that may be used for each type of error.
RFC 3720 does not provide a comprehensive list of errors, but does provide representative
examples. An iSCSI implementation may use a higher recovery class than the minimum
required for a given error. Both initiator and target are allowed to escalate the recovery
class. The number of tasks that are potentially affected increases with each higher class. So,
use of the lowest possible class is encouraged. The two lowest classes may be used in only
288
the Full Feature Phase of a session. Table 8-8 lists some example scenarios for each
recovery class.
Table 8-8 Recovery Classes

Class Name                    Potential Impact  ErrorRecoveryLevel  Implementation Complexity
Recovery Within A Command     Low               1                   Medium
Recovery Within A Connection  Medium-Low        1                   Medium
Recovery Of A Connection      Medium-High       2                   High
Recovery Of A Session         High              0                   Low
At first glance, the mapping of levels to classes may seem counter-intuitive. The mapping
is easier to understand after examining the implementation complexity of each recovery
class. The goal of iSCSI recovery is to avoid affecting the SAL. However, an iSCSI
implementation may choose not to recover from errors. In this case, recovery is left to
the SCSI application client. Such is the case with ErrorRecoveryLevel 0, which simply
terminates the failed session and creates a new session. The SCSI application client is
responsible for reissuing all affected tasks. Therefore, ErrorRecoveryLevel 0 is the simplest
to implement. Recovery within a command and recovery within a connection both require
iSCSI to retransmit one or more PDUs. Therefore, ErrorRecoveryLevel 1 is more complex
to implement. Recovery of a connection requires iSCSI to maintain state for one or more
tasks so that task reassignment may occur. Recovery of a connection also requires iSCSI to
retransmit one or more PDUs on the new connection. Therefore, ErrorRecoveryLevel 2 is
the most complex to implement. Only ErrorRecoveryLevel 0 must be supported. Support
for ErrorRecoveryLevel 1 and higher is encouraged but not required.
PDU Retransmission
iSCSI guarantees in-order data delivery to the SAL. When PDUs arrive out of order due to
retransmission, the iSCSI protocol does not reorder PDUs per se. Upon receipt of all TCP
packets composing an iSCSI PDU, iSCSI places the ULP data in an application buffer. The
position of the data within the application buffer is determined by the Buffer Offset field
in the BHS of the Data-In/Data-Out PDU. When an iSCSI digest error or a dropped or
delayed TCP packet causes a processing delay for a given iSCSI PDU, the Buffer Offset
field in the BHS of other iSCSI data PDUs that are received error-free enables continued
processing without delay regardless of PDU transmission order. Thus, iSCSI PDUs do not
need to be reordered before processing. Of course, the use of a message synchronization
scheme is required under certain circumstances for PDU processing to continue in the
presence of one or more dropped or delayed PDUs. Otherwise, the BHS of subsequent
PDUs cannot be read. Assuming this requirement is met, PDUs can be processed in
any order.
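The buffer-placement behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not RFC 3720 code; the function and variable names are invented for the example.

```python
# A minimal sketch of Buffer Offset placement: each Data-In PDU payload
# is copied directly to its final position in a preallocated application
# buffer, so arrival order does not matter.

def place_data_in(app_buffer: bytearray, buffer_offset: int, data: bytes) -> None:
    """Copy one Data-In PDU payload straight to its final position.

    The Buffer Offset field in the BHS states where the payload belongs.
    """
    app_buffer[buffer_offset:buffer_offset + len(data)] = data

# Three 4-byte data segments arriving out of order still assemble correctly.
buf = bytearray(12)
place_data_in(buf, 8, b"IJKL")  # last segment arrives first
place_data_in(buf, 0, b"ABCD")
place_data_in(buf, 4, b"EFGH")
print(bytes(buf))  # b'ABCDEFGHIJKL'
```

Because each payload carries its own offset, no reorder queue is needed; only a dropped PDU's missing bytes delay command completion.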
Retransmission occurs as the result of a digest error, protocol error, or timeout. Despite
differences in detection techniques, PDU retransmission is handled in a similar manner for
data digest errors, protocol errors, and timeouts. However, header digest errors require
special handling. When a header digest error occurs, and the connection does not support
a PDU boundary detection scheme, the connection must be terminated. If the session
NOTE
The order of command delivery does not necessarily translate to the order of command
execution. The order of command execution can be changed via TMF request as specified
in the SCSI standards.
FCP IU Formats
In FCP parlance, a protocol data unit is called an information unit. The FCP-3 specification
defines five types of IU: FCP_CMND, FCP_DATA, FCP_XFER_RDY, FCP_RSP, and
FCP_CONF. This section describes all five IUs in detail.
This section also describes the details of the link services most commonly used by FCP. As
discussed in Chapter 5, the FC specifications define many link services that may be used by
end nodes to interact with the FC-SAN and to manage communication with other end
nodes. Three types of link service are defined: basic, extended, and FC-4. Each basic link
service (BLS) command is composed of a single frame that is transmitted as part of an
existing Exchange. Despite this, BLS commands are ignored with regard to the ability to
mix information categories within a single sequence as negotiated during PLOGI. The
response to a BLS command is also a single frame transmitted as part of an existing
Exchange. BLSs are defined in the FC-FS specification series. The BLS most commonly
used by FCP is abort sequence (ABTS). As stated in Chapter 5, an ELS may be composed
of one or more frames per direction transmitted as a single sequence per direction within a
new Exchange. Most ELSs are defined in the FC-LS specification. The ELSs most
commonly used by FCP include PRLI and read exchange concise (REC). An FC-4 link
service may be composed of one or more frames and must be transmitted as a new
Exchange. The framework for all FC-4 link services is defined in the FC-LS specification,
but the specific functionality of each FC-4 link service is defined in an FC-4 protocol
specification. The FCP-3 specification defines only one FC-4 link service called sequence
retransmission request (SRR). This section describes the ABTS, PRLI, REC, and SRR
link services in detail.
FCP IUs are encapsulated within the Data field of the FC frame. An FCP IU that exceeds
the maximum size of the Data field is sent as a multi-frame Sequence. Each FCP IU is
transmitted as a single Sequence. Additionally, each Sequence composes a single FCP IU.
This one-to-one mapping contrasts with the iSCSI model. Fields within the FC Header
indicate the type of FCP IU contained in the Data field. Table 8-10 summarizes the values
of the relevant FC Header fields.
Table 8-10 FC Header Field Values for FCP IUs

FCP IU        R_CTL Routing  R_CTL Information Category  Type  F_CTL Relative Offset Present  DF_CTL
FCP_CMND      0000b          0110b                       0x08  0b                             0x00 or 0x40
FCP_DATA      0000b          0001b                       0x08  1b                             0x00 or 0x40
FCP_XFER_RDY  0000b          0101b                       0x08  0b                             0x00 or 0x40
FCP_RSP       0000b          0111b                       0x08  0b                             0x00 or 0x40
FCP_CONF      0000b          0011b                       0x08  0b                             0x00 or 0x40
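The header-to-IU mapping summarized above lends itself to a simple lookup. The sketch below is illustrative (the dictionary and function names are invented); it keys on the two R_CTL nibbles and checks the Type field for the FCP value 0x08.

```python
# Classify an FCP IU from FC Header fields, per the values in Table 8-10.
# Keys are (R_CTL Routing, R_CTL Information Category) nibbles.

FCP_IU_BY_HEADER = {
    (0b0000, 0b0110): "FCP_CMND",
    (0b0000, 0b0001): "FCP_DATA",
    (0b0000, 0b0101): "FCP_XFER_RDY",
    (0b0000, 0b0111): "FCP_RSP",
    (0b0000, 0b0011): "FCP_CONF",
}

def classify_fcp_iu(r_ctl: int, frame_type: int) -> str:
    """Return the FCP IU name for an FC frame, or raise if it is not FCP."""
    if frame_type != 0x08:  # Type 0x08 identifies the FC-4 protocol as FCP
        raise ValueError("not an FCP frame")
    routing, category = r_ctl >> 4, r_ctl & 0x0F
    return FCP_IU_BY_HEADER[(routing, category)]

# Example: R_CTL 0x06 is routing 0000b with information category 0110b.
print(classify_fcp_iu(0x06, 0x08))  # FCP_CMND
```

Note that a receiver needs no IU-level header to do this: unlike iSCSI, the FC Header alone identifies the IU type.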
Only one of the members of an FCP image pair may transmit FCP IUs at any point in time.
The sequence initiative (SI) bit in the F_CTL field in the FC Header controls which FCP
device may transmit. To transmit, an FCP device must hold the sequence initiative. If an
FCP device has more than one FCP IU to transmit, it may choose to hold the sequence
initiative after transmitting the last frame of a Sequence. Doing so allows the FCP device to
transmit another FCP IU. This is known as Sequence streaming. When the FCP device has
no more FCP IUs to transmit, it transfers the sequence initiative to the other FCP device.
During bidirectional commands, the sequence initiative may be transferred many times at
intervals determined by the participating FCP devices.
When an FC frame encapsulates an FCP_CMND IU, the Parameter field in the FC Header
can contain a task identifier to assist command retry. If command retry is not supported, the
Parameter field is set to 0. Command retry is discussed in the FCP Delivery Mechanisms
section of this chapter. Unlike iSCSI, FCP uses a single command IU (FCP_CMND) for
SCSI commands and TMF requests. Figure 8-16 illustrates the format of the FCP_CMND
IU. Note that FCP IUs are word-oriented like FC frames, but the FCP specification series
illustrates FCP IU formats using a byte-oriented format. This book also illustrates FCP IU
formats using a byte-oriented format to maintain consistency with the FCP specification
series.
Figure 8-16 FCP_CMND IU Format

Byte #0-7      FCP_LUN
Byte #8        Reserved | Priority | Task Attribute
Byte #9        Task Management Flags
Byte #10       Additional FCP_CDB Length | RDDATA | WRDATA
Byte #11       Reserved
Byte #12-27    FCP_CDB
Byte #28-n     Additional FCP_CDB
Byte #n+1-n+4  FCP_DL
Byte #n+5-n+8  FCP_BIDIRECTIONAL_READ_DL
Reserved: This is 1 bit.
Priority: This is 4 bits long. It determines the order of execution for tasks in a task
manager's queue. This field is valid only for SIMPLE tasks.
Task Management Flags: This is 8 bits long. This field is used to request a TMF.
If any bit in this field is set to 1, a TMF is requested. When a TMF is requested, the
FCP_CMND IU does not encapsulate a SCSI command. Thus, the Task Attribute,
Additional FCP_CDB Length, RDDATA, WRDATA, FCP_CDB, Additional
FCP_CDB, FCP_DL, and FCP_BIDIRECTIONAL_READ_DL fields are not used. No more
than one bit in the Task Management Flags field may be set to 1 in a given FCP_CMND
IU. Bit 1 represents the Abort Task Set TMF. Bit 2 represents the Clear Task Set TMF.
Bit 4 represents the Logical Unit Reset TMF. Bit 6 represents the Clear ACA TMF. All
other bits are reserved. Note that the Abort Task TMF is not supported via this field.
Instead, the ABTS BLS is used. Despite its name, the ABTS BLS can be used to abort
a single sequence or an entire Exchange. When the FCP_CMND IU encapsulates a
SCSI command, the Task Management Flags field must be set to 0.
Additional FCP_CDB Length: This is 6 bits long. This field indicates the length
of the Additional FCP_CDB field expressed in 4-byte words. When the CDB length
is 16 bytes or less, this field is set to 0. When a TMF is requested, this field is set to 0.
FCP_CDB: This is 128 bits (16 bytes) long. It contains a SCSI CDB if the Task
Management Flags field is set to 0. When a CDB shorter than 16 bytes is sent, this
field is padded. The value of the padding is not defined by the FCP-3 specification.
Presumably, zeros should be used for padding. When a CDB longer than 16 bytes
is sent, this field contains the first 16 bytes of the CDB.
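The fixed-length portion of the FCP_CMND IU can be sketched with a simple encoder. This is an illustrative sketch only: it follows the byte positions shown in Figure 8-16 as reconstructed here, omits the Additional FCP_CDB and bidirectional read length, and the function name is invented; consult the FCP-3 specification for the authoritative layout.

```python
import struct

def build_fcp_cmnd(lun: bytes, cdb: bytes, fcp_dl: int,
                   wrdata: bool = False, rddata: bool = False,
                   task_attribute: int = 0) -> bytes:
    """Encode a simplified 32-byte FCP_CMND IU (SCSI command, no TMF)."""
    assert len(lun) == 8 and len(cdb) <= 16
    # Additional FCP_CDB Length = 0; only RDDATA/WRDATA bits are set.
    control = (int(rddata) << 1) | int(wrdata)
    return struct.pack(">8sBBBB16sI",
                       lun,
                       task_attribute & 0x07,   # Reserved/Priority left zero
                       0,                       # Task Management Flags: none
                       control,
                       0,                       # reserved byte
                       cdb.ljust(16, b"\x00"),  # pad short CDBs with zeros
                       fcp_dl)

# A hypothetical READ(10) for 512 bytes of data-in.
iu = build_fcp_cmnd(b"\x00" * 8, b"\x28" + b"\x00" * 9, 512, rddata=True)
print(len(iu))  # 32
```

The Task Management Flags byte stays 0 here because, as noted above, an FCP_CMND IU carries either a SCSI command or a TMF request, never both.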
Unlike iSCSI, FCP uses a single IU (FCP_DATA) for both data-out and data-in operations.
The only difference in IU format for data-out versus data-in operations is the manner in
which the SI bit in the FC Header is handled. For data-out operations, the sequence
initiative is transferred from initiator to target after transmission of each FCP_DATA IU.
This enables the target to transmit an FCP_XFER_RDY IU to continue the operation or an
FCP_RSP IU to complete the operation. For data-in operations, the sequence initiative is
held by the target after transmission of each FCP_DATA IU. This enables the target to
transmit another FCP_DATA IU to continue the operation or an FCP_RSP IU to complete
the operation. For all FCP_DATA IUs, the Parameter field in the FC Header contains a
relative offset value. The FCP_DATA IU does not have a defined format within the Data
field of the FC frame. The receiver uses only the FC Header to identify an FCP_DATA IU,
and ULP data is directly encapsulated in the Data field of the FC frame. An FCP_DATA IU
may not be sent with a payload of 0 bytes.
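The hold-versus-transfer decision on the SI bit can be sketched as follows. The sketch assumes the FC-FS F_CTL bit numbering, in which Sequence Initiative is bit 16 (0 = hold, 1 = transfer, evaluated on the last frame of a Sequence); the function name is invented.

```python
# SI-bit handling on the closing frame of a Sequence: holding the
# initiative permits Sequence streaming, transferring it hands the
# next transmission opportunity to the other FCP device.

SEQUENCE_INITIATIVE = 1 << 16  # F_CTL bit 16 per FC-FS

def closing_frame_f_ctl(f_ctl: int, hold_initiative: bool) -> int:
    """Return F_CTL for the last frame of a Sequence."""
    if hold_initiative:
        return f_ctl & ~SEQUENCE_INITIATIVE  # SI = 0: hold (streaming)
    return f_ctl | SEQUENCE_INITIATIVE       # SI = 1: transfer

# Data-in: a target streaming another FCP_DATA IU holds the initiative.
print(bin(closing_frame_f_ctl(0, hold_initiative=True)))
```

Under this model, a data-out initiator would call the function with `hold_initiative=False` after each FCP_DATA IU, while a data-in target would hold until it sends the FCP_RSP IU.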
NOTE
The FC-FS-2 specification mandates that the Information Category sub-field of the R_CTL
field in the FC Header be set to solicited data even when the initiator sends unsolicited
first burst data.
When an FC frame encapsulates an FCP_XFER_RDY IU, the Parameter field in the FC
Header is set to 0. In contrast to the iSCSI model, FCP implements a one-to-one relationship
between FCP_XFER_RDY IUs and FCP_DATA IUs. The FC-FS-2 specification categorizes
the FCP_XFER_RDY IU as a Data Descriptor. The FC-FS-2 specification also defines the
general format that all Data Descriptors must use. Figure 8-17 illustrates the general format
of Data Descriptors. Figure 8-18 illustrates the format of the FCP_XFER_RDY IU.
Figure 8-17 Data Descriptor Format

Byte #0-3   Data Offset
Byte #4-7   Data Length
Byte #8-11  Reserved
Byte #12-n  Descriptor-specific payload

Figure 8-18 FCP_XFER_RDY IU Format

Byte #0-3   FCP_DATA_RO
Byte #4-7   FCP_BURST_LEN
Byte #8-11  Reserved
FCP_DATA_RO: This is 32 bits long. This field indicates the position of the first
byte of data requested by this IU relative to the first byte of all the data transferred by
the SCSI command.
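The offset/length pairing of FCP_DATA_RO and FCP_BURST_LEN can be illustrated with a small sketch. In practice a target solicits one burst at a time, waiting for each FCP_DATA IU before sending the next FCP_XFER_RDY; the generator below simply enumerates the pairs a target would issue over the life of a write command (names are illustrative).

```python
# Enumerate the (FCP_DATA_RO, FCP_BURST_LEN) pairs a target might use
# to solicit a write, one FCP_XFER_RDY (and one FCP_DATA IU) per burst.

def xfer_rdy_sequence(total_len: int, max_burst: int):
    """Yield (offset, burst_length) pairs covering the whole transfer."""
    offset = 0
    while offset < total_len:
        burst = min(max_burst, total_len - offset)
        yield (offset, burst)
        offset += burst

print(list(xfer_rdy_sequence(10240, 4096)))
# [(0, 4096), (4096, 4096), (8192, 2048)]
```

The one-to-one FCP_XFER_RDY/FCP_DATA relationship noted above means each yielded pair corresponds to exactly one data IU, unlike iSCSI's R2T model where solicitation windows may overlap.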
The final result of each SCSI command is a SCSI status indicator delivered in an FCP_RSP
IU. The FCP_RSP IU also conveys FCP status for protocol operations. When an FC frame
encapsulates an FCP_RSP IU, the Parameter field in the FC Header is set to 0. Unlike
iSCSI, FCP uses a single response IU (FCP_RSP) for SCSI commands and TMF requests.
Figure 8-19 illustrates the format of the FCP_RSP IU.
Figure 8-19 FCP_RSP IU Format

Byte #0-7      Reserved
Byte #8-9      Retry Delay Timer
Byte #10       FCP_BIDI_RSP | FCP_BIDI_READ_RESID_UNDER | FCP_BIDI_READ_RESID_OVER |
               FCP_CONF_REQ | FCP_RESID_UNDER | FCP_RESID_OVER |
               FCP_SNS_LEN_VALID | FCP_RSP_LEN_VALID
Byte #11       SCSI Status Code
Byte #12-15    FCP_RESID
Byte #16-19    FCP_SNS_LEN
Byte #20-23    FCP_RSP_LEN
Byte #24-x     FCP_RSP_INFO
Byte #x+1-y    FCP_SNS_INFO
Byte #y+1-y+4  FCP_BIDIRECTIONAL_READ_RESID
Retry Delay Timer: This is 16 bits long. This field contains one of the retry delay
timer codes defined in the SAM-4 specification. These codes provide additional
information to the initiator regarding why a command failed and how long to wait
before retrying the command.
SCSI service delivery subsystem what the SCSI status code is to the SAL. When this bit
is set to 1, the FCP_RSP_LEN and FCP_RSP_INFO fields are valid, and the SCSI
Status Code field is ignored. This is roughly equivalent to a SCSI service response of
SERVICE DELIVERY OR TARGET FAILURE. This is also roughly equivalent to an
iSCSI response code of 0x01 (target failure). Setting this bit to 0 indicates that the
target has completed processing the command from the FCP perspective. When this
bit is set to 0, the FCP_RSP_LEN and FCP_RSP_INFO fields are ignored, and the
SCSI Status Code field is valid. This is roughly equivalent to a SCSI service response
of LINKED COMMAND COMPLETE or TASK COMPLETE. This is also roughly
equivalent to an iSCSI response code of 0x00 (command completed at target). Setting
this bit to 0 conveys FCP success but does not imply SCSI success.
SCSI Status Code: This is 8 bits long. This field contains a status code that provides
more detail about the final status of the SCSI command and the state of the logical unit
that executed the command. This field is valid only if the FCP_RSP_LEN_VALID bit
is set to 0. Even if the FCP_RSP_LEN_VALID bit is set to 0, the target might not have
successfully processed the command. If the status code indicates failure to successfully
process the command, SCSI sense data is included in the FCP_SNS_INFO field. All
FCP devices must support SCSI autosense. Like iSCSI, FCP uses the status codes
defined by the SAM (see Table 8-3).
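The interplay of FCP_RSP_LEN_VALID, the SCSI Status Code, and autosense described above can be summarized in a short decision function. This is a sketch with invented names that models only the bits discussed here, not a full FCP_RSP parser.

```python
# Interpret an FCP_RSP: FCP-level failure takes precedence over the SCSI
# status, and a non-GOOD SCSI status may carry autosense data.

def interpret_fcp_rsp(fcp_rsp_len_valid: bool, scsi_status: int,
                      fcp_sns_len_valid: bool, sense: bytes) -> str:
    if fcp_rsp_len_valid:
        # FCP_RSP_INFO is valid and the SCSI Status Code is ignored.
        return "FCP failure: consult FCP_RSP_INFO"
    if scsi_status == 0x00:  # GOOD, per the SAM status codes
        return "command completed successfully"
    if fcp_sns_len_valid and sense:
        return f"SCSI status 0x{scsi_status:02x}: sense data present"
    return f"SCSI status 0x{scsi_status:02x}"

print(interpret_fcp_rsp(False, 0x00, False, b""))
```

Note the asymmetry the text points out: FCP success (FCP_RSP_LEN_VALID = 0) says nothing about SCSI success, which must still be read from the status code.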
FCP_RSP_INFO field format:

Byte #0-2  Reserved
Byte #3    FCP_RSP_CODE
Byte #4-7  Reserved

Table 8-11 FCP_RSP_CODE Values

Code  Description
0x00  TMF Complete
0x01  FCP_DATA length different than FCP_BURST_LEN
0x02  FCP_CMND fields invalid
0x03  FCP_DATA RO mismatch with FCP_XFER_RDY DATA_RO
0x04  TMF Rejected
0x05  TMF Failed
0x09  TMF Incorrect LUN
The FCP_CONF IU is sent by an initiator only when requested via the FCP_CONF_REQ
bit in the FCP_RSP IU. The FCP_CONF IU confirms that the initiator received the referenced
FCP_RSP IU. The target associates an FCP_CONF IU with the appropriate FCP_RSP IU
via the FQXID. For all FCP_CONF IUs, the Parameter field in the FC Header is set to 0.
The FCP_CONF IU does not have a defined format within the Data field of the FC frame.
Additionally, the FCP_CONF IU has no payload. The target uses only the FC Header to
determine that a given FC frame is actually an FCP_CONF IU. The FCP_CONF IU is not
supported for TMF requests. Similarly, the FCP_CONF IU is not supported for
Table 8-12 FC Header Field Values for PRLI and Responses

ELS     R_CTL Routing  R_CTL Information Category  Type  F_CTL Relative Offset Present  DF_CTL
PRLI    0010b          0010b                       0x01  0b                             0x00 or 0x40
LS_ACC  0010b          0011b                       0x01  0b                             0x00 or 0x40
LS_RJT  0010b          0011b                       0x01  0b                             0x00 or 0x40
PRLI Request / LS_ACC general format:

Word #0    LS Command Code | Page Length | Payload Length
Word #1-n  Service Parameter Pages
LS Command Code: This is 8 bits long. For a PRLI Request, this field is set to
0x20. For an LS_ACC, this field is set to 0x02.
Page Length: This is 8 bits long. This field indicates the length of each Service
Parameter Page expressed in bytes. This field is set to 0x10, so each Service Parameter
Page is 16 bytes (4 words) long.
Payload Length: This is 16 bits long. This field indicates the total length of the
PRLI Request or LS_ACC expressed in bytes. Valid values range from 20 to 65,532.
Service Parameter Pages: This is variable in length. This field may contain one
or more Service Parameter Pages, but no more than one Service Parameter Page may
be sent per image pair. The FC-LS specification defines the general format of the
PRLI Service Parameter Page and the LS_ACC Service Parameter Page.
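The length relationships just described can be checked arithmetically: a 4-byte first word (LS Command Code, Page Length, Payload Length) followed by 16-byte (0x10) Service Parameter Pages. The helper below is an illustrative sketch, not part of any FC specification.

```python
# Derive the number of Service Parameter Pages from a PRLI Payload Length.

PAGE_LENGTH = 0x10   # bytes per Service Parameter Page
HEADER_BYTES = 4     # LS Command Code + Page Length + Payload Length word

def prli_page_count(payload_length: int) -> int:
    body = payload_length - HEADER_BYTES
    if payload_length < 20 or body % PAGE_LENGTH:
        raise ValueError("invalid PRLI Payload Length")
    return body // PAGE_LENGTH

print(prli_page_count(20))  # 1  (the minimum payload carries one page)
```

This also explains the lower bound of 20 quoted above: the smallest valid PRLI carries exactly one 16-byte page after the 4-byte header word.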
Like iSCSI, FCP negotiates some service parameters, while others are merely declared.
The FCP-3 specification defines the specific format of the PRLI Service Parameter
Page. Figure 8-22 illustrates the FCP-specific format of the PRLI Service Parameter Page.
Figure 8-22 PRLI Service Parameter Page Format for FCP

Word #0  Type Code | Reserved | ESTABLISH IMAGE PAIR and related flag bits
Word #1  Originator Process_Associator
Word #2  Responder Process_Associator
Word #3  Reserved | Service Parameters: TASK RETRY IDENTIFICATION REQUESTED,
         RETRY, CONFIRMED COMPLETION ALLOWED, DATA OVERLAY ALLOWED,
         INITIATOR FUNCTION, TARGET FUNCTION, OBSOLETE,
         WRITE FCP_XFER_RDY DISABLED, READ FCP_XFER_RDY DISABLED
Type Code: This is 8 bits long. This field is set to 0x08 and identifies the FC-4
protocol as FCP.
ESTABLISH IMAGE PAIR (EIP): This is 1 bit. When this bit is set to 1, the
initiator requests both the exchange of service parameters and the establishment of an
image pair. When this bit is set to 0, the initiator requests only the exchange of service
parameters.
RETRY: This is 1 bit. When this bit is set to 1, the initiator requests support for
retransmission of sequences that experience errors. If the target agrees, the SRR link
service is used. When this bit is set to 0, the initiator does not support retransmission
of sequences, and the SRR link service is not used.
In the absence of errors, the target responds to PRLI with LS_ACC. The FCP-3 specification
defines the specific format of the LS_ACC Service Parameter Page. Figure 8-23 illustrates
the FCP-specific format of the LS_ACC Service Parameter Page.
Figure 8-23 LS_ACC Service Parameter Page Format for FCP

Word #0  Type Code | Reserved | IMAGE PAIR ESTABLISHED and Accept Response Code bits
Word #1  Originator Process_Associator
Word #2  Responder Process_Associator
Word #3  Reserved | Service Parameter Response: TASK RETRY IDENTIFICATION REQUESTED,
         RETRY, CONFIRMED COMPLETION ALLOWED, DATA OVERLAY ALLOWED,
         INITIATOR FUNCTION, TARGET FUNCTION, OBSOLETE,
         WRITE FCP_XFER_RDY DISABLED, READ FCP_XFER_RDY DISABLED
Type Code: This is 8 bits long. This field is set to 0x08. It identifies the FC-4
protocol as FCP.
IMAGE PAIR ESTABLISHED (IPE): This is 1 bit. This bit is valid only if the
EIP bit is set to 1 in the PRLI Request. When this bit is set to 1, the target confirms
the establishment of an image pair. When this bit is set to 0, the target is only
exchanging service parameters.
Reserved: This is 1 bit.
Accept Response Code (ARC): This is 4 bits long. This field contains a code that
confirms that the image pair is established, or provides diagnostic information when
the image pair is not established. Table 8-13 summarizes the PRLI Accept Response
Codes defined in the FC-LS specification. All response codes excluded from Table 8-13
are reserved.
Table 8-13 PRLI Accept Response Codes

Code   Description
0001b  The PRLI Request was executed.
0010b  The target has no resources available. The PRLI Request may be retried.
0011b  Initialization is not complete. The PRLI Request may be retried.
0100b
0101b  The target has been preconfigured such that it cannot establish the requested
       image pair. The PRLI Request may not be retried.
0110b  The PRLI Request was executed, but some service parameters were not set as
       requested.
0111b  The target cannot process a multi-page PRLI Request. The PRLI Request may
       be retried as multiple single-page PRLI Requests.
1000b  Invalid service parameters were specified.
RETRY: This is 1 bit. When this bit is set to 1, the target confirms support for
retransmission of dropped frames. When this bit is set to 0, the target does not
support retransmission of dropped frames.
INITIATOR FUNCTION: This is 1 bit. Some devices set this bit to 1, but most
set this bit to 0.
TARGET FUNCTION: This is 1 bit. This bit is usually set to 1. If this bit is set
to 0, an image pair cannot be established. This bit must be set to 1 if the IPE bit is
set to 1.
If the PRLI is not valid, the target responds with LS_RJT. A single LS_RJT format is
defined in the FC-LS specification for all ELS commands. Figure 8-24 illustrates the format
of LS_RJT.
Figure 8-24 LS_RJT ELS Format

Word #0  LS Command Code | Unused
Word #1  Reserved | Reason Code | Reason Explanation | Vendor Specific
Table 8-14 LS_RJT Reason Codes

Code  Description
0x01  Invalid LS Command Code
0x03  Logical Error
0x05  Logical Busy
0x07  Protocol Error
0x09  Unable To Perform Command Request
0x0B  Command Not Supported
0x0E  Command Already In Progress
0xFF  Vendor Specific

Vendor Specific: This is 8 bits long. When the Reason Code field is set to 0xFF,
this field provides a vendor-specific reason code. When the Reason Code field is set
to any value other than 0xFF, this field is ignored.

Table 8-15 LS_RJT Reason Explanations Relevant to PRLI

Code  Description
0x00  No Additional Explanation
0x1E  PLOGI Required
0x2C  Request Not Supported
The REC ELS enables an initiator to ascertain the state of a given Exchange in a target.
Support for REC is optional. When an initiator detects an error (for example, a timeout),
the initiator may use REC to determine what, if any, recovery steps are appropriate. Possible
responses to REC include LS_ACC and LS_RJT. REC and its associated responses are each
encapsulated in the Data field of an FC frame. The fields in the FC Header indicate the
payload is a REC, LS_ACC, or LS_RJT. Table 8-16 summarizes the values of the relevant
FC Header fields. If command retry is supported, the Parameter field in the FC Header
contains the Task Retry Identifier of the Exchange referenced in the REC. If command retry
is not supported, the Parameter field in the FC Header is set to 0. The formats of REC and
its associated responses are defined in the FC-LS specification. Figure 8-25 illustrates the
format of REC.
Table 8-16 FC Header Field Values for REC and Responses

ELS     R_CTL Routing  R_CTL Information Category  Type  F_CTL Relative Offset Present  DF_CTL
REC     0010b          0010b                       0x01  0b                             0x00 or 0x40
LS_ACC  0010b          0011b                       0x01  0b                             0x00 or 0x40
LS_RJT  0010b          0011b                       0x01  0b                             0x00 or 0x40
Figure 8-25 REC ELS Format

Word #0  LS Command Code | Unused
Word #1  Reserved | Exchange Originator S_ID
Word #2  OX_ID | RX_ID
OX_ID: This is 16 bits long. This field contains the OX_ID of the Exchange about
which state information is sought. The REC recipient uses this field in combination
with the RX_ID field to identify the proper Exchange.
RX_ID: This is 16 bits long. This field contains the RX_ID of the Exchange about
which state information is sought. The REC recipient uses this field in combination
with the OX_ID field to identify the proper Exchange. If the value of this field is
0xFFFF (unassigned), the REC recipient uses the Exchange Originator S_ID field in
combination with the OX_ID field to identify the proper Exchange. If the value of this
field is anything other than 0xFFFF, and the REC recipient has state information for
more than one active Exchange with the specified OX_ID-RX_ID combination, the
REC recipient uses the Exchange Originator S_ID field in combination with the
OX_ID and RX_ID fields to identify the proper Exchange.
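The lookup rule just described can be sketched directly. The exchange-record structure and function name below are illustrative; only the disambiguation logic mirrors the text.

```python
# REC recipient's Exchange lookup: fall back to the Exchange Originator
# S_ID when RX_ID is unassigned, or when OX_ID-RX_ID alone is ambiguous.

UNASSIGNED = 0xFFFF

def find_exchange(exchanges, s_id, ox_id, rx_id):
    if rx_id == UNASSIGNED:
        matches = [e for e in exchanges
                   if e["s_id"] == s_id and e["ox_id"] == ox_id]
    else:
        matches = [e for e in exchanges
                   if e["ox_id"] == ox_id and e["rx_id"] == rx_id]
        if len(matches) > 1:  # same OX_ID-RX_ID from different originators
            matches = [e for e in matches if e["s_id"] == s_id]
    if len(matches) != 1:
        raise LookupError("no unique Exchange for this REC")
    return matches[0]

exchanges = [
    {"s_id": 0x010203, "ox_id": 0x1000, "rx_id": 0x2000},
    {"s_id": 0x040506, "ox_id": 0x1000, "rx_id": 0x2000},
]
print(find_exchange(exchanges, 0x040506, 0x1000, UNASSIGNED)["s_id"])
```

The ambiguous case arises because OX_ID and RX_ID are assigned per port pair, so two different originators can legitimately present the same OX_ID-RX_ID combination to one target.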
In the absence of errors, the target responds to REC with LS_ACC. Figure 8-26 illustrates
the format of LS_ACC sent in response to REC.
Figure 8-26 LS_ACC ELS Format for REC

Word #0  LS Command Code | Unused
Word #1  OX_ID | RX_ID
Word #2  Reserved | Originator Address Identifier
Word #3  Reserved | Responder Address Identifier
Word #4  FC4VALUE
Word #5  E_STAT
RX_ID: This is 16 bits long. This field contains the RX_ID of the Exchange about
which state information is sought. If the RX_ID specified in the REC request is
0xFFFF, the REC recipient may set this field to the value previously assigned. This
situation occurs when a target receives a new command (thus creating state
information for the Exchange), but the initial reply is dropped, so the initiator does not
know the assigned value of RX_ID.
Originator Address Identifier: This is 24 bits long. This field contains the FCID of
the FC device that originated the Exchange about which state information is sought.
Responder Address Identifier: This is 24 bits long. This field contains the FCID
of the FC device that responded to the Exchange about which state information is
sought. The value of this field might be different than the value of the D_ID field in
the FC Header of the REC request if the REC recipient contains more than one
FC port.
If the REC is not valid, or if the target does not support REC, the target responds with
LS_RJT. Figure 8-24 illustrates the format of LS_RJT. Table 8-14 summarizes the LS_RJT
reason codes. Table 8-17 summarizes the reason explanations defined by the FC-LS
specification that are relevant to REC. All reason explanations excluded from Table 8-17
are either irrelevant to REC or reserved.
Table 8-17 LS_RJT Reason Explanations Relevant to REC

Code  Description
0x00  No Additional Explanation
0x15  Invalid Originator S_ID
0x17  Invalid OX_ID-RX_ID Combination
0x1E  PLOGI Required
The ABTS BLS enables an initiator or target to abort a sequence or an entire Exchange. As
with all BLS commands, support for ABTS is mandatory. ABTS may be transmitted even
if the number of active sequences equals the maximum number of concurrent sequences
negotiated during PLOGI. Likewise, ABTS may be transmitted by an FCP device even if it
does not hold the sequence initiative. This exception to the sequence initiative rule must be
allowed because the sequence initiative is transferred with the last frame of each sequence,
but the receiving device neither acknowledges receipt of frames nor notifies the sender of
dropped frames (see Chapter 5). Thus, the transmitting device typically detects errors via
timeout after transferring the sequence initiative. As a result, ABTS can be sent only by the
device that sent the sequence being aborted. Each ABTS is transmitted within the Exchange
upon which the ABTS acts. Moreover, the SEQ_ID in the FC Header of the ABTS must
match the SEQ_ID of the most recent sequence transmitted by the device that sends the
ABTS. In other words, a device may abort only its most recently transmitted sequence or
Exchange. The sequence initiative is always transferred with ABTS so the receiving device
can respond. The responding device may hold the sequence initiative after responding, or
transfer the sequence initiative with the response. Possible responses to an ABTS include
the basic accept (BA_ACC) BLS and the basic reject (BA_RJT) BLS. The action taken
upon receipt of an ABTS is governed by the Exchange Error Policy in effect for the
Exchange impacted by the ABTS. ABTS and its associated responses are each encapsulated
in the Data field of an FC frame. The fields in the FC Header indicate that the payload is a
BLS command or a BLS response. Table 8-18 summarizes the values of the relevant FC
Header fields. Bit 0 in the Parameter field in the FC Header conveys whether the ABTS acts
upon a single sequence (set to 1) or an entire Exchange (set to 0). The ABTS does not have
a defined format within the Data field of the FC frame. Additionally, the ABTS has no
payload. The ABTS recipient uses only the FC Header to determine that a given FC frame
is actually an ABTS. The formats of the BA_ACC and BA_RJT are defined in the FC-FS-2
specification. A unique BA_ACC format is defined for each BLS command. Figure 8-27
illustrates the format of the BA_ACC associated with ABTS.
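The Parameter-field convention for ABTS scope is compact enough to show directly. The helper below is an illustrative sketch of the bit-0 test described above.

```python
# Bit 0 of the FC Header Parameter field selects the scope of an ABTS:
# 1 aborts a single sequence, 0 aborts the entire Exchange.

def abts_scope(parameter: int) -> str:
    return "single Sequence" if parameter & 1 else "entire Exchange"

print(abts_scope(0))  # entire Exchange
print(abts_scope(1))  # single Sequence
```

A recipient applies this test only after identifying the frame as an ABTS from the FC Header, since the ABTS itself carries no payload.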
Table 8-18 FC Header Field Values for ABTS and Responses

BLS     R_CTL Routing  R_CTL Information Category  Type  F_CTL Relative Offset Present  DF_CTL
ABTS    1000b          0001b                       0x00  0b                             0x00 or 0x40
BA_ACC  1000b          0100b                       0x00  0b                             0x00 or 0x40
BA_RJT  1000b          0101b                       0x00  0b                             0x00 or 0x40
Figure 8-27 BA_ACC Format for ABTS

Word #0  SEQ_ID Validity | SEQ_ID of Last Deliverable Sequence | Reserved
Word #1  OX_ID | RX_ID
Word #2  Low SEQ_CNT | High SEQ_CNT
SEQ_ID Validity: This is 8 bits long. When this field is set to 0x80, the SEQ_ID
Of Last Deliverable Sequence field contains a valid SEQ_ID. When this field is set to
0x00, the SEQ_ID Of Last Deliverable Sequence field is ignored.
RX_ID: This is 16 bits long. This field contains the RX_ID of the Exchange upon
which the ABTS acts.
OX_ID: This is 16 bits long. This field contains the OX_ID of the Exchange upon
which the ABTS acts.
A common BA_RJT format is defined for all BLS commands. Figure 8-28 illustrates the
format of the BA_RJT.
Figure 8-28 BA_RJT Format

Word #0  Reserved | Reason Code | Reason Explanation | Vendor Specific
Table 8-19 BA_RJT Reason Codes

Code  Description
0x01  Invalid Command Code
0x03  Logical Error
0x05  Logical Busy
0x07  Protocol Error
0x09  Unable To Perform Command Request
0xFF  Vendor Specific
Table 8-20 BA_RJT Reason Explanations

Code  Description
0x00  No Additional Explanation
0x03  Invalid OX_ID-RX_ID Combination
0x05  Sequence Aborted, No Sequence Information Provided
Vendor Specific: This is 8 bits long. When the Reason Code field is set to 0xFF, this
field provides a vendor-specific reason code. When the Reason Code field is set to
any value other than 0xFF, this field is ignored.
The SRR link service enables an initiator to request retransmission of data during a read
command, request that the target request retransmission of data during a write command,
or request retransmission of the FCP_RSP IU during a read or write command. Only
initiators may send an SRR. Support for SRR is optional. If SRR is supported by both
initiator and target, REC and task retry identification also must be supported by both
devices. The Parameter field in the FC Header contains the Task Retry Identifier of the
Exchange referenced by the SRR payload. SRR may not be used during bidirectional
commands. This limitation does not apply to iSCSI, which supports retransmission during
bidirectional commands. The sequence initiative of the Exchange referenced by the SRR is
always transferred to the target upon receipt of an SRR. This allows the target to transmit
the requested IU. The sequence initiative of the SRR Exchange is also transferred to the
target. Possible responses to SRR include the SRR Accept and the FCP Reject (FCP_RJT).
The SRR and its associated responses are each encapsulated in the Data field of an FC
frame. The fields in the FC Header indicate that the payload is an FC-4 link service request
or an FC-4 link service response. Table 8-21 summarizes the values of the relevant FC
Header fields. The formats of SRR and its associated responses are defined in the FCP-3
specification. Figure 8-29 illustrates the format of SRR.
Table 8-21 FC Header Field Values for SRR and Responses

FC-4 LS     R_CTL Routing  R_CTL Information Category  Type  F_CTL Relative Offset Present  DF_CTL
SRR         0011b          0010b                       0x08  0b                             0x00 or 0x40
SRR Accept  0011b          0011b                       0x08  0b                             0x00 or 0x40
FCP_RJT     0011b          0011b                       0x08  0b                             0x00 or 0x40
Figure 8-29 SRR Format

Word #0  OX_ID | RX_ID
Word #1  Relative Offset
Word #2  R_CTL for IU | Reserved
Word #3  Reserved
OX_ID: This is 16 bits long. This field contains the OX_ID of the Exchange for
which retransmission is requested.
RX_ID: This is 16 bits long. This field contains the RX_ID of the Exchange for
which retransmission is requested.
R_CTL for IU: This is 8 bits long. This field specifies the type of IU being
requested. This field contains one of the values defined in the FC-FS-2 specification
for the R_CTL field in the FC Header. Valid values are 0x01 (solicited data), 0x05
(data descriptor), and 0x07 (command status).
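The three permitted "R_CTL for IU" values map onto the three retransmission requests an SRR can express. The lookup below is an illustrative sketch; the dictionary and function names are invented.

```python
# Map the "R_CTL for IU" value in an SRR to the IU the target must resend.

SRR_REQUESTABLE = {
    0x01: "solicited data (FCP_DATA)",
    0x05: "data descriptor (FCP_XFER_RDY)",
    0x07: "command status (FCP_RSP)",
}

def srr_requested_iu(r_ctl_for_iu: int) -> str:
    try:
        return SRR_REQUESTABLE[r_ctl_for_iu]
    except KeyError:
        raise ValueError("SRR cannot request this IU type") from None

print(srr_requested_iu(0x07))
```

A value of 0x01 or 0x05 drives data recovery for reads and writes respectively, while 0x07 recovers a lost FCP_RSP without reissuing the command.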
In the absence of errors, the target responds to SRR with SRR Accept. Figure 8-30
illustrates the format of SRR Accept.
Upon receipt of an FCP_RJT, the initiator must abort the entire Exchange referenced by the
SRR. Figure 8-31 illustrates the format of FCP_RJT.
Figure 8-31 FCP_RJT Format

Word #0  Reserved
Word #1  Reserved | Reason Code | Reason Explanation | Vendor Specific
Vendor Specific: This is 8 bits long. When the Reason Code field is set to 0xFF,
this field provides a vendor-specific reason code. When the Reason Code field is set
to any value other than 0xFF, this field is ignored.
Reason Code: This is 8 bits long. It indicates why the SRR link service
was rejected. Table 8-22 summarizes the reason codes defined by the FCP-3
specification. All reason codes excluded from Table 8-22 are reserved.
The preceding discussion of FCP IU and FC link service formats is simplified for the sake
of clarity. Comprehensive exploration of FCP IU and FC link service usage is outside the
scope of this book. For more information, readers are encouraged to consult the ANSI T10
SAM-3, SAM-4, and FCP-3 specifications, and the ANSI T11 FC-FS-2 and FC-LS
specifications.
Table 8-22 SRR Reject Reason Codes
Reason Code  Description
0x01
0x03         Logical Error
0x05         Logical Busy
0x07         Protocol Error
0x09
0x0B
0xFF

Table 8-23 SRR Reject Reason Explanations
Reason Explanation  Description
0x00                No Additional Explanation
0x03
0x2A
FCP_XFER_RDY DISABLED bit in the PRLI Request or associated LS_ACC ELS is set
to 0. Data overlay is negotiated via the enable modify data pointers (EMDP) bit. The
EMDP bit overrides the value of the DATA OVERLAY ALLOWED bit in the PRLI
Request.
The Protocol Specific Logical Unit mode page as implemented by FCP is called the FC Logical Unit Control mode page. The FC Logical Unit Control mode page is used to configure certain logical unit service parameters. In particular, FCP implements the enable precise delivery checking (EPDC) bit. The EPDC bit enables or disables in-order command delivery (called precise delivery in FCP parlance). Unlike iSCSI, FCP does not mandate precise delivery. Instead, FCP allows precise delivery to be negotiated with each logical unit.
The Protocol Specific Port mode page as implemented by FCP is called the FC Port Control mode page. The FC Port Control mode page is used to configure certain target port service parameters. For fabric-attached devices, FCP implements the sequence initiative resource recovery time-out value (RR_TOVSEQ_INIT). This timer determines the minimum amount of time a target port must wait for a response after transferring the sequence initiative. If this timer expires before a response is received, the target port may begin error recovery procedures. For more information about FCP's use of the SCSI MODE commands, readers are encouraged to consult the ANSI T10 SPC-3 and FCP-3 specifications.
Error Detection
Like iSCSI, FCP mandates that both initiators and targets are responsible for detecting errors. Errors are detected via timeouts, FC primitives, FC Header fields, FCP IU fields, the ABTS BLS, and the REC ELS. Initiators and targets must be able to detect the link errors, protocol errors, and timeouts defined in the FC specifications. Additionally, initiators and targets must be able to detect protocol errors and timeouts using FCP IU fields. Lastly, initiators and targets must be able to accept and process the ABTS BLS as a means of error detection. The only error detection mechanism that is optional is the REC ELS.
target discards all frames associated with the sequence identified by the ABTS. The target then responds with a BA_ACC. The Last_Sequence bit in the F_CTL field in the FC Header must be set to 0. Upon receiving the BA_ACC, the initiator sends an SRR. Upon receiving the SRR, the target responds with an SRR Accept. The target then retransmits the requested data. In the case of a write command, the SRR requests an FCP_XFER_RDY IU that requests the missing data. The SEQ_CNT field in the FC Header of the first frame of retransmitted data is set to 0 even if continuously increasing SEQ_CNT is in effect for the connection. Note that the retransmitted data can be (and often is) transferred from a relative offset that was already used in the Exchange being recovered. This is allowed even if the DATA OVERLAY ALLOWED bit is set to 0 during PRLI.
If the initiator does not receive an LS_ACC in response to the REC within two times the R_A_TOV, the REC Exchange is aborted via Recovery Abort. The REC may be retried in a new Exchange, or the initiator may choose to send an ABTS without retrying REC. If the initiator does not receive a BA_ACC within two times the R_A_TOV, the initiator may send another ABTS. If the second BA_ACC is not received within two times the R_A_TOV, the initiator must explicitly log out the target. The initiator must then perform PLOGI and PRLI to continue SCSI operations with the target, and all previously active SCSI tasks must be reissued by the SCSI application client. If a BA_ACC is received, but the initiator does not receive an SRR Accept within two times the R_A_TOV, both the SRR Exchange and the Exchange being recovered must be aborted via Recovery Abort.
NOTE
The order of command delivery does not necessarily translate to the order of command execution. The order of command execution can be changed via TMF request as specified in the SCSI standards.
The CRN field is set to 0 in every FCP_CMND IU that conveys a TMF request, regardless of whether the LUN supports precise delivery. Likewise, the CRN field is set to 0 in every FCP_CMND IU that conveys a SCSI command to LUNs that do not support precise delivery. By contrast, the CRN field contains a value between 1 and 255 in each FCP_CMND IU that conveys a SCSI command to LUNs that support precise delivery. The CRN counter for a given LUN is incremented by 1 for each command issued to that LUN. However, certain SCSI commands (such as inquiry and test unit ready) do not require precise delivery and may be assigned a value of 0 for the CRN field even when issued to LUNs that support precise delivery. When the CRN counter reaches 255, the value wraps back to 1 for the next command. The limited number range of the CRN counter exposes FCP devices to possible ambiguity among commands in environments that support sequence-level error recovery. State information about each Exchange and sequence must be maintained for specific periods of time to support retransmission. Eventually, the initiator will issue a new command using an OX_ID that was previously used within the session. If the target is still maintaining state for the old command associated with the reused OX_ID, the target could confuse the two commands. To mitigate this risk, task retry identification must be implemented. The Task Retry Identifier effectively extends the CRN counter by providing an additional identifier for each command. The Task Retry Identifier must be consistent in all IUs and link services related to a single command (FCP_CMND, REC, SRR).
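The per-LUN CRN rules above (0 for TMF requests and for LUNs without precise delivery, 1 through 255 with wraparound otherwise) can be sketched as follows. The class name and method signatures are illustrative, not taken from FCP-3:

```python
class CrnCounter:
    """Per-LUN command reference number counter: wraps 1..255; 0 means untracked."""

    def __init__(self) -> None:
        self._next = 1

    def next_crn(self, precise_delivery: bool, is_tmf: bool = False) -> int:
        # TMF requests, and commands to LUNs that do not support precise
        # delivery, always carry CRN 0.
        if is_tmf or not precise_delivery:
            return 0
        crn = self._next
        # After 255, the counter wraps back to 1 (never to 0).
        self._next = 1 if crn == 255 else crn + 1
        return crn
```

An initiator would keep one such counter per LUN; commands such as inquiry and test unit ready could also be issued with `precise_delivery=False` to carry CRN 0.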
NOTE
In high I/O environments, the limited CRN counter might seem to pose a risk to performance. However, each LUN uses a separate CRN counter, which effectively mitigates any risk to performance by increasing the total CRN address space. In the unlikely event that an initiator needs to issue more than 255 ordered commands to a single LUN before the first command completes, the CRN counter limit would prevent issuance of the 256th command. However, when so many commands are simultaneously outstanding, it indicates that the LUN cannot execute commands as quickly as the application requires. Thus, the LUN is the performance bottleneck, not the CRN counter.
target, and all previously active SCSI tasks must be reissued by the SCSI application client. If a session fails, FCP notifies the SAL of the I_T nexus loss, and all state related to the session is cleared. The initiator must perform PRLI to continue SCSI operations with the target, and all previously active SCSI tasks must be reissued by the SCSI application client.
The preceding discussion of FCP delivery mechanisms is simplified for the sake of clarity. For more information about FCP delivery mechanisms, readers are encouraged to consult Chapter 5 and the ANSI T10 FCP-3 specification.
NOTE
The FC-BB specification series explicitly discusses FCIP. By contrast, iFCP is not explicitly discussed.
The FC backbone architecture provides WAN connectivity for FC-SANs, as shown in Figure 8-32. The details of the inner workings of each FC-BB device are determined by the functional model of the FC-BB device. The functional model is determined by the type of non-FC network to which the FC-BB device connects. The FC-BB-3 specification defines two functional models for FC-BB devices that connect to IP networks: virtual E_Port (VE_Port) and B_Access. The VE_Port model is implemented by FC switches with integrated FCIP functionality. Currently, Cisco Systems is the only FC-SAN vendor that supports the VE_Port model. The B_Access model is implemented by FCIP devices that are external to FC switches. Devices that implement the B_Access functional model are called FCIP bridges. This can be confusing because bridges operate at OSI Layers 1 and 2, yet FCIP operates at OSI Layer 5. To clarify, FCIP bridges operate at OSI Layers 1 and 2 regarding FC-SAN facing functions, but they operate at OSI Layers 1 through 5 regarding WAN facing functions. The interface within an FCIP bridge to which an FC switch connects is called a B_Port (see Chapter 5). B_Ports support limited FC-SAN functionality.
[Figure 8-32: multiple FC-SANs interconnected by FC-BB devices across a non-FC network.]
implements TCP and IP and interacts with the IP network. The FC-BB_IP Interface resides between the other two interfaces and maps FC-2 functions to TCP functions. The FC-BB_IP Interface is composed of a switching element, an FC Entity, an FCIP Entity, a control and service module (CSM), and a platform management module (PMM).
Figure 8-33 illustrates the relationships between the components of the VE_Port
functional model.
Figure 8-33 VE_Port Functional Model: Component Relationships
[Figure 8-33 depicts an FC switch containing an FC Interface (F_Ports and E_Ports), a Switching Element, an FC-BB_IP Interface, and an IP Interface. Within the FC-BB_IP Interface, the FC Entity contains the VE_Port, and the FCIP Entity contains an FCIP_LEP with its FCIP_DEs; the Control and Service Module and the Platform Management Module support both entities. The IP Interface carries the TCP connections (via the FCIP WKP) into the IP network.]
F_Ports, E_Ports, and VE_Ports communicate through the Switching Element. Each FC
Entity may contain one or more VE_Ports. Likewise, each FCIP Entity may contain one or
more FCIP link end points (FCIP_LEPs). Each VE_Port is paired with exactly one
FCIP_LEP. The four primary functions of a VE_Port are:
Receive FC frames from the Switching Element, generate a timestamp for each FC
frame, and forward FC frames with timestamps to an FCIP_DE.
Each FC Entity is paired with exactly one FCIP Entity. Each FC-BB_IP Interface may
contain one or more FC/FCIP Entity pairs. Each FCIP link terminates into an FCIP_LEP.
Each FCIP_LEP may contain one or more FCIP data engines (FCIP_DEs). Each FCIP link
may contain one or more TCP connections. Each TCP connection is paired with exactly one
FCIP_DE. The six primary functions of an FCIP_DE are:
Receive FC frames and timestamps from a VE_Port via the FC Frame Receiver Portal (a conceptual port).
Encapsulate FC frames and timestamps into TCP packets.
Receive TCP packets via the Encapsulated Frame Receiver Portal (a conceptual port).
Transmit TCP packets via the Encapsulated Frame Transmitter Portal (a conceptual
port).
De-encapsulate FC frames and timestamps from TCP packets.
Transmit FC frames and timestamps to a VE_Port via the FC Frame Transmitter
Portal (a conceptual port).
[Figure: an FCIP_DE contains an Encapsulation Engine and a De-encapsulation Engine. The FC Frame Receiver and Transmitter Portals face the VE_Port; the Encapsulated Frame Transmitter and Receiver Portals face the IP network.]
The CSM is responsible for establishing the first TCP connection in each FCIP link (FCIP link initialization). To facilitate this, the CSM listens on the FCIP WKP for incoming TCP connection requests. IANA has reserved TCP port number 3225 as the FCIP WKP. The SAN administrator may optionally configure the CSM to listen on a different TCP port. The
CSM is also responsible for establishing additional TCP connections within existing FCIP
links. When a new FCIP link is requested, the CSM creates an FC/FCIP Entity pair.
FCIP Entity encapsulates FC frames without inspecting the FC frames. So, the FCIP Entity
cannot determine the appropriate IP QoS setting based on the FC Header elds. Thus, the
FC Entity determines the appropriate IP QoS policies for FC frames that are passed to the
FCIP Entity. The FC Entity accesses IP QoS services via the CSM. The CSM is also
responsible for TCP connection teardown, FCIP link dissolution, and FC/FCIP Entity pair
deletion.
Whereas the FC Entity is responsible for generating a timestamp for each FC frame passed to the FCIP Entity, the PMM is responsible for ensuring that the timestamp is useful at the remote end of the FCIP link. This is accomplished by synchronizing the clocks of the FC switches at each end of an FCIP link. Two time services may be used for this purpose: FC Time Service or Simple Network Time Protocol (SNTP). SNTP is most commonly used. The total unidirectional transit time, including processing time for FCIP encapsulation and de-encapsulation, is defined as the FCIP transit time (FTT). Upon receipt of an FC frame and timestamp from an FCIP Entity, the FC Entity subtracts the value of the timestamp from the current value of the FC switch's internal clock to derive the FTT. If the FTT exceeds half of the E_D_TOV, the frame is discarded. Otherwise, the frame is forwarded to the destination FC end node.
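The FTT check can be illustrated with a small sketch. The timestamp handling is simplified to milliseconds and the E_D_TOV value is an assumed example; a real implementation would use the SNTP-format timestamps described later in this chapter:

```python
E_D_TOV_MS = 2000  # example E_D_TOV of 2 seconds (an assumption, not a mandate)

def frame_transit_ok(timestamp_ms: float, now_ms: float,
                     e_d_tov_ms: float = E_D_TOV_MS) -> bool:
    """Derive the FTT and apply the half-E_D_TOV discard rule."""
    ftt_ms = now_ms - timestamp_ms  # FCIP transit time
    # The frame is forwarded only if the FTT does not exceed E_D_TOV / 2.
    return ftt_ms <= e_d_tov_ms / 2
```

With the example E_D_TOV of 2000 ms, a frame arriving 900 ms after it was stamped is forwarded, while one arriving 1001 ms later is discarded.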
The PMM is also responsible for discovery of remote FCIP devices. An FC switch may be manually configured with the IP address, TCP port, and FCIP Entity Name of a remote FCIP device. Alternatively, SLPv2 may be used for dynamic discovery of FCIP devices. The PMM interfaces with the SLPv2 service and makes discovered information available to the CSM. The PMM is also responsible for certain security functions (see Chapter 12, "Storage Network Security") and general housekeeping functions such as optional event and error logging.
After an FCIP link is established, the VE_Ports at each end of the link can communicate. VE_Port pairs communicate via a virtual ISL (VISL). Each FCIP link maps to a single VISL. The FCIP link logically resides under the VISL and serves the equivalent function of a physical cable between the FC switches. VE_Ports communicate over a VISL in the same manner that E_Ports communicate over a physical ISL. For example, VE_Ports use Class F frames (ELP, ESC, and others) to initialize and maintain a VISL. Likewise, FC frames are routed across a VISL in the same manner as a physical ISL. When an FC-SAN is connected to multiple remote FC-SANs, the FSPF protocol decides which VISL to use to reach a remote FC-SAN. After FC frames are encapsulated in TCP packets, IP routing protocols are responsible for forwarding the IP packets through the IP network. At the destination FC-SAN, the FSPF protocol makes the final forwarding decisions after FC frame de-encapsulation. Figure 8-35 illustrates the communication that occurs at each layer between FC switches that employ the VE_Port functional model. Note that the layers in Figure 8-35 do not correlate to the OSI layers. Note also that Figure 8-35 indicates which Classes of Service are permitted by the standards. In practice, only Classes 3 and F are used.
[Figure 8-35: end nodes attach to F_Ports; E_Ports communicate over physical ISLs carrying all Classes of Service; VE_Ports communicate over a VISL, with each VE_Port's traffic carried by its FCIP_LEP over TCP between the FC switches.]
[Figure: the B_Access functional model. An FC switch's E_Port connects to the B_Port of an FC-BB bridge. Within the bridge, the FC-BB_IP Interface contains an FC Entity (with the B_Access portal), an FCIP Entity (with an FCIP_LEP and its FCIP_DEs), a Control and Service Module, and a Platform Management Module; the IP Interface carries the TCP connections (via the FCIP WKP) into the IP network.]
[Figure: layered communication in the B_Access model. End nodes attach to F_Ports on the FC switches; each FC switch's E_Port connects via a physical ISL (Class F, 2, 3, and 4 FC frames) to a B_Port on an FC-BB bridge; the B_Access portals of the two bridges communicate via their FCIP_LEPs over TCP.]
For more information about the FC backbone architecture and functional models, readers are encouraged to consult the ANSI T11 FC-BB-3 specification, and the IETF RFCs 3821 and 3822.
FCIP also uses WWNs that are assigned by FC. FC assigns WWNs as follows:
Each FC switch is assigned a WWN regardless of the presence of FCIP devices. This
is called the Switch_Name.
Each F_Port is assigned a WWN regardless of the presence of FCIP devices. This is
called the F_Port_Name.
Each E_Port is assigned a WWN regardless of the presence of FCIP devices. This is
called the E_Port_Name.
In the VE_Port functional model, each FCIP Link Originator/Acceptor is identified by the following:
Switch_Name
FC/FCIP Entity identifier
VE_Port_Name
In the B_Access functional model, each FCIP Link Originator/Acceptor is identified by the following:
Fabric_Name
FC/FCIP Entity identifier
B_Access_Name
FCIP is very flexible regarding IP addressing. IP addresses may be assigned using one or more of the following methods:
A single IP address may be shared by all FCIP Entities within an FC switch or FC-BB bridge device.
Each FCIP Entity within an FC switch or FC-BB bridge device may be assigned an IP address.
Furthermore, FCIP devices at each end of an FCIP link are not required to implement IP addressing via the same method. For more information about FCIP addressing, readers are encouraged to consult the ANSI T11 FC-BB-3 specification.
[Figure: FC-FE encapsulation format — Words #0–6: FC-FE Header; Word #7: FC-FE SOF; Words #8–13: FC Header; Words #14–541: FC Payload; Word #542: FC CRC; Word #543: FC-FE EOF.]
The encapsulation header is word-oriented, and an FC-FE word is four bytes. The FC-FE header contains many ones-complement fields to augment the TCP checksum. Figure 8-39 illustrates the FC-FE header format.
[Figure 8-39: Word #0: Protocol #, Version, -Protocol #, -Version; Words #1–2: Encapsulating Protocol Specific; Word #3: Flags, Frame Length, -Flags, -Frame Length; Word #4: Time Stamp (Seconds); Word #5: Time Stamp (Second Fraction); Word #6: CRC.]
Protocol #: This is 8 bits long. This field contains the IANA-assigned protocol number of the encapsulating protocol. FCIP is protocol 1.
Version: This is 8 bits long. This field indicates which version of FC-FE is being used. In essence, this field indicates the format of the packet. Currently, only version 1 is valid.
-Protocol #: This is 8 bits long. This field contains the ones complement of the Protocol # field. This field is used to validate the value of the Protocol # field.
-Version: This is 8 bits long. This field contains the ones complement of the Version field. This field is used to validate the value of the Version field.
Flags: This is 6 bits long. This field currently contains a single flag called the CRC Valid (CRCV) flag. Bits 0 through 4 are reserved and must be set to 0. The CRCV flag occupies bit 5. When the CRCV bit is set to 1, the CRC field is valid. When the CRCV bit is set to 0, the CRC field is ignored. FCIP always sets the CRCV bit to 0. This is because FCIP encapsulates the FC CRC field for end-to-end data integrity verification, and FCIP does not modify FC frames during encapsulation/de-encapsulation. Thus, the FC-FE CRC field is redundant.
Frame Length: This is 10 bits long. This field indicates the total length of the FC-FE packet (FC-FE header through FC-FE EOF) expressed in 4-byte words.
-Flags: This is 6 bits long. This field contains the ones complement of the Flags field. This field is used to validate the value of the Flags field.
-Frame Length: This is 10 bits long. This field contains the ones complement of the Frame Length field. This field is used to validate the value of the Frame Length field.
335
Time Stamp (Seconds): This is 4 bytes long. This field may be set to 0 or contain the number of seconds that have passed since 0 hour on January 1, 1900. As discussed in the VE_Port functional model section of this chapter, the value of this field is generated by the FC Entity. The format of this field complies with SNTPv4.
Time Stamp (Second Fraction): This is 4 bytes long. This field may be set to 0 or contain a number of 200-picosecond intervals to granulate the value of the Time Stamp (Seconds) field. As discussed in the VE_Port functional model section of this chapter, the value of this field is generated by the FC Entity. The format of this field complies with SNTPv4.
CRC: This is 4 bytes long. This field is valid only if the CRCV flag is set to 1. This field is ignored in FCIP implementations and must be set to 0.
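The ones-complement validation scheme can be demonstrated with a short sketch that packs Words #0 and #3 of the FC-FE header. The function names are illustrative; for FCIP, the protocol number is 1, the version is 1, and the Flags field is 0 because the CRCV bit is always 0:

```python
import struct

def pack_fcfe_word0_word3(protocol: int, version: int, flags: int,
                          frame_len_words: int) -> bytes:
    """Pack FC-FE header Words 0 and 3, including the ones-complement fields."""
    w0 = ((protocol << 24) | (version << 16)
          | ((~protocol & 0xFF) << 8) | (~version & 0xFF))
    # Word 3: Flags (6 bits) | Frame Length (10 bits) | -Flags | -Frame Length.
    w3 = ((flags << 26) | (frame_len_words << 16)
          | ((~flags & 0x3F) << 10) | (~frame_len_words & 0x3FF))
    return struct.pack(">II", w0, w3)

def validate_word0(w0: int) -> bool:
    """Check that -Protocol # and -Version are the complements of their fields."""
    protocol = (w0 >> 24) & 0xFF
    version = (w0 >> 16) & 0xFF
    return (((w0 >> 8) & 0xFF) == (~protocol & 0xFF)
            and (w0 & 0xFF) == (~version & 0xFF))
```

For FCIP (protocol 1, version 1), Word #0 is 0x0101FEFE, which a receiver can validate without any other context.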
RFC 3821 defines the format of the Encapsulating Protocol Specific field (Words 1 and 2) as implemented by FCIP. Figure 8-40 illustrates the FCIP format of the Encapsulating Protocol Specific field.
Figure 8-40 FCIP Format of Encapsulating Protocol Specific Field
[Word #1: Protocol #, Version, -Protocol #, -Version; Word #2: pFlags, Reserved, -pFlags, -Reserved.]
Protocol Specific Flags (pFlags): This is 8 bits long. This field currently contains two flags called the Changed (Ch) flag and the special frame (SF) flag. The Ch flag occupies bit 0. Bits 1 through 6 are reserved and must be set to 0. The SF flag occupies bit 7. When the SF flag is set to 1, the FCIP packet contains an FCIP special frame (FSF) instead of an FC frame. When the SF flag is set to 0, the FCIP packet contains an FC frame. The Ch flag may be set to 1 only when the SF flag is set to 1. When the Ch flag is set to 1, the FSF has been changed by the FCIP Link Acceptor. When the Ch flag is set to 0, the FSF has not been changed by the FCIP Link Acceptor.
-pFlags: This is 8 bits long. This field contains the ones complement of the pFlags field. This field is used to validate the value of the pFlags field.
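The pFlags rules lend themselves to a small validity check. In IETF bit order, bit 0 (Ch) is the most significant bit of the byte and bit 7 (SF) is the least significant; the constant and function names are illustrative:

```python
CH_FLAG = 0x80  # Changed flag: IETF bit 0 (most significant bit of the byte)
SF_FLAG = 0x01  # Special Frame flag: IETF bit 7 (least significant bit)

def pflags_valid(pflags: int) -> bool:
    """Enforce the pFlags rules: reserved bits zero; Ch permitted only with SF."""
    if pflags & ~(CH_FLAG | SF_FLAG):  # bits 1 through 6 are reserved
        return False
    if (pflags & CH_FLAG) and not (pflags & SF_FLAG):
        return False  # Ch may be 1 only when SF is 1
    return True
```

Thus 0x00 (FC frame), 0x01 (unchanged FSF), and 0x81 (changed FSF) are valid, while 0x80 (Ch without SF) is not.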
The FC frame delimiters must be encapsulated because they serve several purposes rather than merely delimiting the start and end of each FC frame (see Chapter 5). However, FC frame delimiters are 8B/10B encoded words (40 bits long). Some of the 10-bit characters that compose FC SOF and EOF words have no 8-bit equivalent. Therefore, FC SOF and EOF words cannot be decoded into 32-bit words for transmission across an IP network. The solution to this problem is to represent each SOF and EOF word with an 8-bit code that can be encapsulated for transmission across an IP network. These codes are called Ordered Set Codes (OS-Codes). They are defined in the FC-BB-3 specification. The OS-Codes relevant to FC-FE are also listed in RFC 3643. Table 8-24 summarizes the OS-Codes that are relevant to FC-FE.
Table 8-24 OS-Codes Relevant to FC-FE
OS-Code  Frame Delimiter  Class of Service
0x28     SOFf             F
0x2D     SOFi2            2
0x35     SOFn2            2
0x2E     SOFi3            3
0x36     SOFn3            3
0x29     SOFi4            4
0x31     SOFn4            4
0x39     SOFc4            4
0x41     EOFn             F, 2, 3, 4
0x42     EOFt             F, 2, 3, 4
0x49     EOFni            F, 2, 3, 4
0x50     EOFa             F, 2, 3, 4
0x44     EOFrt            4
0x46     EOFdt            4
0x4E     EOFdti           4
0x4F     EOFrti           4
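Table 8-24 is essentially a lookup table, which an encapsulation routine might hold as a pair of dictionaries. This is an illustrative sketch (the names are mine):

```python
# OS-Codes for SOF and EOF delimiters, per Table 8-24.
SOF_CODES = {"SOFf": 0x28, "SOFi2": 0x2D, "SOFn2": 0x35, "SOFi3": 0x2E,
             "SOFn3": 0x36, "SOFi4": 0x29, "SOFn4": 0x31, "SOFc4": 0x39}
EOF_CODES = {"EOFn": 0x41, "EOFt": 0x42, "EOFni": 0x49, "EOFa": 0x50,
             "EOFrt": 0x44, "EOFdt": 0x46, "EOFdti": 0x4E, "EOFrti": 0x4F}

def os_code(delimiter: str) -> int:
    """Map an FC frame delimiter name to its FC-FE OS-Code."""
    return {**SOF_CODES, **EOF_CODES}[delimiter]
```

For example, a Class 3 frame with an initial SOF would carry OS-Code 0x2E (SOFi3) in the FC-FE SOF field.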
The format of the FC-FE SOF field is illustrated in Figure 8-41. The FC-FE EOF field uses the same format (substituting EOF values for SOF values) and follows the same rules for sub-field interpretation.
Figure 8-41 FC-FE SOF Field Format
[Word #7 contains the SOF code twice, followed by its ones complement twice: SOF, SOF, -SOF, -SOF.]
First SOF: This is 8 bits long. This field contains the OS-Code that maps to the FC SOF ordered set associated with the encapsulated FC frame.
First -SOF: This is 8 bits long. This field contains the ones complement of the first SOF field.
When a TCP connection is established, the first packet transmitted by the FCIP Link Originator must be an FSF. The FCIP Link Acceptor may not transmit any packets until it receives the FSF. Upon receipt of the FSF, the FCIP Link Acceptor must transmit the FSF back to the FCIP Link Originator before transmitting any other packets. The FCIP Link Originator may not send additional packets until the echoed FSF is received. The FSF serves the following five purposes:
NOTE
Despite the FC-BB mandate that discovery be accomplished using SLPv2, RFC 3821 permits limited discovery using the FSF. However, RFC 3821 encourages the use of SLPv2 because SLPv2 supports discovery of more configuration parameters, is more extensible, and is more secure.
[Figure: FSF format — Words #0–6: FC-FE Header; Word #7: Reserved, -Reserved; Words #8–9: Source FC Fabric Entity WWN; Words #10–11: Source FC/FCIP Entity Identifier; Words #12–13: Connection Nonce; Word #14: Connection Usage Flags, Reserved, Connection Usage Code; Words #15–16: Destination FC Fabric Entity WWN; Word #17: K_A_TOV; Word #18: Reserved, -Reserved.]
Connection Usage Flags: This is 8 bits long. This field indicates the Classes of Service that the FCIP link is intended to transport. In practice, only Classes 3 and F are transported. FCIP does not restrict the link usage to comply with the information conveyed by this field. This field is only used to facilitate collaborative control of IP QoS settings by the FC Entity and FCIP Entity.
The FC-BB-3 specification defines the LKA ELS. Remember that a SW_ILS is an ELS that may be transmitted only between fabric elements. The LKA is exchanged only between VE_Port peers and B_Access peers. So, one might think that the LKA is a SW_ILS. However, the valid responses to an LKA are LS_ACC and LS_RJT (not SW_ACC and SW_RJT). Another key difference is that, unlike most SW_ILSs that must be transmitted in Class F Exchanges, the LKA may be transmitted using any Class of Service supported by the FCIP link. Thus, the FC-BB-3 specification refers to the LKA as an ELS instead of as a SW_ILS. The LKA serves two purposes: the LKA keeps an FCIP link active when no other traffic is traversing the link, and the LKA verifies link connectivity when no FCIP traffic is received for an abnormal period of time. In the case of multi-connection links, the LKA may be sent on each TCP connection. The number of unacknowledged LKAs that signify loss of connectivity is configurable, and the default value is 2. When a TCP
connection failure is detected via LKA, the FC Entity may attempt to re-establish the connection (via the FCIP Entity) or redirect the affected flows to one of the surviving TCP connections within the FCIP link (assuming that another TCP connection is available). When an FCIP link failure is detected via LKA, the FC Entity must attempt to re-establish the FCIP link (via the FCIP Entity). Support for the LKA is optional. If supported, at least one LKA should be sent per K_A_TOV during periods of inactivity, but LKAs may be sent more frequently. The LKA may be sent during periods of bidirectional activity, but the LKA serves no useful purpose during such periods. For implementations that support TCP keep-alives, the LKA might be unnecessary. However, the LKA is sent from one FC Entity to another. So, the LKA verifies connectivity across more of the end-to-end path than TCP keep-alives do. For this reason, SAN administrators who wish to eliminate redundant keep-alive traffic should consider disabling the TCP keep-alive. Because the default K_A_TOV is lower than the typical TCP keep-alive interval, disabling the TCP keep-alive should have no negative side effects. If the TCP keep-alive is used, and the FCIP Entity detects a failed TCP connection, the FCIP Entity must notify the FC Entity and provide diagnostic information about the failure. The FC Entity decides whether to re-establish the TCP connection (via the FCIP Entity). An LKA may be rejected via the LS_RJT ELS, but receipt of an LS_RJT serves the purpose of keeping the link active and verifies connectivity to the peer FC Entity regardless of the Reject Reason Code. The FC-BB-3 specification does not define any Reject Reason Codes or Reason Explanations that are unique to the LKA. Receipt of an LS_ACC ELS signifies a successful response to an LKA. Both the LKA and its associated LS_ACC use the same frame format. Figure 8-43 illustrates the data field format of an LKA/LS_ACC ELS frame.
Figure 8-43 Data Field Format of an FC LKA/LS_ACC ELS Frame
[Word #0: LS Command Code, followed by three unused bytes.]
LS Command Code: This is 8 bits long. This field contains the LKA command code (0x80) or the LS_ACC command code (0x02).
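The unacknowledged-LKA counting described above can be sketched as a small monitor. The class is illustrative (not from FC-BB-3); the default threshold of 2 matches the default described in the text:

```python
class LkaMonitor:
    """Track unacknowledged LKAs on one TCP connection of an FCIP link."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold  # unacknowledged LKAs signifying lost connectivity
        self.outstanding = 0

    def lka_sent(self) -> None:
        self.outstanding += 1

    def response_received(self) -> None:
        # Either LS_ACC or LS_RJT keeps the link active and verifies
        # connectivity, so both reset the counter.
        self.outstanding = 0

    def connectivity_lost(self) -> bool:
        return self.outstanding >= self.threshold
```

On `connectivity_lost()` returning true, the FC Entity would attempt to re-establish the connection or redirect the affected flows, as described above.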
The FC-BB-3 specification defines the EBP SW_ILS. When using the B_Access functional model, the EBP complements the ELP. One B_Access portal transmits an EBP to another B_Access portal to exchange operating parameters similar to those exchanged via the ELP. The valid responses to an EBP are SW_ACC and SW_RJT. The SW_ACC contains the responder's operating parameters. No other frames may be transmitted before successful exchange of the EBP. The normal switch-to-switch extended link initialization procedure occurs after the EBP exchange (see Chapter 5). Both the EBP and its associated SW_ACC
use the same frame format. Figure 8-44 illustrates the data eld format of an EBP/
SW_ACC SW_ILS frame.
Figure 8-44 Data Field Format of an FC EBP/SW_ACC SW_ILS Frame
[Word #0: SW_ILS Command Code; Word #1: R_A_TOV; Word #2: E_D_TOV; Word #3: K_A_TOV; Words #4–5: Requester/Responder B_Access_Name; Words #6–9: Class F Service Parameters.]
SW_ILS Command Code: This is 4 bytes long. This field contains the EBP command code (0x28010000) when transmitted by a requestor. This field contains the SW_ACC command code (0x02000000) when transmitted by a responder.
R_A_TOV: This is 4 bytes long. This field contains the sender's required R_A_TOV expressed in milliseconds.
E_D_TOV: This is 4 bytes long. This field contains the sender's required E_D_TOV expressed in milliseconds.
K_A_TOV: This is 4 bytes long. This field contains the sender's required K_A_TOV expressed in milliseconds.
Class F Service Parameters: This is 16 bytes long. The format and use of this field are identical to its format and use in an ELP frame (see Chapter 5). This field contains various B_Access operating parameters. Key parameters include the class-specific MTU (ULP buffer size), the maximum number of concurrent Class F sequences, the maximum number of concurrent sequences within each exchange, and the number of end-to-end Credits (EE_Credits) supported. Some FC switches allow the network administrator to manually configure one or more of these values.
If a parameter mismatch occurs, the EBP command is rejected with an SW_RJT (see Chapter 5). The reason codes defined in the FC-SW-4 specification are used for EBP. However, the FC-BB-3 specification redefines the reason code explanations that are used for EBP. Table 8-25 summarizes the EBP reject reason code explanations. All reason code explanations excluded from Table 8-25 are reserved.
Table 8-25 EBP Reject Reason Code Explanations
Reason Code Explanation  Description
0x00                     No Additional Explanation
0x01
0x02                     Invalid B_Access_Name
0x03                     K_A_TOV Mismatch
0x04                     E_D_TOV Mismatch
0x05                     R_A_TOV Mismatch
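A responder's timer-mismatch check maps directly onto the explanations in Table 8-25. The following sketch (the function name and the dictionary-based parameter representation are mine) returns the reason code explanation for the first mismatched timer, or None when the timers agree:

```python
def ebp_mismatch_explanation(request: dict, local: dict):
    """Compare EBP timer values against local values; return the reject
    reason code explanation from Table 8-25 on mismatch, else None."""
    checks = (
        ("K_A_TOV", 0x03),  # K_A_TOV Mismatch
        ("E_D_TOV", 0x04),  # E_D_TOV Mismatch
        ("R_A_TOV", 0x05),  # R_A_TOV Mismatch
    )
    for timer, explanation in checks:
        if request[timer] != local[timer]:
            return explanation
    return None
```

All values are in milliseconds, matching the field definitions above; a real implementation would also validate the B_Access_Name (explanation 0x02).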
For more information about the FCIP packet formats and FC-BB frame formats, readers are encouraged to consult the IETF RFCs 3643 and 3821, and the ANSI T11 FC-BB-2, FC-BB-3, FC-LS, FC-SW-3, and FC-SW-4 specifications.
At this point, the two FC-SANs are merged into one FC-SAN. N_Ports can query their local Name Server to discover the new target devices in the remote segment of the merged FC-SAN. Additionally, Class 2, 3, and 4 frames can be exchanged between N_Ports across the new VISL.
If the FC Entities at each end of an FCIP link lose time synchronization, the timestamps delivered to the receiving FC Entity might result in dropped FC frames. Each incoming encapsulated FC frame must have an FTT equal to or less than half the E_D_TOV. Frames that do not meet this requirement must be discarded by the receiving FC Entity. When this happens, the destination N_Port is responsible for detecting the dropped frame and taking the appropriate action based on the Exchange Error Policy in effect for the impacted Exchange. For this reason, SAN administrators should ensure that the time server infrastructure is redundant and that the relevant FC time-out values are set correctly to accommodate the FTT (taking jitter into account).
NOTE
As previously discussed, the TCP checksum does not detect all bit errors. Therefore, the
receiving FCIP Entity is required to validate each FC frame before forwarding it to the FC
Entity. The receiving FCIP Entity must verify:
Additionally, the FCIP Entity must verify that each packet received contains the proper
content (for example, an FSF in the rst packet received on each TCP connection). Invalid
FC frames may be dropped by the FCIP Entity or forwarded to the FC Entity with an EOF
that indicates that the frame is invalid. Forwarded invalid FC frames are dropped by the
destination N_Port. FC is responsible for re-transmission in accordance with the Exchange
Error Policy in affect for the affected Exchange.
The FC-BB-3 specification requires FC-BB_IP devices to coordinate the TCP/IP flow-control mechanisms with the FC flow-control mechanisms. Specifically, the FC Entity must cooperate with the FCIP Entity to manage flow control in both the IP network and the FC-SAN as if the two networks were one. In principle, this is accomplished by managing the number of BB_Credits advertised locally based on the size of the peer FCIP device's advertised TCP window. However, no specific methods are mandated or provided by the ANSI or IETF standards, so the implementation choice is vendor-specific. If this requirement is not implemented properly, dropped FC frames or TCP packets can result.
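Because the standards leave the method open, the coupling can only be sketched. One plausible (entirely hypothetical) approach is to cap the locally advertised BB_Credits by the number of full-size FC frames the peer's TCP window can absorb:

```python
# Hypothetical sketch of BB_Credit / TCP-window coupling. No standard mandates
# this formula; real implementations are vendor-specific.

def bb_credits_to_advertise(local_rx_buffers: int,
                            peer_tcp_window_bytes: int,
                            max_fc_frame_bytes: int = 2148) -> int:
    """Advertise no more credits than the peer's TCP window can hold in frames."""
    window_frames = peer_tcp_window_bytes // max_fc_frame_bytes
    return min(local_rx_buffers, window_frames)

# With 64 local buffers but only a 64-KB TCP window, the window is the
# bottleneck: 65535 // 2148 = 30 frames.
print(bb_credits_to_advertise(64, 65535))  # 30
```

The point of the sketch is the direction of the dependency: if the IP network backs up (the TCP window shrinks), the FC-SAN must be throttled too, or frames will be dropped at the FCIP boundary.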
NOTE
The challenges of mapping flow-control information between the FC and IP networks apply equally to iFCP implementations.
FCIP packets enter the TCP byte stream without the use of a message synchronization
scheme. If an FCIP_DE loses message synchronization, it may take any one of the
following actions:
Search the incoming TCP byte stream for a valid FCIP header and discard all bytes
received until a valid FCIP header is found.
If message synchronization cannot be recovered, and the TCP connection is closed, the
FC Entity is responsible for re-establishing the TCP connection (via the FCIP Entity).
Regardless of whether message synchronization is recovered, the affected N_Ports are responsible for detecting and retransmitting all dropped FC frames in accordance with the Exchange Error Policy in effect for the affected Exchanges.
When the FCIP Entity detects an error, the FC Entity is notified via proprietary means. The FC Entity may notify the PMM via proprietary means for the purpose of error logging. The FC Entity must convert each error notification into a registered link incident report (RLIR). The RLIR is then forwarded to the domain controller of the switch. For certain errors, the FC Entity also forwards the RLIR to the management server of the switch. For more information about the FCIP delivery mechanisms and error reporting procedures, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-2, FC-BB-3, and FC-LS specifications.
NOTE
MTU discovery during FLOGI does not account for FCIP links. FCIP links are completely transparent to N_Ports. Also, the MTU of each FC-SAN that will be merged via FCIP must be the same.
As mentioned in the FCP section of this chapter, some of the newer optimizations being offered seek to accelerate SCSI operations. The most common technique is to spoof the FCP_XFER_RDY signal to eliminate excessive round trips across the WAN during SCSI write operations. Most, but not all, SAN extension vendors support some kind of FCP_XFER_RDY spoofing technique. Cisco Systems supports two FCP_XFER_RDY spoofing techniques on all FCIP products. One, for disk I/O, is called FCIP write acceleration (FCWA). The other, for tape I/O, is called FCIP tape acceleration (FCTA).
Summary
This chapter concludes a multi-chapter examination of the operational details of the
primary storage networking technologies being deployed today: iSCSI, FCP, and FCIP. The
discussions provided in this chapter focus on OSI session layer functionality to give
readers a firm understanding of how each technology meets the requirements of the SAM
Transport Protocol model without delving too deeply into the application-oriented
functionality of SCSI. The information provided in this chapter should prepare readers to
design and deploy storage networks with greater understanding of the implications of their
design choices.
Review Questions
1 Name the three types of iSCSI names.
2 Why does RFC 3720 encourage the reuse of ISIDs?
3 What key feature does iSNS support that SLP does not support?
4 In the context of iSCSI, what is the difference between immediate data and unsolicited
data?
5 What is the function of the iSCSI R2T PDU?
6 Can security parameters be re-negotiated during an active iSCSI session?
7 Why must iSCSI use its own CRC-based digests?
8 Does iSCSI support stateful recovery of active tasks in the presence of TCP
connection failures?
9 Does FCP natively support the exchange of all required operating parameters to
14 When an FCP target that supports only Exchange level error recovery detects a frame
networks?
19 Which field in the FSF may be set to zero for the purpose of discovery?
20 What procedure begins after an FCIP link becomes fully active?
21 Why must jumbo frames be supported end-to-end if they are implemented?
PART III
List all of the flow-control and QoS mechanisms related to modern storage networks
Describe the general characteristics of each of the flow-control and QoS mechanisms related to modern storage networks
The principle of operation for half-duplex upper layer protocols (ULPs) over
full-duplex network protocols
buffer requirements or a decrease in throughput. Because all devices have finite memory resources, degraded throughput is inevitable if network latency continues to increase over time. Few devices support dynamic reallocation of memory to or from the receive buffer pool based on real-time fluctuations in network latency (called jitter), so the maximum expected RTT, including jitter, must be used to calculate the buffer requirements to sustain optimal throughput. More buffers increase equipment cost. So, more network latency and more jitter result in higher equipment cost if optimal throughput is to be sustained.
Support for retransmission also increases equipment cost. Aside from the research and
development (R&D) cost associated with the more advanced software, devices that support
retransmission must buffer transmitted frames or packets until they are acknowledged by
the receiving device. This is advantageous because it avoids reliance on ULPs to detect and
retransmit dropped frames or packets. However, the transmit buffer either consumes
memory resources that would otherwise be available to the receive buffer (thus affecting flow control and degrading throughput) or increases the total memory requirement of a
device. The latter is often the design choice made by device vendors, which increases
equipment cost.
The factors that contribute to end-to-end latency include transmission delay, serialization
delay, propagation delay, and processing delay. Transmission delay is the amount of time
that a frame or packet must wait in a queue before being serialized onto a wire. QoS policies
affect transmission delay. Serialization delay is the amount of time required to transmit a
signal onto a wire. Frames or packets must be transmitted one bit at a time when using serial
communication technologies. Thus, bandwidth determines serialization delay. Propagation
delay is the time required for a bit to propagate from the transmitting port to the receiving
port. The speed of light through an optical fiber is 5 microseconds per kilometer. Processing delay includes, but is not limited to, the time required to perform the various lookup, classification, and queuing steps within a network device. The order of processing steps depends on the architecture of the network device and its configuration. Processing delay varies depending on the architecture of the network device and which steps are taken.
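Two of the delay components above can be computed directly from the numbers given in the text. The sketch below is illustrative; the frame size and link rate are example values, and queuing and processing delays are omitted because they depend on device architecture and load.

```python
# Back-of-the-envelope calculation of two end-to-end delay components.

def serialization_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Time to clock a frame onto the wire, determined by bandwidth."""
    return frame_bytes * 8 / link_bps * 1e6

def propagation_delay_us(distance_km: float) -> float:
    """Time for a bit to traverse optical fiber at ~5 microseconds per km."""
    return distance_km * 5.0

# Example: a 2148-byte FC frame on a 1-Gbps link across 100 km of fiber.
ser = serialization_delay_us(2148, 1e9)   # ~17.2 microseconds
prop = propagation_delay_us(100)          # 500 microseconds
print(round(ser, 1), round(prop, 1))
```

Note how quickly propagation delay dominates over distance, which is why long-distance links drive up the buffering requirements discussed above.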
Ethernet QoS
Ethernet supports QoS via the Priority field in the header tag defined by the IEEE 802.1Q-2003 specification. Whereas the 802.1Q-2003 specification defines the header tag format, the IEEE 802.1D-2004 specification defines the procedures for setting the priority bits. Because the Priority field is 3 bits long, eight priority levels are supported. Currently, only seven traffic classes are considered necessary to provide adequate QoS. The 802.1D-2004 specification defines these seven traffic classes.
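The 3-bit Priority field lives in the 16-bit Tag Control Information (TCI) portion of the 802.1Q tag, alongside a 1-bit DEI/CFI flag and a 12-bit VLAN ID. The sketch below builds a TCI value; the priority and VLAN numbers are illustrative.

```python
import struct

# Sketch of the 16-bit 802.1Q Tag Control Information field:
# 3-bit Priority (top bits), 1-bit DEI/CFI, 12-bit VLAN ID.

def build_tci(priority: int, vlan_id: int, dei: int = 0) -> bytes:
    assert 0 <= priority <= 7 and 0 <= vlan_id <= 4095 and dei in (0, 1)
    return struct.pack('!H', (priority << 13) | (dei << 12) | vlan_id)

# Priority 5 on VLAN 100: 0b101 shifted into the top three bits gives 0xA064.
tci = build_tci(priority=5, vlan_id=100)
print(tci.hex())  # a064
```

Because the Priority field is 3 bits, exactly eight levels (0 through 7) can be encoded, matching the discussion above.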
IP Flow Control
IP employs several flow-control mechanisms. Some are explicit, and others are implicit. All are reactive. The supported mechanisms include the following:
Tail-drop
Internet Control Message Protocol (ICMP) Source-Quench
Active Queue Management (AQM)
Explicit Congestion Notification (ECN)
Tail-drop is the historical mechanism for routers to control the rate of flows between end nodes. It often is implemented with a FIFO algorithm. When packets are dropped from the tail of a full queue, the end nodes detect the dropped frames via TCP mechanisms. TCP then reduces its window size, which precipitates a reduction in the rate of transmission. Thus, tail-drop constitutes implicit, reactive flow control.
ICMP Source-Quench messages can be used to explicitly convey a request to reduce the
rate of transmission at the source. ICMP Source-Quench messages may be sent by any
IP device in the end-to-end path. Conceptually, the ICMP Source-Quench mechanism
operates in a manner similar to the Ethernet Pause Opcode. A router may choose to send an
ICMP Source-Quench packet to a source node in response to a queue overrun. Alternately,
a router may send an ICMP Source-Quench packet to a source node before a queue
overruns, but this is not common. Despite the fact that ICMP Source-Quench packets can
be sent before a queue overrun occurs, ICMP Source-Quench is considered a reactive
mechanism because some indication of congestion or potential congestion must trigger the
transmission of an ICMP Source-Quench message. Thus, additional packets can be transmitted
by the source nodes while the ICMP Source-Quench packets are in transit, and tail-drop can
occur even after proactive ICMP Source-Quench packets are sent. Upon receipt of an
ICMP Source-Quench packet, the IP process within the source node must notify the
appropriate Network Layer protocol or ULP. The notied Network Layer protocol or ULP
is then responsible for slowing its rate of transmission. ICMP Source-Quench is a
rudimentary mechanism, so few modern routers depend on ICMP Source-Quench messages
as the primary means of avoiding tail-drop.
RFC 2309 defines the concept of AQM. Rather than merely dropping packets from the tail of a full queue, AQM employs algorithms that attempt to proactively avoid queue overruns by selectively dropping packets prior to queue overrun. The first such algorithm is called Random Early Detection (RED). More advanced versions of RED have since been developed. The most well known are Weighted RED (WRED) and DiffServ Compliant WRED. All RED-based algorithms attempt to predict when congestion will occur and abate based on rising and falling queue level averages. As a queue level rises, so does the probability of packets being dropped by the AQM algorithm. The packets to be dropped are selected at random when using RED. WRED and DiffServ Compliant WRED consider the traffic class when deciding which packets to drop, which results in administrative control of the probability of packet drop. All RED-based algorithms constitute implicit flow control because the dropped packets must be detected via TCP mechanisms. Additionally, all RED-based algorithms constitute reactive flow control because some indication of potential congestion must trigger the packet drop. The proactive nature of packet drop as implemented by AQM algorithms should not be confused with proactive flow-control mechanisms that exchange buffer resource information before data transfer occurs, to completely avoid frame/packet drops. Note that in the most generic sense, sending an ICMP Source-Quench message before queue overrun occurs based on threshold settings could be considered a form of AQM. However, the most widely accepted definition of AQM does not include ICMP Source-Quench.
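The RED drop decision can be sketched in a few lines. The thresholds and maximum probability below are illustrative tuning parameters, not values from RFC 2309; real implementations also compute an exponentially weighted queue average rather than using the instantaneous depth.

```python
import random

# Minimal sketch of the RED drop decision: the drop probability rises linearly
# between a minimum and a maximum average-queue threshold.

def red_drop_probability(avg_queue: float, min_th: float = 20.0,
                         max_th: float = 80.0, max_p: float = 0.1) -> float:
    if avg_queue < min_th:
        return 0.0            # below min threshold: never drop
    if avg_queue >= max_th:
        return 1.0            # at or above max threshold: behave like tail-drop
    return max_p * (avg_queue - min_th) / (max_th - min_th)

def red_should_drop(avg_queue: float) -> bool:
    """Packets are selected at random, weighted by the current probability."""
    return random.random() < red_drop_probability(avg_queue)

print(red_drop_probability(50))  # 0.05: halfway between the thresholds
```

WRED extends this by keeping a separate (min_th, max_th, max_p) profile per traffic class, which is how administrators control the relative drop probability described above.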
ECN is another method of implementing AQM. ECN enables routers to convey congestion information to end nodes explicitly by marking packets with a congestion indicator rather than by dropping packets. When congestion is experienced by a packet in transit, the congested router sets the two ECN bits to 11. The destination node then notifies the source node (see the TCP Flow Control section of this chapter). When the source node receives notification, the rate of transmission is slowed. However, ECN works only if the Transport Layer protocol supports ECN. TCP supports ECN, but many TCP implementations do not yet implement ECN. For more information about IP flow control, readers are encouraged to consult IETF RFCs 791, 792, 896, 1122, 1180, 1812, 2309, 2914, and 3168.
IP QoS
IP QoS is a robust topic that defies precise summarization. That said, we can categorize all IP QoS models into one of two very general categories: stateful and stateless. Currently, the dominant stateful model is the Integrated Services Architecture (IntServ), and the dominant stateless model is the Differentiated Services Architecture (DiffServ).
The IntServ model is characterized by application-based signaling that conveys a request for flow admission to the network. The signaling is typically accomplished via the Resource Reservation Protocol (RSVP). The network either accepts the request and admits the new flow or rejects the request. If the flow is admitted, the network guarantees the requested service level end-to-end for the duration of the flow. This requires state to be maintained for each flow at each router in the end-to-end path. If the flow is rejected, the application may transmit data, but the network does not provide any service guarantees. This is known as best-effort service. It is currently the default service offered by the Internet. With best-effort service, the level of service rendered varies as the cumulative load on the network varies.
The DiffServ model does not require any signaling from the application prior to data transmission. Instead, the application marks each packet via the Differentiated Services Codepoint (DSCP) field to indicate the desired service level. The first router to receive each packet (typically the end node's default gateway) conditions the flow to comply with the traffic profile associated with the requested DSCP value. Such routers are called conditioners. Each router (also called a hop) in the end-to-end path then forwards each packet according to Per Hop Behavior (PHB) rules associated with each DSCP value. The conditioners decouple the applications from the mechanism that controls the cumulative load placed on the network, so the cumulative load can exceed the network's cumulative capacity. When this happens, packets may be dropped in accordance with PHB rules, and the affected end nodes must detect such drops (usually via TCP but sometimes via ICMP Source-Quench). In other words, the DiffServ model devolves into best-effort service for some flows when the network capacity is exceeded along a given path.
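An application's side of the DiffServ model can be illustrated with an ordinary socket: the DSCP value occupies the top 6 bits of the IP header's (former) TOS byte, with the bottom 2 bits reserved for ECN. The sketch below marks a UDP socket with DSCP 46 (Expedited Forwarding); it assumes a platform (such as Linux) that exposes the `IP_TOS` socket option.

```python
import socket

# Hedged example: requesting a DiffServ service level by writing a DSCP value
# into the TOS byte. DSCP 46 (Expedited Forwarding) shifted left 2 bits = 184.

DSCP_EF = 46

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)
print(s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 184
s.close()
```

Whether the mark is honored, re-marked, or policed is then up to the conditioner at the first hop, as described above.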
Both of these QoS models have strengths and weaknesses. At first glance, the two models
would seem to be incompatible. However, the two models can interwork, and various RFCs
have been published detailing how such interworking may be accomplished. For more
information about IP QoS, readers are encouraged to consult IETF RFCs 791, 1122, 1633,
1812, 2205, 2430, 2474, 2475, 2815, 2873, 2963, 2990, 2998, 3086, 3140, 3260, 3644, and
4094.
For more information about TCP flow control, readers are encouraged to consult IETF RFCs 792, 793, 896, 1122, 1180, 1323, 1812, 2309, 2525, 2581, 2914, 3042, 3155, 3168, 3390, 3448, 3782, and 4015.
TCP QoS
TCP interacts with the QoS mechanisms implemented by IP. Additionally, TCP provides two explicit QoS mechanisms of its own: the Urgent and Push flags in the TCP header. The Urgent flag indicates whether the Urgent Pointer field is valid. When valid, the Urgent Pointer field indicates the location of the last byte of urgent data in the packet's Data field. The Urgent Pointer field is expressed as an offset from the Sequence Number in the TCP header. No indication is provided for the location of the first byte of urgent data. Likewise, no guidance is provided regarding what constitutes urgent data. A ULP or application decides when to mark data as urgent. The receiving TCP node is not required to take any particular action upon receipt of urgent data, but the general expectation is that some effort will be made to process the urgent data sooner than otherwise would occur if the data were not marked urgent.
As previously discussed, TCP decides when to transmit data received from a ULP. However, a ULP occasionally needs to be sure that data submitted to the source node's TCP byte stream has actually been sent to the destination. This can be accomplished via the push function. A ULP informs TCP that all data previously submitted needs to be pushed to the destination ULP by requesting (via the TCP service provider interface) the push function. This causes TCP in the source node to immediately transmit all data in the byte stream and to set the Push flag to one in the final packet. Upon receiving a packet with the Push flag set to 1, TCP in the destination node immediately forwards all data in the byte stream to the required ULPs (subject to the rules for in-order delivery based on the Sequence Number field). For more information about TCP QoS, readers are encouraged to consult IETF RFCs 793 and 1122.
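The Urgent and Push flags and the Urgent Pointer can be located directly in the TCP header. The sketch below hand-builds a 20-byte header for illustration (the port numbers and pointer value are arbitrary) and then parses the two flags back out.

```python
import struct

# Sketch: locating the Urgent (URG) and Push (PSH) flags and the Urgent Pointer
# in a TCP header. Field layout follows RFC 793.

def parse_tcp_flags(header: bytes) -> dict:
    (src, dst, seq, ack,
     off_flags, window, cksum, urg_ptr) = struct.unpack('!HHIIHHHH', header[:20])
    return {'URG': bool(off_flags & 0x20),      # Urgent flag
            'PSH': bool(off_flags & 0x08),      # Push flag
            'urgent_pointer': urg_ptr}          # offset of last urgent byte

# Hand-built header: data offset 5 (no options), URG and PSH set (0x20 | 0x08),
# Urgent Pointer of 4, meaning the last urgent byte is offset 4 from the
# Sequence Number, as described above.
hdr = struct.pack('!HHIIHHHH', 1024, 80, 1000, 0, (5 << 12) | 0x28, 65535, 0, 4)
print(parse_tcp_flags(hdr))  # {'URG': True, 'PSH': True, 'urgent_pointer': 4}
```

Note that the parser confirms the point made in the text: the header identifies only the last byte of urgent data, never the first.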
unsolicited data is supported, the FirstBurstLength text key controls how much data may
be transferred in or with the SCSI Command PDU, thus performing an equivalent
function to the Desired Data Transfer Length eld. The MaxRecvDataSegmentLength
text key controls how much data may be transferred in a single Data-Out or Data-In PDU.
The MaxBurstLength text key controls how much data may be transferred in a single
PDU sequence (solicited or unsolicited). Thus, the FirstBurstLength value must be equal to
or less than the MaxBurstLength value. The MaxConnections text key controls how many
TCP connections may be aggregated into a single iSCSI session, thus controlling the
aggregate TCP window size available to a session. The MaxCmdSN eld in the Login
Response BHS and SCSI Response BHS controls how many SCSI commands may be
outstanding simultaneously. For more information about iSCSI flow control, readers are
encouraged to consult IETF RFC 3720.
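The constraint among the burst-length keys can be sketched as a validation step. The key names and default values below match RFC 3720; the validation function itself is a hypothetical implementation choice, not part of the protocol.

```python
# Illustrative check of the iSCSI burst-length constraint described above:
# a negotiated FirstBurstLength must not exceed MaxBurstLength (RFC 3720).

RFC3720_DEFAULTS = {'FirstBurstLength': 65536, 'MaxBurstLength': 262144}

def validate_burst_keys(params: dict) -> dict:
    first = params.get('FirstBurstLength', RFC3720_DEFAULTS['FirstBurstLength'])
    maximum = params.get('MaxBurstLength', RFC3720_DEFAULTS['MaxBurstLength'])
    if first > maximum:
        raise ValueError('FirstBurstLength must be <= MaxBurstLength')
    return {'FirstBurstLength': first, 'MaxBurstLength': maximum}

print(validate_burst_keys({'FirstBurstLength': 65536}))
```

In a real login negotiation, an offer violating this relationship would be rejected or countered rather than raising an exception, but the ordering requirement is the same.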
iSCSI QoS
iSCSI depends primarily on lower-layer protocols to provide QoS. However, iSCSI
provides support for expedited command processing via the I bit in the BHS of the Login
Request PDU, the SCSI Command PDU, and the TMF Request PDU. For more information
about iSCSI QoS, readers are encouraged to consult IETF RFC 3720.
FC Flow Control
The primary flow-control mechanism used in modern FC-SANs (Class 3 fabrics) is the Buffer-to-Buffer_Credit (BB_Credit) mechanism. The BB_Credit mechanism provides link-level flow control. The FLOGI procedure informs the peer port of the number of
BB_Credits each N_Port and F_Port has available for frame reception. Likewise, the
Exchange Link Parameters (ELP) procedure informs the peer port of the number of
BB_Credits each E_Port has available for frame reception. Each time a port transmits
a frame, the port decrements the BB_Credit counter associated with the peer port. If
the BB_Credit counter reaches zero, no more frames may be transmitted until a
Receiver_Ready (R_RDY) primitive signal is received. Each time an R_RDY is received,
the receiving port increments the BB_Credit counter associated with the peer port. Each
time a port processes a received frame, the port transmits an R_RDY to the peer port.
The explicit, proactive nature of the BB_Credit mechanism ensures that no frames are ever dropped in FC-SANs because of link-level buffer overrun. However, line-rate throughput can be very difficult to achieve over long distances because of the high BB_Credit count requirement. Some of the line cards available for FC switches produced by Cisco Systems support thousands of BB_Credits on each port, thus enabling long-distance SAN extension.
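The credit accounting and the distance problem can both be sketched. The counter below models the behavior described above; the sizing formula is the usual rule of thumb (credits must cover the frames in flight during one round trip), not a quote from any specification.

```python
# Sketch of BB_Credit accounting and of the long-distance credit requirement.

class BBCreditCounter:
    """Track the peer's advertised credits; block transmission at zero."""
    def __init__(self, advertised: int):
        self.available = advertised
    def can_transmit(self) -> bool:
        return self.available > 0
    def on_frame_sent(self) -> None:
        self.available -= 1          # one credit consumed per frame sent
    def on_r_rdy_received(self) -> None:
        self.available += 1          # one credit returned per R_RDY

def credits_for_distance(distance_km: float, link_gbps: float,
                         frame_bytes: int = 2148) -> int:
    """Approximate credits needed to keep a link full over a given distance."""
    rtt_s = 2 * distance_km * 5e-6                        # ~5 us/km each way
    frames_in_flight = rtt_s * link_gbps * 1e9 / (frame_bytes * 8)
    return int(frames_in_flight) + 1

print(credits_for_distance(100, 4))  # ~233 credits for 4 Gbps over 100 km
```

The rapid growth of this number with distance and link rate is exactly why the high-credit line cards mentioned above matter for SAN extension.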
FC QoS
FC supports several QoS mechanisms via fields in the FC header. The DSCP subfield in the CS_CTL/Priority field can be used to implement differentiated services similar to the IP DiffServ model. However, the FC-FS-2 specification currently reserves all values other than zero, which is assigned to best-effort service. The Preference subfield in the CS_CTL/Priority field can be used to implement a simple two-level priority system. The FC-FS-2 specification requires all Class 3 devices to support the Preference subfield. No requirement exists for every frame within a sequence or Exchange to have the same preference value. So, it is theoretically possible for frames to be delivered out of order based on inconsistent values in the Preference fields of frames within a sequence or Exchange. However, this scenario is not likely to occur because all FC Host Bus Adapter (HBA) vendors recognize the danger in such behavior. The Priority subfield in the CS_CTL/Priority field can be used to implement a multi-level priority system. Again, no requirement exists for every frame within a sequence or Exchange to have the same priority value, so out-of-order frame delivery is theoretically possible (though improbable). The Preemption subfield in the CS_CTL/Priority field can be used to preempt a Class 1 or Class 6 connection to allow Class 3 frames to be forwarded. No modern FC switches support Class 1 or Class 6 traffic, so the Preemption subfield is never used. For more information about FC QoS, readers are encouraged to consult the ANSI T11 FC-FS-2 specification.
FCP QoS
FCP depends primarily on lower-layer protocols to provide QoS. However, FCP provides support for expedited command processing via the Priority field in the FCP_CMND IU header. For more information about FCP QoS, readers are encouraged to consult the ANSI T10 FCP-3 specification.
For more information about FCIP flow control, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-3 specification.
FCIP QoS
FCIP does not provide any QoS mechanisms of its own. However, RFC 3821 requires the FC Entity to specify the IP QoS characteristics of each new TCP connection to the FCIP Entity at the time that the TCP connection is requested. In doing so, no requirement exists for the FC Entity to map FC QoS mechanisms to IP QoS mechanisms. This may be optionally accomplished by mapping the value of the Preference subfield or the Priority subfield in the CS_CTL/Priority field of the FC header to an IntServ/RSVP request or a DiffServ DSCP value. FCIP links are not established dynamically in response to received FC frames, so the FC Entity needs to anticipate the required service levels prior to FC frame reception. One method to accommodate all possible FC QoS values is to establish one TCP connection for each of the seven traffic classes identified by the IEEE 802.1D-2004 specification. The TCP connections can be aggregated into one or more FCIP links, or each TCP connection can be associated with an individual FCIP link. The subsequent mapping of FC QoS values onto the seven TCP connections could then be undertaken in a proprietary manner. Many other techniques exist, and all are proprietary. For more information about FCIP QoS, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-3 specification.
Summary
This chapter reviews the flow-control and QoS mechanisms supported by Ethernet, IP, TCP, iSCSI, FC, FCP, and FCIP. As such, this chapter provides insight into network performance optimization. Application performance optimization requires attention to the flow-control and QoS mechanisms at each OSI Layer within each protocol stack.
Review Questions
1 What is the primary function of all flow-control mechanisms?
2 What are the two categories of QoS algorithms?
3 What is the name of the queue management algorithm historically associated with
tail-drop?
4 Which specification defines traffic classes, class groupings, and class-priority mappings?
5 What is the name of the first algorithm used for AQM in IP networks?
6 What are the names of the two dominant QoS models used in IP networks today?
7 What is the name of the TCP state variable that controls the amount of data that may
be transmitted?
8 What is the primary flow-control mechanism employed by iSCSI?
9 What are the names of the two QoS subfields currently available for use in FC-SANs?
10 What is the primary flow-control mechanism employed by FCP?
11 Are FCIP devices required to map FC QoS mechanisms to IP QoS mechanisms?
CHAPTER 10
In traditional IP and Ethernet terminology, the term routing describes the process of
forwarding Layer 3 packets, whereas the terms bridging and switching describe the process
of forwarding Layer 2 frames. This chapter uses the term switching instead of bridging
when discussing Layer 2 forwarding. In both cases, the forwarding process involves two
basic steps: path determination and frame/packet forwarding. Path determination (sometimes
called path selection) involves some type of table lookup to determine the correct egress
port or the next hop address. Frame/packet forwarding is the process of actually moving a
received frame or packet from the ingress port to a queue associated with the appropriate
egress port. Buffer management and scheduling algorithms then ensure the frame or packet
is serialized onto the wire. The basic difference between routing and switching is that routing uses Layer 3 addresses for path determination, whereas switching uses Layer 2 addresses.
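The distinction can be made concrete with two toy lookups. Switching matches a flat Layer 2 address exactly, while routing selects the longest matching Layer 3 prefix. The table contents below are illustrative.

```python
import ipaddress

# Contrast of the two path-determination lookups described above.

# Layer 2: exact match against a flat MAC forwarding table.
mac_table = {'00:1b:54:aa:bb:cc': 'port3'}

# Layer 3: longest-prefix match against a routing table (prefix -> next hop).
routes = [
    (ipaddress.ip_network('10.0.0.0/8'), '192.168.1.1'),
    (ipaddress.ip_network('10.1.0.0/16'), '192.168.1.2'),
]

def route_lookup(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]  # longest prefix wins

print(mac_table['00:1b:54:aa:bb:cc'])  # port3
print(route_lookup('10.1.2.3'))        # 192.168.1.2 (the /16 beats the /8)
```

The hierarchical structure of Layer 3 addresses is what makes the prefix match possible; a flat Layer 2 address space offers nothing to aggregate, so only exact matching works.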
FC is a switching technology, and FC addresses are Layer 2 constructs. Therefore, FC switches do not route frames according to the traditional definition of routing. That said, the ANSI T11 FC-SW series of specifications refers to switching functionality, based on the Fabric Shortest Path First (FSPF) protocol, as routing. FSPF must use FC addresses for path determination; therefore, FSPF is actually a switching protocol according to the traditional definition. Moreover, many FC switch vendors have recently begun to offer FC routing products. Although the architectures of such offerings are quite different, they all accomplish the same goal, which is to connect separate FC-SANs without merging them. Because all FC routing solutions must use FC addresses for path determination, all FC routing solutions are actually FC switching solutions according to the traditional definition. However, FSPF employs a link-state algorithm, which is traditionally associated with Layer 3 routing protocols. Additionally, all FC routing solutions provide functionality that is similar to inter-VLAN IP routing. So, these new definitions of the term routing in the context of FC-SANs are not so egregious.
When a source node injects a frame or packet into a network, the frame or packet consumes
network resources until it is delivered to the destination node. This is normal and does not
present a problem as long as network resources are available. Of course, the underlying
assumption is that each frame or packet will exit the network at some point in time. When
this assumption fails to hold true, network resources become fully consumed as new frames
or packets enter the network. Eventually, no new frames or packets can enter the network.
This scenario can result from routing loops that cause frames or packets to be forwarded
perpetually. For this reason, many routed protocols (such as IP) include a Time To Live (TTL) field (or equivalent) in the header. In the case of IP, the source node sets the TTL value, and each router decrements the TTL by one as part of the routing process. When the value in the TTL field reaches 0, the IP packet is discarded. This mechanism enables complex topologies in which loops might exist. However, even with the TTL mechanism, loops can cause problems. As the number of end nodes connected to an IP network grows, so does the network itself. The TTL mechanism limits network growth to the number of hops allowed by the TTL field. If the TTL limit is increased to enable network growth, the lifetime of looping packets is also increased. So, the TTL mechanism cannot solve the problem of loops in very large IP networks. Therefore, routing protocols that support complex topologies must implement other loop-suppression mechanisms to enable scalability. Even in small networks, the absence of a TTL mechanism requires the routing or switching protocol to suppress loops.
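The per-hop TTL handling can be sketched in a few lines. The function name and loop model are illustrative; the point is that the TTL bounds how long a looping packet survives, measured in hops.

```python
# Sketch of the per-hop TTL mechanism: each router decrements the TTL by one
# and discards the packet when it reaches zero.

def forward_through_loop(initial_ttl: int) -> int:
    """Return how many hops a packet survives inside a forwarding loop."""
    ttl, hops = initial_ttl, 0
    while ttl > 0:
        ttl -= 1          # one decrement per router traversed
        hops += 1
    return hops           # packet discarded when TTL reached 0

print(forward_through_loop(64))  # 64: the packet is discarded after 64 hops
```

Raising `initial_ttl` permits longer end-to-end paths but also lets looping packets circulate proportionally longer, which is the trade-off the text describes.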
NOTE
The TTL field in the IP header does not limit the time each packet can live. The lifetime of an IP packet is measured in hops rather than time.
Protocols that support hierarchical addressing can also support three types of broadcast traffic:
Local
Directed
All-networks
A local broadcast is sent to all nodes on the local network. A local broadcast contains the
local network address in the high-order portion of the destination address and the all-nodes
designator in the low-order portion of the destination address. A local broadcast is not
forwarded by routers.
A directed broadcast is sent to all nodes on a specic, remote network. A directed broadcast
contains the remote network address in the high-order portion of the destination address
and the all-nodes designator in the low-order portion of the destination address. A directed
broadcast is forwarded by routers in the same manner as a unicast packet until the broadcast
packet reaches the destination network.
An all-networks broadcast is sent to all nodes on all networks. An all-networks broadcast
contains the all-networks designator in the high-order portion of the destination address and
the all-nodes designator in the low-order portion of the destination address. An all-networks
broadcast is forwarded by routers. Because Ethernet addressing is flat, Ethernet supports only local broadcasts. IP addressing is hierarchical, but IP does not permit all-networks broadcasts. Instead, an all-networks broadcast (sent to IP address 255.255.255.255) is treated as a local broadcast. FC addressing is hierarchical, but the high-order portion of an FC address identifies a domain (an FC switch) rather than a network. So, the all-networks broadcast format equates to an all-domains broadcast format. FC supports only the all-domains broadcast format (no local or directed broadcasts). An all-domains broadcast is sent to D_ID 0xFF FF FF and is subject to zoning constraints (see Chapter 12, Storage Network Security).
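The hierarchical address structure behind local and directed broadcasts can be illustrated with IPv4. The network and addresses below are examples; the broadcast address simply combines the network portion with the all-ones (all-nodes) host portion.

```python
import ipaddress

# Sketch of the broadcast address formats described above, using IPv4 as the
# example of a hierarchical address space.

net = ipaddress.ip_network('192.168.10.0/24')

# Network part in the high-order bits, all-ones host part in the low-order bits.
# Sent on 192.168.10.0 itself this is a local broadcast; sent from another
# network toward 192.168.10.0 it is a directed broadcast.
broadcast = net.broadcast_address

# 255.255.255.255 would be the all-networks format, but IP treats it as local.
limited = ipaddress.ip_address('255.255.255.255')

print(broadcast)  # 192.168.10.255
print(limited)    # 255.255.255.255
```

The same high-order/low-order split is what FC reuses, except that the high-order bits name a domain (a switch) rather than a network.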
NOTE
Some people consider broadcast and multicast to be variations of the same theme. In that
context, a broadcast is a simplified version of a multicast.
Each routing protocol is generally considered to be either a distance vector protocol (such
as Routing Information Protocol [RIP]) or a link-state protocol (such as Open Shortest Path
First [OSPF]). With a distance vector protocol, each router advertises its routing table to its
neighbor routers. Initially, the only entries in a router's routing table are the networks to which the router is directly connected. Upon receipt of a distance vector advertisement, each receiving router updates its own routing table and then propagates its routing table to its neighbor routers. Thus, each router determines the best path to a remote network based on information received from neighbor routers. This is sometimes called routing by rumor because each router must make forwarding decisions based on unverified information.
By contrast, a router using a link-state protocol sends information about only its own interfaces. Such an advertisement is called a Link State Advertisement (LSA). Upon receipt of an LSA, each receiving router copies the information into a link-state database and then forwards the unmodified LSA to its neighbor routers. This process is called flooding. Thus, each router makes forwarding decisions based on information that is known to be accurate because the information is received from the actual source router.
In short, distance vector protocols advertise the entire routing table to adjacent routers only, whereas link-state protocols advertise only directly connected networks to all other routers. Distance vector protocols have the benefit of comparatively low processing overhead on routers, but advertisements can be comparatively large. Link-state protocols have the benefit of comparatively small advertisements, but processing overhead on routers can be comparatively high. Some routing protocols incorporate aspects of both distance vector and link-state protocols. Such protocols are called hybrid routing protocols. Other variations also exist.
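A single "routing by rumor" step can be sketched as follows. The topology, costs, and function name are illustrative; the update rule (keep any route that is cheaper through the advertising neighbor) is the core of the distance vector approach.

```python
# Sketch of one distance vector update: on receiving a neighbor's table, keep
# any route whose cost through that neighbor is lower than the current cost.

def dv_update(my_table: dict, neighbor_table: dict, link_cost: int) -> dict:
    """Merge a neighbor's advertised distances into the local routing table."""
    updated = dict(my_table)
    for network, distance in neighbor_table.items():
        candidate = distance + link_cost       # cost via this neighbor
        if network not in updated or candidate < updated[network]:
            updated[network] = candidate
    return updated

mine = {'10.0.0.0/8': 0}                          # directly connected
neighbor = {'10.0.0.0/8': 0, '172.16.0.0/16': 2}  # the neighbor's full table
print(dv_update(mine, neighbor, link_cost=1))
# {'10.0.0.0/8': 0, '172.16.0.0/16': 3}
```

Note that the router accepts the distance to 172.16.0.0/16 without any way to verify it, which is exactly the "unverified information" weakness described above; a link-state router would instead flood and receive raw topology data and compute the path itself.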
Routing protocols also are categorized as interior or exterior. An interior protocol is called
an Interior Gateway Protocol (IGP), and an exterior protocol is called an Exterior Gateway
Protocol (EGP). IGPs facilitate communication within a single administrative domain, and
EGPs facilitate communication between administrative domains. An administrative domain
can take the form of a corporation, an Internet Service Provider (ISP), a division within a
government, and so on. Each administrative domain is called an Autonomous System (AS).
Routing between Autonomous Systems is called inter-AS routing. To facilitate inter-AS
routing on the Internet, IANA assigns a globally unique AS number to each AS.
RSTP is a variation of the distance vector model. Each switch learns the location of MAC
addresses by inspecting the Source Address field in the header of Ethernet frames received
on each switch port. Thus, RSTP operation is completely transparent to end nodes. The
learned addresses are entered into a forwarding table that associates each address with an
egress port (the port on which the address was learned). No information is stored regarding
the distance to each address, so RSTP is not a distance vector protocol. No information is exchanged
between switches regarding the reachability of MAC addresses. However, switches do
exchange information about the physical topology so that loop suppression may occur. In
multi-VLAN environments, the Multiple Spanning Tree Protocol (MSTP) may be used.
MSTP is a variation of RSTP that enables an independent spanning tree to be established
within each VLAN on a common physical network.
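The transparent learning behavior described above can be sketched in a few lines. This is a simplified, hypothetical model: the class, port numbers, and addresses are invented, and real switches implement learning in hardware with aging timers and per-VLAN tables.

```python
# Illustrative sketch of transparent MAC learning: the switch associates
# each source address with its ingress port and floods frames destined
# to unknown or broadcast addresses. All names are invented.

BROADCAST = "ff:ff:ff:ff:ff:ff"

class LearningSwitch:
    def __init__(self, ports):
        self.ports = set(ports)
        self.fib = {}  # MAC address -> egress port

    def receive(self, src, dst, ingress):
        self.fib[src] = ingress  # learn the source on the ingress port
        if dst == BROADCAST or dst not in self.fib:
            return sorted(self.ports - {ingress})  # flood, excluding ingress
        return [self.fib[dst]]  # known destination: one egress port

sw = LearningSwitch(ports=[1, 2, 3])
sw.receive("aa", BROADCAST, ingress=1)  # broadcast: flooded to ports 2 and 3
sw.receive("bb", "aa", ingress=2)       # unicast reply: forwarded to port 1 only
```

The flooding branch is what makes the loop problem discussed next so dangerous: without loop suppression, flooded frames can circulate forever.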
Because none of the Ethernet header formats include a TTL field (or equivalent), Ethernet
frames can be forwarded indefinitely in topologies that have one or more loops. For
example, when a broadcast frame (such as an ARP request) is received by a switch, the
switch forwards the frame via all active ports except for the ingress port. If a loop
exists in the physical topology, the forwarded broadcast frame eventually returns to the
same switch and is forwarded again. Meanwhile, new broadcast frames are generated by
the attached nodes. Those frames are forwarded in the same manner as the first broadcast
frame. This cycle continues until all available bandwidth on all active ports is fully consumed
by re-circulating broadcast frames. This phenomenon is called a broadcast storm. RSTP
suppresses all loops in the physical topology to prevent broadcast storms and other congestion-related
failures. The resulting logical topology is a tree that spans the network to facilitate connectivity
between all attached nodes. Connectivity is always symmetric, which means that frames
exchanged between a given pair of end nodes always traverse the same path in both directions.
For more information about Ethernet switching protocols, readers are encouraged to
consult the IEEE 802.1Q-2003 and 802.1D-2004 specifications.
IP Routing Protocols
IP supports a broad variety of IGPs. In the distance vector category, IP supports the Routing
Information Protocol (RIP) and the Interior Gateway Routing Protocol (IGRP). In the
hybrid category, IP supports the Enhanced Interior Gateway Routing Protocol (EIGRP).
In the link-state category, IP supports the Open Shortest Path First (OSPF) protocol and
the Integrated Intermediate System to Intermediate System (Integrated IS-IS) protocol.
IP also supports two EGPs: the Exterior Gateway Protocol (EGP) and the Border Gateway
Protocol (BGP).
RIP is the original distance vector protocol. RIP and its successor, RIP version 2 (RIPv2),
enjoyed widespread use for many years. Today, RIP and RIPv2 are mostly historical. RIP
employs classful routing based on classful IP addresses. RIP distributes routing updates via
broadcast. RIPv2 enhances RIP by supporting classless routing based on variable-length
subnet masking (VLSM) methodologies. Other enhancements include the use of multicast
for routing update distribution and support for route update authentication. Both RIP and
RIPv2 use hop count as the routing metric and support load balancing across equal-cost
paths. RIP and RIPv2 are both IETF standards. For more information about classful/classless
routing and VLSM, see Chapter 6, "The OSI Network Layer."
IGRP is a Cisco Systems proprietary protocol. IGRP was developed to overcome the
limitations of RIP. The most notable improvement is IGRP's use of a composite metric that
considers the delay, bandwidth, reliability, and load characteristics of each link.
Additionally, IGRP expands the maximum network diameter to 255 hops versus the 15-hop
maximum supported by RIP and RIPv2. IGRP also supports load balancing across unequal-cost
paths. IGRP is mostly historical today.
EIGRP is another Cisco Systems proprietary protocol. EIGRP significantly enhances IGRP.
Although EIGRP is often called a hybrid protocol, it advertises routing-table entries to
adjacent routers just like distance vector protocols. However, EIGRP supports several
features that differ from typical distance vector protocols. Among these are partial table updates
(as opposed to full table updates), change triggered updates (as opposed to periodic
updates), scope sensitive updates sent only to affected neighbor routers (as opposed to blind
updates sent to all neighbor routers), a diffusing computation system that spreads the
route calculation burden across multiple routers, and support for bandwidth throttling to
control protocol overhead on low-bandwidth WAN links. EIGRP is a classless protocol that
supports route summaries for address aggregation, load balancing across unequal-cost
paths, and route update authentication. Though waning in popularity, EIGRP is still in
use today.
OSPF is another IETF standard protocol. OSPF was originally developed to overcome
the limitations of RIP. OSPF is a classless protocol that employs Dijkstra's Shortest Path
First (SPF) algorithm, supports equal-cost load balancing, supports route summaries for
address aggregation, and supports authentication. To promote scalability, OSPF supports
the notion of areas. An OSPF area is a collection of OSPF routers that exchange LSAs only
with each other. In other words, LSA flooding does not traverse area boundaries. This reduces the number
of LSAs that each router must process and reduces the size of each router's link-state
database. One area is designated as the backbone area through which all inter-area
communication flows. Each area has one or more Area Border Routers (ABRs) that
connect the area to the backbone area. Thus, OSPF implements a two-level hierarchical
topology. All inter-area routes are calculated using a distance-vector algorithm. Despite
this fact, OSPF is not widely considered to be a hybrid protocol. OSPF is very robust and
is in widespread use today.
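OSPF's intra-area route calculation can be illustrated with a minimal Dijkstra SPF sketch over a toy link-state database. The router names and costs below are invented; a real OSPF implementation operates on LSAs, areas, and interface costs rather than this simplified graph.

```python
# Minimal Dijkstra SPF sketch of the kind OSPF runs over its link-state
# database. The topology and costs are invented for illustration.
import heapq

def spf(graph, source):
    """graph: {node: {neighbor: cost}}; returns {node: total cost from source}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, link_cost in graph[node].items():
            candidate = cost + link_cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

lsdb = {
    "R1": {"R2": 10, "R3": 5},
    "R2": {"R1": 10, "R3": 2},
    "R3": {"R1": 5, "R2": 2},
}
# spf(lsdb, "R1") -> {"R1": 0, "R3": 5, "R2": 7}  (R2 is cheaper via R3)
```

Because every router floods identical LSAs, every router builds the same database and computes consistent shortest-path trees rooted at itself.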
IS-IS was originally developed by Digital Equipment Corporation (DEC). IS-IS was later
adopted by the ISO as the routing protocol for its Connectionless Network Protocol
(CLNP). At one time, many people believed that CLNP eventually would replace IP. So,
an enhanced version of IS-IS was developed to support CLNP and IP simultaneously.
The enhanced version is called Integrated IS-IS. In the end, the IETF adopted OSPF as
its official IGP. OSPF and Integrated IS-IS have many common features. Like OSPF,
Integrated IS-IS is a classless protocol that employs Dijkstra's SPF algorithm, supports equal-cost
load balancing, supports route summaries for address aggregation, supports authentication, and supports a two-level hierarchical topology. Some key differences also exist.
For example, Integrated IS-IS uses the Dijkstra algorithm to compute inter-area routes.
EGP was the first exterior protocol. Due to EGP's many limitations, many people consider
EGP to be a reachability protocol rather than a full routing protocol. EGP is mostly historical
today. From EGP evolved BGP. BGP has since evolved from its first implementation into
BGP version 4 (BGP-4). BGP-4 is widely used today. Many companies run BGP-4 on their
Autonomous System Border Routers (ASBRs) for connectivity to the Internet. Likewise,
many ISPs run BGP-4 on their ASBRs to communicate with other ISPs. Although BGP-4 is
widely considered to be a hybrid protocol, it advertises routing table entries to other
BGP-4 routers just like distance vector protocols. However, a BGP-4 route is the list of AS
numbers (called the AS_Path) that must be traversed to reach a given destination. Thus,
BGP-4 is called a path vector protocol. Also, BGP-4 runs over TCP. Each BGP-4 router
establishes a TCP connection to another BGP-4 router (called a BGP-4 peer) based on
routing policies that are administratively configured. Using TCP relaxes the requirement
for BGP-4 peers to be topologically adjacent. Connectivity between BGP-4 peers often
spans an entire AS that runs its own IGP internally. A TCP packet originated by a BGP-4
router is routed to the BGP-4 peer just like any other unicast packet. BGP-4 is considered
a policy-based routing protocol because the protocol behavior can be fully controlled via
administrative policies. BGP-4 is a classless protocol that supports equal-cost load
balancing and authentication.
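The path vector behavior can be sketched roughly as follows. This is a deliberate simplification: real BGP-4 best-path selection considers many attributes beyond AS_Path length, and the function and AS numbers below are invented for illustration.

```python
# Hypothetical path vector sketch: a BGP-4 speaker rejects any route whose
# AS_Path already contains its own AS number (loop prevention) and, all
# else being equal, prefers the shortest AS_Path. Real BGP-4 policy is
# far richer than this.

def accept_route(local_as, current_path, advertised_path):
    """Return the AS_Path the speaker should keep after one advertisement."""
    if local_as in advertised_path:
        return current_path  # our own AS is in the path: loop, discard
    if current_path is None or len(advertised_path) < len(current_path):
        return advertised_path  # first route, or shorter AS_Path: prefer it
    return current_path

path = None
path = accept_route(65001, path, [65002, 65010, 65020])  # accepted
path = accept_route(65001, path, [65003, 65020])         # shorter: preferred
path = accept_route(65001, path, [65004, 65001, 65020])  # loop: ignored
```

The loop check is why AS_Path doubles as both a metric and a loop-suppression mechanism, removing the need for a TTL-style hop limit between Autonomous Systems.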
NOTE
BGP-4's use of TCP can be confusing. How can a routing protocol operate at OSI Layer 3
and use TCP to communicate? The answer is simple. For a router to operate, it must gather
information from peer routers. Various mechanisms exist for gathering such information.
Once the information is gathered, the subsequent functions of path determination and
packet forwarding are executed at OSI Layer 3. In the case of BGP-4, the peer communication
function leverages TCP, but AS_Path creation and packet forwarding are executed at OSI
Layer 3.
The sheer volume of information associated with IP routing protocols can be very intimidating
to someone who is new to IP networking. For more information on IP routing protocols,
readers can consult the numerous IETF RFCs in which the protocols are defined and
enhanced. Alternately, readers can consult one of the many books written about this subject.
A very comprehensive analysis of all IP routing protocols is available in the two-volume
set by Jeff Doyle entitled Routing TCP/IP, volumes I and II.
FC Switching Protocols
FSPF is the protocol used for routing within an FC-SAN. FSPF is a link-state protocol, but
each vendor may choose which link-state algorithm to use. Dijkstra's algorithm is the most
widely known link-state algorithm, but the ANSI T11 FC-SW-4 specification does not
require the use of Dijkstra's algorithm. Well-defined interactions between switches ensure
interoperability even if multiple link-state algorithms are used within a single FC-SAN.
Additionally, the FC-SW-4 specification neither requires nor precludes the ability to load-balance,
so FC switch vendors may choose whether and how to implement such functionality.
In most other respects, FSPF is similar to other link-state protocols.
Routing between FC-SANs is currently accomplished via proprietary techniques. ANSI
recently began work on standardization of inter-fabric routing, but no such standards
currently exist. Each FC switch vendor currently takes a different approach to solving this
problem. All these approaches fall into one of two categories: integrated or appliance-based.
The integrated approach is conceptually similar to Layer 3 switching, wherein the
ASICs that provide Layer 2 switching functionality also provide inter-fabric routing
functionality as configured by administrative policy. The appliance-based approach
requires the use of an external device that physically connects to each of the FC-SANs
between which frames need to be routed. This approach is conceptually similar to the
original techniques used for inter-VLAN IP routing (often called the router-on-a-stick
model).
Summary
This chapter introduces readers to the routing and switching protocols used by Ethernet, IP,
and FC. The traditional and new definitions of the terms switching and routing are discussed.
The issues related to topological loops are discussed. The three types of broadcast traffic
are discussed: local, directed, and all-networks. Multicast is not discussed. The categories
of routing protocols are discussed: distance vector versus link state, and interior versus
exterior. RSTP and MSTP are discussed in the context of Ethernet. RIP, IGRP, EIGRP,
OSPF, Integrated IS-IS, EGP, and BGP-4 are discussed in the context of IP. FSPF is
discussed in the context of FC.
Review Questions
1 What are the two basic steps that every switching and routing protocol performs?
2 What is the primary purpose of a TTL mechanism?
3 Does IP permit all-networks broadcasts?
List the load-balancing techniques and mechanisms supported by Ethernet, IP, Fibre
Channel (FC), Internet Small Computer System Interface (iSCSI), Fibre Channel
Protocol (FCP), and Fibre Channel over TCP/IP (FCIP)
CHAPTER
11
Load Balancing
This chapter provides a brief introduction to the principles of load balancing. We briefly
discuss the load-balancing functionality supported by Ethernet, IP, FC, iSCSI, FCP, and
FCIP, and the techniques employed by end nodes.
IP Load Balancing
Each IP routing protocol defines its own rules for load balancing. Most IP routing protocols
support load balancing across equal-cost paths, while some support load balancing across
equal- and unequal-cost paths. While unequal-cost load balancing is more efficient in its use
of available bandwidth, most people consider unequal-cost load balancing to be more
trouble than it is worth. The comparatively complex nature of unequal-cost load balancing
makes configuration and troubleshooting more difficult. In practice, equal-cost load
balancing is almost always preferred.
The router architecture and supported forwarding techniques also affect how traffic is
load-balanced. For example, Cisco Systems routers can load balance traffic on a simple
round-robin basis or on a per-destination basis. The operating mode of the router and its interfaces
determines which load-balancing behavior is exhibited. When process switching is
configured, each packet is forwarded based on a route table lookup. The result is round-robin
load balancing when multiple equal-cost paths are available. Alternately, route table
lookup information can be cached on interface cards so that only one route table lookup is
required per destination IP address. Each subsequent IP packet sent to a given destination
IP address is forwarded on the same path as the first packet forwarded to that IP address.
Note that the source IP address is not relevant to the forwarding decision.
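The difference between the two behaviors can be sketched as follows, assuming two equal-cost paths. The helper names are invented; real routers implement the per-destination cache in hardware or interface microcode, not in application code.

```python
# Sketch of round-robin versus per-destination path selection over two
# equal-cost paths. All names here are invented for illustration.
import itertools

paths = ["path-A", "path-B"]

def round_robin():
    """Process switching: every packet triggers a fresh path choice."""
    counter = itertools.cycle(paths)
    return lambda dst: next(counter)

def per_destination():
    """Cached forwarding: one path choice per destination IP address."""
    cache = {}
    counter = itertools.cycle(paths)
    def forward(dst):
        if dst not in cache:
            cache[dst] = next(counter)  # first packet populates the cache
        return cache[dst]
    return forward

fwd = per_destination()
# fwd("10.0.0.1") returns the same path for every packet to that destination
```

Per-destination caching trades perfectly even distribution for lower lookup overhead and per-flow path stability.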
Each IP routing protocol determines the cost of a path using its own metric. Thus, the best
path from host A to host B might be different for one routing protocol versus another.
Likewise, one routing protocol might determine two or more equal cost paths exists
between host A and host B, while another routing protocol might determine only one best
cost path exists. So, the ability to load-balance is somewhat dependent upon the choice of
routing protocol. When equal-cost paths exist, administrators can congure the number of
paths across which trafc is distributed for each routing protocol.
A complementary technology, called the Virtual Router Redundancy Protocol (VRRP), is
defined in IETF RFC 3768. VRRP evolved from Cisco Systems proprietary technology
called Hot Standby Router Protocol (HSRP). VRRP enables a virtual IP address to be
used as the IP address to which end nodes transmit trafc (the default gateway address).
Each virtual IP address is associated with a floating Media Access Control (MAC)
address.
VRRP implements a distributed priority mechanism that enables multiple routers to
potentially take ownership of the virtual IP address and floating MAC address. The router with
the highest priority owns the virtual IP address and floating MAC address. That router
processes all traffic sent to the floating MAC address. If that router fails, the router with the
next highest priority takes ownership of the virtual IP address and floating MAC address.
VRRP can augment routing protocol load-balancing functionality by distributing end nodes
across multiple routers. For example, assume that an IP subnet containing 100 hosts has two
routers attached via interface A. Two VRRP addresses are configured for interface A in each
router. The first router has the highest priority for the first VRRP address and the lowest
priority for the second VRRP address. The second router has the highest priority for the
second VRRP address and the lowest priority for the first VRRP address. The first 50 hosts
are configured to use the first VRRP address as their default gateway. The other 50 hosts are
configured to use the second VRRP address as their default gateway. This configuration
enables half the traffic load to be forwarded by each router. If either router fails, the other
router assumes ownership of the failed router's VRRP address, so none of the hosts are
affected by the router failure.
The Gateway Load Balancing Protocol (GLBP) augments VRRP. GLBP is currently
proprietary to Cisco Systems. Load balancing via VRRP requires two or more default
gateway addresses to be configured for a single subnet. That requirement increases
administrative overhead associated with Dynamic Host Configuration Protocol (DHCP)
configuration and static end node addressing. Additionally, at least one IP address per router
FC Load Balancing
As discussed in Chapter 5, "The OSI Physical and Data Link Layers," FC supports the
aggregation of multiple physical links into a single logical link (an FC port channel).
Because all FC link aggregation schemes are currently proprietary, the load-balancing
algorithms are also proprietary. In FC, the load-balancing algorithm is of crucial
importance because it affects in-order frame delivery. Not all FC switch vendors support
link aggregation. Each of the FC switch vendors that support link aggregation currently
implements one or more load-balancing algorithms. Cisco Systems offers two algorithms.
The default algorithm uses the source Fibre Channel Address Identifier (FCID), destination
FCID, and Originator Exchange ID (OX_ID) to achieve load balancing at the granularity
of an I/O operation. This algorithm ensures that all frames within a sequence and all
sequences within an exchange are delivered in order across any distance. This algorithm
also improves link utilization within each port channel. However, this algorithm does not
guarantee that exchanges will be delivered in order. The second algorithm uses only the
source FCID and destination FCID to ensure that all exchanges are delivered in order.
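The exchange-based algorithm can be sketched as a hash over the three fields. The CRC-based hash below is an invented stand-in; the actual hash function used by any vendor's hardware is not described in this text, but the pinning behavior it illustrates is the point.

```python
# Sketch of exchange-granularity link selection in a port channel:
# hashing the source FCID, destination FCID, and OX_ID pins each
# exchange to one member link, so all frames of that exchange stay in
# order. The hash itself is an invented placeholder.
import zlib

def select_link(members, src_fcid, dst_fcid, ox_id=None):
    key = f"{src_fcid}:{dst_fcid}:{ox_id}".encode()
    return members[zlib.crc32(key) % len(members)]

links = ["link-0", "link-1"]

# The same exchange always maps to the same link; different OX_IDs for the
# same node pair may spread across members, improving utilization.
a = select_link(links, 0x010203, 0x040506, ox_id=0x1234)
b = select_link(links, 0x010203, 0x040506, ox_id=0x1234)
assert a == b
```

Omitting OX_ID from the key (passing `ox_id=None` for every frame between a node pair) reproduces the second algorithm: all exchanges between a pair follow one link, so even exchange ordering is preserved.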
As previously stated, load balancing via Fabric Shortest Path First (FSPF) is currently
accomplished in a proprietary manner. So, each FC switch vendor implements FSPF load
balancing differently. FC switches produced by Cisco Systems support equal-cost load
balancing across 16 paths simultaneously. Each path can be a single ISL or multiple ISLs
aggregated into a logical ISL. When multiple equal-cost paths are available, FC switches
produced by Cisco Systems can be configured to perform load balancing based on the
source FCID and destination FCID or the source FCID, destination FCID, and OX_ID.
Similar to Ethernet, FC supports independent configuration of FSPF link costs in each
Virtual Storage Area Network (VSAN). This enables FC-SAN administrators to optimize
ISL bandwidth utilization. The same design principles that apply to Ethernet also apply to
FC when using this technique.
NOTE
By placing the network portals of a multihomed target device in different Internet Storage
Name Service (iSNS) Discovery Domains (DDs), iSNS can be used to facilitate load
balancing. For example, a network entity containing two SCSI target nodes (nodes A and
B) and two NICs (NICs A and B) may present both nodes to initiators via both NICs.
Selective DD assignment can force initiator A to access target node A via NIC A while
forcing initiator B to access node B via NIC B.
across the available paths. For example, all I/O for LUN 13 traverses path A, and path B is the
backup path for LUN 13. Simultaneously, all I/O for LUN 17 traverses path B, and path A is
the backup path for LUN 17. Obviously, the LUN-to-path mapping approach enables load
balancing only when multiple LUNs are being accessed simultaneously. When a host uses the
SCSI command distribution approach, all I/O operations initiated to all LUNs are distributed
across all available paths. The DMP software establishes a session via each available path
and then determines which LUNs are accessible via each session. For any given SCSI
command, the session is selected by the configured DMP algorithm. Some algorithms simply
perform round-robin distribution. Other algorithms attempt to distribute commands based on
real-time utilization statistics for each session. Many other algorithms exist. Note that optimal
use of all available paths requires LUN access via each path, which is controlled by the storage
array configuration. The SCSI command distribution approach enables load balancing even if
the host is accessing only one LUN.
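A round-robin variant of the command distribution approach might look like the following sketch. The class and session names are invented; production DMP drivers also track per-path utilization, queue depth, and path health before dispatching.

```python
# Hypothetical sketch of the SCSI command distribution approach: a DMP
# driver spreads commands for any LUN across all available sessions in
# round-robin fashion. All names are invented for illustration.
import itertools

class RoundRobinDmp:
    def __init__(self, sessions):
        self._cycle = itertools.cycle(sessions)

    def dispatch(self, command):
        session = next(self._cycle)  # next session, regardless of LUN
        return (session, command)

dmp = RoundRobinDmp(["session-A", "session-B"])
dmp.dispatch("READ lun=13")   # first command goes to session-A
dmp.dispatch("WRITE lun=13")  # next goes to session-B, even for the same LUN
```

Because the rotation ignores the LUN, a host working against a single LUN still exercises every path, which is the key advantage over the LUN-to-path mapping approach.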
Summary
This chapter briefly introduces readers to the principles of load balancing. The goals of load
balancing and some common terminology are discussed. Ethernet port channeling and MSTP
configuration options are reviewed. Routing protocols, VRRP, and GLBP are examined in
the context of IP. FC port channeling, FSPF, and per-VSAN FSPF configuration options are
covered. iSCSI connection-level and session-level techniques are explored. FCP session-level
load balancing is compared to iSCSI. FCIP connection-level and tunnel-level options are
discussed. Finally, end node configurations using DMP software are discussed.
Review Questions
1 What is the primary goal of load balancing?
2 Which protocol enables network administrators to modify link costs independently
Transport Layer?
6 Does FCP support Transport Layer load balancing?
7 Does FCIP support Exchange-based load balancing across TCP connections within an
FCIP link?
8 Name the two techniques commonly used by end nodes to load-balance across DMP
paths.
Describe the features of the most common AAA protocols and management protocols
Explain the benefits of Role Based Access Control (RBAC) and Authentication,
Authorization, and Accounting (AAA)
Enumerate the security services supported by Ethernet, IP, TCP, Internet Small
Computer System Interface (iSCSI), Fibre Channel (FC), Fibre Channel Protocol
(FCP), and Fibre Channel over TCP/IP (FCIP)
CHAPTER
12
Data origin authentication is the service that verifies that each message actually
originated from the source claimed in the header.
Data integrity is the service that detects modifications made to data while in flight.
Data integrity can be implemented as connectionless or connection-oriented.
Anti-replay protection is the service that detects the arrival of duplicate packets
within a given window of packets or a bounded timeframe.
reference model and functional specification defined in the 359-2004 standard are broadly
applicable to many environments, various organizations outside of ANSI are working on
RBAC standards for specic environments that have specialized requirements. Today, most
information technology vendors support RBAC in their products.
RBAC is complemented by a set of technologies called authentication, authorization, and
accounting (AAA). AAA is implemented as a client/server model in which all security
information is centrally stored and managed on an AAA server. The devices under
management act as clients to the AAA server by relaying user credentials and access
requests to the AAA server. The AAA server replies authoritatively to the managed devices.
The user is granted or denied access based on the AAA servers reply. The traditional
alternate is to create, store, and manage user identication and password information on
each managed device (a distributed model). The AAA model requires signicantly less
administration than the distributed model. AAA is also inherently more secure because the
central database can be protected by physical security measures that are not practical to
implement in most distributed environments. Consequently, AAA is currently deployed
in most large organizations. Many AAA products are available as software-only solutions
that run on every major operating system. As its name suggests, AAA provides three
services. The authentication service verifies the identification of each user or device. The
authorization service dynamically grants access to network and compute resources based
on a preconfigured access list associated with the user's credentials. This enables granular
control of who can do what rather than granting each authenticated user full access to all
resources. Authorization is handled transparently, so the user experience is not tedious. The
accounting service logs actions taken by users and devices. Some AAA servers also support
the syslog protocol and integrate syslog messages into the accounting log for consolidated
logging. The accounting service can log various data including the user's ID, the source
IP address, protocol numbers, TCP and UDP port numbers, time and date of access, the
commands executed and services accessed, the result of each attempt (permitted or denied),
and the location of access. The accounting service enables many applications such as
customer billing, suspicious activity tracing, utilization trending, and root cause analysis.
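The three services can be sketched as a toy model. Every name, credential, and permission below is invented for illustration, and a real AAA server would never store plaintext passwords or keep its database in application memory.

```python
# Illustrative model of the three AAA services against a centralized
# database. All data here is invented; real deployments use hashed
# credentials and a protocol such as RADIUS or TACACS+ on the wire.

USER_DB = {
    "alice": {"password": "s3cret", "permit": {"show", "configure"}},
    "bob":   {"password": "hunter2", "permit": {"show"}},
}

def authenticate(user, password):
    """Authentication: verify the identity of the user."""
    entry = USER_DB.get(user)
    return entry is not None and entry["password"] == password

def authorize(user, command):
    """Authorization: check the access list tied to the user's credentials."""
    return command in USER_DB[user]["permit"]

accounting_log = []

def account(user, command, permitted):
    """Accounting: log each attempt and its result."""
    accounting_log.append((user, command, "permitted" if permitted else "denied"))

if authenticate("bob", "hunter2"):
    ok = authorize("bob", "configure")  # denied: bob may only "show"
    account("bob", "configure", ok)
```

Note how authorization is consulted per command rather than once at login, which is what the text means by granular control of who can do what.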
AAA Protocols
Communication between an AAA client and an AAA server occurs using one of several
protocols. The most prevalent AAA protocols are the Remote Authentication Dial In User
Service (RADIUS) and the Terminal Access Controller Access Control System Plus
(TACACS+). A third protocol, Kerberos, is used for authentication in server environments.
A fourth protocol, Secure Remote Password (SRP), is leveraged by many application
protocols as a substitute for native authentication procedures.
RADIUS is defined in IETF RFC 2865, and RADIUS source code is freely distributed. As its
name implies, RADIUS was originally implemented to authenticate remote users trying to
access a LAN via analog modem connections. Remote users dial into a Network Access
Server (NAS), which relays the user's credentials to a RADIUS server. Thus, the NAS (not
the user) is the RADIUS client. RADIUS is still used for remote user authentication, but
RADIUS is now commonly used for other authentication requirements, too. For example,
network administrators are often authenticated via RADIUS when accessing routers and
switches for management purposes. To prevent unauthorized access to the RADIUS database,
RADIUS client requests are authenticated by the RADIUS server before the user's credentials
are processed. RADIUS also encrypts user passwords prior to transmission. However, other
information (such as user ID, source IP address, and so on) is not encrypted. RADIUS
implements authentication and authorization together. When a RADIUS server replies to a
client authentication request, authorization information is included in the reply. A RADIUS
server can reply to a client request or relay the request to another RADIUS server or other
type of authentication server (such as Microsoft Active Directory). Communication between
client and server is accomplished via variable-length keys in the form of Attribute-Length-Value.
This enables new attributes to be defined to extend RADIUS functionality without
affecting existing implementations. Note that RADIUS uses UDP (not TCP). Although the
decision to use UDP is justified by a variety of reasons, it sometimes causes a network or
security administrator to choose a different AAA protocol.
TACACS began as a protocol for authenticating remote users trying to access the
ARPANET via analog modem connections. TACACS is defined in IETF RFC 1492.
TACACS was later augmented by Cisco Systems. The proprietary augmentation is called
Extended TACACS (XTACACS). Cisco subsequently developed the TACACS+ protocol
based on TACACS and XTACACS. However, TACACS+ is a significantly different
protocol and is incompatible with TACACS and XTACACS. Cisco Systems has deprecated
TACACS and XTACACS in favor of TACACS+. Similar to RADIUS, the TACACS+ client
(NAS, router, switch, and others) relays the user's credentials to a TACACS+ server. Unlike
RADIUS, TACACS+ encrypts the entire payload of each packet (but not the TACACS+
header). Thus, TACACS+ is considered more secure than RADIUS. TACACS+ supports
authentication, authorization, and accounting functions separately. So, any combination of
services can be enabled via TACACS+. TACACS+ provides a more granular authorization
service than RADIUS, but the penalty for this granularity is increased communication
overhead between the TACACS+ client and server. Another key difference between
TACACS+ and RADIUS is that TACACS+ uses TCP, which makes TACACS+ more
attractive than RADIUS to some network and security administrators.
Kerberos was originally developed by the Massachusetts Institute of Technology (MIT) in the
mid 1980s. The most recent version is Kerberos V5 as defined in IETF RFC 4120. Kerberos
V5 is complemented by the Generic Security Services API (GSS-API) defined in IETF
RFC 4121. Kerberos provides an encrypted authentication service using shared secret keys.
Kerberos can also support authentication via public key cryptography, but this is not covered
by RFC 4120. Kerberos does not provide an authorization service, but Kerberos does support
pass-through to other authorization services. Kerberos does not provide an accounting service.
Another popular authentication protocol is the SRP protocol as defined in IETF RFC 2945.
SRP provides a cryptographic authentication mechanism that can be integrated with a broad
variety of existing Internet application protocols. For example, IETF RFC 2944 defines an
SRP authentication option for Telnet. SRP implements a secure key exchange that enables
additional protection such as data integrity and data confidentiality.
Management Protocols
The Simple Network Management Protocol (SNMP) is currently the most widely used
management protocol. Early versions of SNMP restrict management access via community
strings. A community string is specified by a management host (commonly called a management
station) when connecting to a managed device. The managed device grants the management
station access to configuration and state information based on the permissions associated with
the specified community string. Community strings may be configured to grant read-only or
read-write access on the managed device. Early versions of SNMP transmit community strings
as clear text strings (said to be in the clear). SNMP version 3 (SNMPv3) replaces the
community string model with a user-based security model (USM). SNMPv3 provides user
authentication and data confidentiality. IETF RFC 3414 defines the USM for SNMPv3.
The Telnet protocol is very old. It is a staple among IP-based application protocols. Telnet
was originally defined through a series of IETF RFCs in the 1970s. The most current
Telnet specification is RFC 854. Telnet enables access to the command line interface (CLI)
of remote devices. Unfortunately, Telnet operates in the clear and is considered insecure.
Multiple security extensions have been defined for Telnet via a large number of RFCs.
Telnet now supports strong authentication and encryption options. A suite of Unix commands
(collectively called the R-commands) provides similar functionality to Telnet, but the suite
of R-commands operates in the clear and is considered insecure. The suite includes Remote
Login (RLOGIN), Remote Shell (RSH), Remote Command (RCMD) and Remote Copy
(RCP) among other commands. Another protocol called Secure Shell (SSH) was developed
by the open source community in the late 1990s to overcome the security limitations of
Telnet and the suite of R-commands. The most commonly used free implementation of SSH
is the OpenSSH distribution. SSH natively supports strong authentication and encryption.
Among its many features, SSH supports port forwarding, which allows protocols like
Telnet and the R-command suite to operate over an encrypted SSH session. The encrypted
SSH session is transparent to Telnet and other forwarded protocols.
The File Transfer Protocol (FTP) is commonly used to transfer configuration files and system
images to and from infrastructure devices such as switches, routers, and storage arrays.
FTP supports authentication of users, but authentication is accomplished by sending user
credentials in the clear. Once a user is authenticated, the user may access the FTP server. In
other words, successful authentication implies authorization. No mechanism is dened for
the user to authenticate the server. Additionally, data is transferred in the clear. To address
these security deciencies, IETF RFC 2228 denes several security extensions to FTP. The
extensions provide secure bi-directional authentication, authorization, data integrity, and data
condentiality. Any or all of these extensions may be used by an FTP implementation. Secure
implementations of FTP should not be confused with the SSH File Transfer Protocol (SFTP).
SFTP is in the development stage and is currently dened in an IETF draft RFC. However,
SFTP is already in widespread use. Despite its misleading name, SFTP is not FTP operating
over SSH. SFTP is a relatively new protocol that supports many advanced features not
supported by FTP. SFTP provides a secure le transfer service and implements some features
typically associated with a le system. SFTP does not support authentication. Instead, SFTP
relies on the underlying secure transport to authenticate users. SFTP is most commonly used
with SSH, but any secure transport can be leveraged. Another option for moving configuration
files and system images is the RCP command/protocol. As previously stated, RCP operates in
the clear and is considered insecure. The Secure Copy (SCP) command/protocol is based on
RCP, but SCP leverages the security services of SSH. SCP is not currently standardized.
NOTE
FTP deployed with security extensions is generically called secure FTP. However, secure FTP
is not abbreviated as SFTP. SSH File Transfer Protocol is officially abbreviated as SFTP.
Note that the Simple File Transfer Protocol is also officially abbreviated as SFTP. Readers
are encouraged to consider the context when the acronym SFTP is encountered.
Ethernet Security
The IEEE 802.1X-2001 specification provides a port-based architecture and protocol for
the authentication and authorization of Ethernet devices. The authorization function is
not granular and merely determines whether access to the LAN is authorized following
successful authentication. A device to be authenticated (such as a host) is called a
supplicant. A device that enforces authentication (such as an Ethernet switch) is called an
authenticator. The authenticator relays supplicant credentials to an authentication server,
which permits or denies access to the LAN. The authentication server function may be
implemented within the authenticator device. Alternately, the authentication server may
be centralized and accessed by the authenticator via RADIUS, TACACS+, or other such
protocol. A port in an Ethernet switch may act as authenticator or supplicant. For example,
when a new Ethernet switch is attached to a LAN, the port in the existing Ethernet switch
acts as authenticator, and the port in the new Ethernet switch acts as supplicant.
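The supplicant/authenticator/authentication-server relationship described above can be sketched as follows. This is an illustrative model only; the function and variable names are assumptions, not part of the 802.1X standard or any RADIUS/TACACS+ API.

```python
# Minimal sketch of the 802.1X roles (illustrative; names are assumptions).

# The authentication server holds the credential database.
AUTH_SERVER_DB = {"host-a": "secret-a", "switch-b": "secret-b"}

def authentication_server(identity, credential):
    """Permit or deny LAN access (the non-granular authorization decision)."""
    return AUTH_SERVER_DB.get(identity) == credential

def authenticator(supplicant_identity, supplicant_credential):
    """The switch port relays supplicant credentials to the authentication
    server (in practice via RADIUS, TACACS+, or a local server function)
    and then authorizes or blocks the port."""
    if authentication_server(supplicant_identity, supplicant_credential):
        return "port authorized"
    return "port unauthorized"

print(authenticator("host-a", "secret-a"))   # known supplicant, valid credential
print(authenticator("host-a", "wrong"))      # failed authentication blocks the port
```

Note that the authorization decision is binary, matching the coarse granularity described above: the port either joins the LAN or it does not.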
VLANs can be used as security mechanisms. By enforcing traffic isolation policies along
VLAN boundaries, Ethernet switches protect the devices in each VLAN from the devices
in other VLANs. VLAN boundaries can also isolate management access in Ethernet
switches that support VLAN-aware RBAC.
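The VLAN isolation policy amounts to a simple forwarding rule, sketched below with an assumed port-to-VLAN table (not a real switch API): frames are forwarded only between ports in the same VLAN.

```python
# Illustrative sketch of VLAN traffic isolation (assumed data model).
PORT_VLAN = {"p1": 10, "p2": 10, "p3": 20}  # access-port VLAN membership

def forward_allowed(ingress_port, egress_port):
    """An Ethernet switch forwards intra-VLAN traffic only; inter-VLAN
    traffic must pass through a router, where policy can be applied."""
    return PORT_VLAN[ingress_port] == PORT_VLAN[egress_port]

print(forward_allowed("p1", "p2"))  # True: same VLAN
print(forward_allowed("p1", "p3"))  # False: isolated by the VLAN boundary
```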
IP Security
Security for IP-based communication is provided via many mechanisms. Central to these
is the IP Security (IPsec) suite of protocols and algorithms. The IPsec suite is defined in
many IETF RFCs. Each IPsec RFC falls into one of the following seven categories:
Architecture
Encapsulating Security Payload
Authentication Header
Encryption Algorithms
Authentication Algorithms
Key Management Protocols
Domain of Interpretation
IPsec provides the following security services:
Access control
Data origin authentication
Connectionless data integrity
Anti-replay protection
Data confidentiality
Limited traffic flow confidentiality
IPsec is implemented at the OSI Network Layer between two peer devices, so all IP-based
ULPs can be protected. IPsec supports two modes of operation: transport and tunnel. In
transport mode, a security association (SA) is established between two end nodes. In tunnel
mode, an SA is established between two gateway devices or between an end node and a
gateway device. A security association is a unidirectional tunnel identified by a Security
Parameter Index (SPI), the security protocol used (AH or ESP), and the destination IP address.
Two SAs must be established (one in each direction) for successful communication to occur.
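The SA model above can be made concrete with a small sketch. The triple that identifies an SA (SPI, security protocol, destination IP) is taken from the text; the data structures and addresses are illustrative assumptions.

```python
from collections import namedtuple

# An SA is unidirectional and identified by (SPI, protocol, destination IP).
SA = namedtuple("SA", ["spi", "protocol", "dst_ip"])

# Two SAs are required for bidirectional communication between peers A and B.
sa_a_to_b = SA(spi=0x1001, protocol="ESP", dst_ip="192.0.2.2")
sa_b_to_a = SA(spi=0x2002, protocol="ESP", dst_ip="192.0.2.1")

# Each peer indexes its inbound SAs by the identifying triple.
inbound_sadb_at_b = {(sa_a_to_b.spi, sa_a_to_b.protocol, sa_a_to_b.dst_ip): sa_a_to_b}

def lookup_sa(sadb, spi, protocol, dst_ip):
    """Locate the SA that protects an arriving packet (None if unknown)."""
    return sadb.get((spi, protocol, dst_ip))

print(lookup_sa(inbound_sadb_at_b, 0x1001, "ESP", "192.0.2.2"))  # the A-to-B SA
```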
IP routers and switches also support Access Control Lists (ACL). An ACL permits or denies
protocol actions based on a highly granular permissions list applied to the ingress or egress
traffic of a specified interface or group of interfaces. An ACL can be applied to inter-VLAN
traffic or intra-VLAN traffic. Inter-VLAN traffic is filtered by applying an ACL to a router
interface. This is sometimes called a Router ACL (RACL). Intra-VLAN traffic is filtered by
applying an ACL to all non-ISL switch ports in a given VLAN. An intra-VLAN ACL is
sometimes called a VLAN ACL (VACL).
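ACL evaluation is first-match with an implicit deny at the end of the list. The sketch below illustrates that logic with the standard-library ipaddress module; the rule syntax is simplified for illustration and is not Cisco IOS syntax.

```python
import ipaddress

# An ordered permit/deny list, evaluated top-down (illustrative rules).
ACL = [
    ("permit", ipaddress.ip_network("10.1.0.0/16")),
    ("deny",   ipaddress.ip_network("10.0.0.0/8")),
    ("permit", ipaddress.ip_network("0.0.0.0/0")),
]

def acl_action(src_ip):
    """Return the action of the first matching rule (implicit deny if none)."""
    addr = ipaddress.ip_address(src_ip)
    for action, network in ACL:
        if addr in network:
            return action
    return "deny"

print(acl_action("10.1.2.3"))    # permit: matches the /16 before the /8 deny
print(acl_action("10.9.9.9"))    # deny
print(acl_action("192.0.2.1"))   # permit: falls through to the default rule
```

Rule order matters: reversing the first two rules would deny all of 10.0.0.0/8, including 10.1.0.0/16.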
TCP Security
TCP does not natively provide secure communication aside from limited protection against
mis-delivery via the Checksum field in the TCP header. TCP-based applications can rely
on IPsec for security services. Alternately, TCP-based applications can rely on an OSI
Transport Layer protocol (other than TCP) for security services. The Transport Layer
Security (TLS) protocol is one such option. TLS is currently defined in IETF RFC 2246.
TLS operates above TCP (but within the Transport Layer) and provides peer authentication,
connection-oriented data integrity, and data confidentiality. TLS operation is transparent to
all ULPs. TLS comprises two sub-protocols: the TLS Record Protocol and the TLS
Handshake Protocol. TLS is sometimes referred to as the Secure Sockets Layer (SSL).
However, SSL is a separate protocol that was originally developed by Netscape for secure
web browsing. HTTP is still the primary consumer of SSL services. TLS v1.0 evolved from
SSL v3.0. TLS and SSL are not compatible, but TLS implementations can negotiate the use
of SSL when communicating with SSL implementations that do not support TLS.
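Because TLS operates above TCP, an application simply wraps its TCP socket and the ULP is unaware of the encryption. The sketch below uses Python's standard ssl module to show this; the hostname in the commented connection example is illustrative.

```python
import socket
import ssl

# A default client context enforces the peer-authentication service
# described above (certificate validation plus hostname checking).
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy SSL/TLS versions

print(context.verify_mode == ssl.CERT_REQUIRED)  # peer certificate is required
print(context.check_hostname)                    # hostname is verified

# Wrapping a TCP socket would look like this (not executed here; it
# requires a live server, and "example.com" is only an illustration):
# with socket.create_connection(("example.com", 443)) as tcp_sock:
#     with context.wrap_socket(tcp_sock, server_hostname="example.com") as tls_sock:
#         tls_sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
```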
iSCSI Security
As previously discussed, iSCSI natively supports bi-directional authentication. iSCSI
authentication occurs as part of the initial session establishment procedure. iSCSI
authentication is optional and may transpire using clear text messages or cryptographically
protected messages. For cryptographically protected authentication, IETF RFC 3720 permits
the use of SRP, Kerberos V5, the Simple Public-Key GSS-API Mechanism (SPKM) as
dened in RFC 2025, and the Challenge Handshake Authentication Protocol (CHAP)
as dened in RFC 1994. Vendor-specic protocols are also permitted for cryptographically
protected authentication. For all other security services, iSCSI relies upon IPsec.
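The CHAP computation referenced above is simple enough to sketch directly. Per RFC 1994, the response is an MD5 digest over the one-octet identifier, the shared secret, and the challenge, so the secret itself never crosses the wire. The secret value below is illustrative.

```python
import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """RFC 1994: Response = MD5(Identifier || secret || Challenge)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

secret = b"shared-iscsi-secret"     # configured on both initiator and target
identifier = 7
challenge = os.urandom(16)          # the target issues a random challenge

# The initiator computes the response; the target recomputes and compares.
response = chap_response(identifier, secret, challenge)
print(response == chap_response(identifier, secret, challenge))        # True
print(response == chap_response(identifier, b"wrong-secret", challenge))  # False
```

A fresh random challenge per authentication attempt is what prevents replay of a captured response.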
Additional iSCSI security can be achieved by masking the existence of iSCSI devices during
the discovery process. Both Internet Storage Name Service (iSNS) Discovery Domains and
Service Location Protocol (SLP) Scopes can be leveraged for this purpose. Both of these
mechanisms provide limited access control by confining device discovery within
administratively defined boundaries. However, this form of security is based on a merit system; no
enforcement mechanisms are available to prevent direct discovery via probing. Readers are
encouraged to consult IETF RFC 3723 for background information related to iSCSI security.
FC Security
The Fibre Channel Security Protocols (FC-SP) specification defines the following security services and mechanisms:
Device authentication
Device authorization
Connectionless data integrity
Data condentiality
Cryptographic key management
Security policy definition and distribution
Binding restrictions to control which devices (N_Ports, B_Ports, and so on) may join
a fabric, and to which switch(es) a given device may connect
Binding restrictions to control which switches may join a fabric and which switch
pairs may form an ISL
Management access restrictions to control which IP hosts may manage a fabric and
which IP protocols may be used by management hosts
The authentication and binding procedures are based on Worldwide Names (WWN).
The optional ESP_Header defined in the Fibre Channel Framing and Signaling (FC-FS)
specification series provides the data integrity and confidentiality services. Key management
is facilitated by an FC-specific variant of IKE.
Perhaps the best-known FC security mechanism is the FC zoning service. The FC zoning
service is defined in the Fibre Channel Generic Services (FC-GS) specification series. FC
zoning restricts which device pairs may communicate. FC zoning traditionally operates in
two modes: soft zoning and hard zoning. Soft zoning is a merit system in which certain
WWNs are masked during the discovery process. The Fibre Channel Name Server (FCNS)
provides each host a list of targets that the host is permitted to access. The list is derived
from WWN-based policies defined in the Fibre Channel Zone Server (FCZS). However, no
enforcement mechanism is implemented to prevent hosts from accessing all targets. By
contrast, hard zoning enforces communication policies that have traditionally been based
on switch ports (not WWNs). The line between soft and hard zoning is beginning to blur
because newer FC switches support hard zoning based on WWNs.
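The difference between soft and hard zoning can be sketched as follows. The zone database and WWN strings are illustrative assumptions; the point is where enforcement happens: the name server (discovery filtering only) versus the switch forwarding path (per-frame).

```python
# Illustrative model of soft versus hard zoning (assumed data structures).
ZONES = [{"wwn-host1", "wwn-target1"}, {"wwn-host2", "wwn-target2"}]

def zoned_together(wwn_a, wwn_b):
    return any(wwn_a in z and wwn_b in z for z in ZONES)

# Soft zoning: the FCNS merely filters the discovery response.
def fcns_query(host_wwn, all_targets):
    return [t for t in all_targets if zoned_together(host_wwn, t)]

# Hard zoning: the switch drops frames between unzoned pairs, per frame.
def switch_forward(src_wwn, dst_wwn):
    return "forwarded" if zoned_together(src_wwn, dst_wwn) else "dropped"

targets = ["wwn-target1", "wwn-target2"]
print(fcns_query("wwn-host1", targets))            # ['wwn-target1']
# Soft zoning alone cannot stop host1 from addressing target2 directly;
# hard zoning drops the frame:
print(switch_forward("wwn-host1", "wwn-target2"))  # dropped
```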
Virtual Fabrics (VF) can also be used as security mechanisms. By enforcing traffic isolation
policies along VF boundaries, FC switches protect the devices in each VF from the devices
in other VFs. VF boundaries can also isolate management access in FC switches that
support VF-aware RBAC.
Modern storage arrays commonly support another security mechanism called Logical Unit
Number (LUN) masking. LUN masking hides certain LUNs from initiators when the
storage array responds to the SCSI REPORT LUNS command. Note that LUN masking
was developed to ensure data integrity, and the security benefits are inherent side effects.
FC switches produced by Cisco Systems support enforcement of LUN masking policies via
the FC zoning mechanism (called LUN zoning).
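LUN masking reduces to filtering the REPORT LUNS response per initiator, as sketched below. The masking table and WWN names are illustrative assumptions, not a real array API.

```python
# Sketch of LUN masking in a storage array's REPORT LUNS handling.
LUN_MASKING = {
    "wwn-host1": {0, 1},     # host1 may see LUNs 0 and 1
    "wwn-host2": {2},        # host2 may see LUN 2 only
}
ALL_LUNS = {0, 1, 2, 3}

def report_luns(initiator_wwn):
    """Return only the LUNs this initiator is permitted to see."""
    return sorted(ALL_LUNS & LUN_MASKING.get(initiator_wwn, set()))

print(report_luns("wwn-host1"))   # [0, 1]
print(report_luns("wwn-host2"))   # [2]
print(report_luns("wwn-rogue"))   # []: unknown initiators see nothing
```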
FCP Security
FCP does not natively support any security services. FCP relies on the security services
provided by the FC architecture.
FCIP Security
FCIP does not natively support any IP-based security mechanisms. FCIP relies upon IPsec
for all IP-based security services. Additional FCIP security can be achieved by masking the
existence of FCIP devices during the discovery process. Because FCIP does not support
discovery via iSNS, only SLP Scopes can be leveraged for this purpose. However, this form
of security is based on a merit system; no enforcement mechanisms are available to prevent
direct discovery via probing.
Following FCIP link establishment, the FC virtual inter-switch link (VISL) may be secured
by FC-SP procedures. For example, after an FCIP link is established, the peer FC switches
may be authenticated via FC-SP procedures during E_Port (VISL) initialization. If authentication fails, no SCSI data can transit the FCIP link even though an active TCP connection
exists. One limitation of this approach is the inability to authenticate additional TCP
connections that are added to an existing FCIP link. From the perspective of the FC fabric,
the additional TCP connections are transparent. Therefore, FC-SP procedures cannot be
used to validate additional TCP connections. For this reason, the ANSI T11 FC-BB
specification series defines the Authenticate Special Frame (ASF) Switch Internal Link
Service (SW_ILS). The ASF is used to authenticate additional TCP connections before they
are added to an existing FCIP link. When a new TCP connection is requested for an existing
FCIP link, the receiving FCIP Entity passes certain information about the connection
request to the FC Entity. The FC Entity uses that information to send an ASF to the claimed
requestor. The claimed requestor validates the ASF with a Switch Accept (SW_ACC)
SW_ILS if the TCP connection request is valid. Until the ASF transmitter receives an
SW_ACC, SCSI data may not traverse the new TCP connection. Readers are encouraged
to consult IETF RFC 3723 for background information related to FCIP security.
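The gating behavior of the ASF exchange can be sketched as a small state machine: a new TCP connection may not carry SCSI data until the claimed requestor answers the ASF with an SW_ACC. The class and method names below are illustrative, not from the FC-BB specification.

```python
# Sketch of ASF gating for new TCP connections on an existing FCIP link.
class FcipConnection:
    def __init__(self):
        self.authenticated = False   # no SCSI data until SW_ACC is received

    def receive_asf_response(self, response):
        """The ASF transmitter processes the claimed requestor's reply."""
        if response == "SW_ACC":     # the requestor validated the ASF
            self.authenticated = True

    def carry_scsi_data(self):
        return self.authenticated

conn = FcipConnection()
print(conn.carry_scsi_data())        # False: ASF sent, awaiting SW_ACC
conn.receive_asf_response("SW_ACC")
print(conn.carry_scsi_data())        # True: connection joins the FCIP link
```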
Summary
This chapter highlights the primary security protocols used by modern storage networks.
An introduction to in-flight data protection services, RBAC, and AAA is followed by a brief
discussion of the most commonly used AAA and management protocols. An overview of
Ethernet, IP, and TCP security is provided by reviewing the IEEE 802.1X-2001 specification,
the IPsec suite, and SSL/TLS, respectively. iSCSI authentication and discovery are
discussed, followed by a discussion of FC security as defined in the FC-SP specification.
To conclude this chapter, a summary of FCIP security mechanisms is provided.
Review Questions
1 List the five primary protection services for in-flight data.
2 Which management protocol supports port forwarding?
3 What type of security model does Ethernet implement?
4 Which standard defines the IPsec architecture?
5 Is SSL compatible with TLS?
6 Is iSCSI authentication mandatory?
7 Which security architecture was leveraged as the basis for the FC-SP architecture?
8 Which standard defines the ASF SW_ILS?
CHAPTER
13
Conceptual Underpinnings of
Storage Management Protocols
To understand management protocols, readers must first understand certain principles of
management. Typically, several components work together to compose a management
system. At the heart of the management system is a centralized management host called a
management station. In some cases, the management station is actually a group of clustered
hosts. The management station typically communicates with a software component, called
an agent, on each device to be managed. The agent accesses hardware-based instrumentation
to make management data (such as events, states, and values) available to the management
station. A data model is required to ensure that management data is structured using a
well-defined format. A well-defined protocol facilitates communication between the management
station and agents. The types of actions that the management station may perform on the
managed devices are determined by the capabilities of the agents, management station,
communication protocol, and administrative policies. The administrative policies cover
such things as authentication, authorization, data privacy, provisioning, reclamation,
alerting, and event response.
NOTE
Standards play an important role in the cost of management systems. In the absence of
standards, each product vendor must develop a proprietary management agent. Each new
product results in a new proprietary agent. That increases product development costs.
Additionally, management station vendors must adopt multiple proprietary interfaces to
manage heterogeneous devices with proprietary agents. That increases the number of lines
of software code in the management station product. The increased code size often
increases the occurrence of bugs, slows product performance, increases product development
costs, slows product adaptation to accommodate new managed devices, and complicates
code maintenance when changes are made to one or more existing agents. Standards
address these challenges so that product development efforts can be focused on high-level
management functionality instead of basic communication challenges. As a result, prices
fall and innovation occurs more rapidly. Much of the innovation currently taking place in
storage and network management seeks to automate provisioning and reclamation tasks.
Several categories of storage-related management applications exist. One such category,
called Storage Resource Management (SRM), provides the ability to discover,
inventory, and monitor disk and tape resources. Some SRM products support visualization
of the relationship between each host and its allocated storage resources. Many SRM
vendors are actively augmenting their products to include policy-based, automated
provisioning and reclamation. SRM applications are sold separately from the storage resources
being managed. SAN management is another category. SAN management applications are
often called fabric managers. SAN management provides the ability to discover, inventory,
monitor, visualize, provision, and reclaim storage network resources. Most FC switch
vendors bundle a SAN management application with each FC switch at no additional
charge. However, advanced functionality is often licensed or sold as a separate SAN
management application. Note that the line between SRM and SAN management is blurring
as convergence takes place in the storage management market.
Data management is another category. Data management is sometimes called Hierarchical
Storage Management (HSM). HSM should not be confused with Information Lifecycle
Management (ILM). HSM provides policy-based, automated migration of data from one
type of storage resource to another. HSM policies are usually based on frequency or recency
of data access. The goal of HSM is to leverage less-expensive storage technologies to store
data that is used infrequently. HSM products have existed for decades. They originated in
mainframe environments. The concept of ILM recently evolved from HSM. ILM performs
essentially the same function as HSM, but ILM migration policies are based on the business
value of data rather than solely on frequency and recency of use. The word information
implies knowledge of the business value of the data. In other words, all data must be
classified by the ILM application. The phrase tiered storage is often used in conjunction
with ILM. ILM applications migrate data between tiers of storage. Each storage tier
provides a unique level of performance and reliability. Thus, each storage tier has a unique
cost basis. The concept of tiered storage originally derives from the HSM context. However,
storage tiers were typically defined as different types of storage media (disk, tape, and
optical) in the HSM context. In the ILM context, storage tiers are typically defined as
TCP/IP Management
Management of IP-based devices is accomplished via the Internet Standard Management
Framework (ISMF). The ISMF and supporting specifications are summarized in IETF
RFC 3410. Many people erroneously refer to the Simple Network Management Protocol
(SNMP) when they really mean to refer to the ISMF. While SNMP is a key component of
the framework, several other components are required for the framework to be of any use.
The five major components of the framework are as follows:
A virtual store, called the Management Information Base (MIB), for organizing and
storing management data
SMI is based on the ISO's Abstract Syntax Notation One (ASN.1). The most recent version
of the SMI is called SMIv2. It is defined in RFC 2578. SMIv2 defines data types, an object
model, and syntax rules for MIB module creation. The framework views each management
datum as an object. Management objects are stored in the MIB. The most recent version of
the MIB is called MIB-II. It is defined in RFC 1213. Many proprietary extensions have been
made to the MIB without compromising the standard object definitions. This is possible
because the MIB is modular. The internal structure of the MIB is an inverted tree in which
each branch represents a group of related management objects (called a module). MIB
modules are sometimes called MIBs. Vendors can define their own MIB modules and graft
those modules onto the standard MIB. An agent in a managed device uses the MIB to
organize management data retrieved from hardware instrumentation. Management stations
access the MIB in managed devices via SNMP requests sent to agents. SNMP version 2
(SNMPv2) is the most recent and capable version of the communication protocol. SNMPv2
is defined in RFC 3416.
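The MIB's inverted-tree structure can be sketched with numeric OIDs. The OIDs below are real standard assignments (the MIB-II system group and the enterprises subtree); the flat-dictionary representation of the tree is an illustration, not how agents store data.

```python
# Sketch of the MIB tree using OID tuples as keys (illustrative model).
MIB = {
    (1, 3, 6, 1, 2, 1, 1, 1, 0): "sysDescr",    # mib-2.system.sysDescr.0
    (1, 3, 6, 1, 2, 1, 1, 3, 0): "sysUpTime",   # mib-2.system.sysUpTime.0
    (1, 3, 6, 1, 4, 1, 9):       "ciscoSubtree",  # vendor graft point: enterprises.9
}

def walk(prefix):
    """Return the objects in the subtree rooted at prefix, as a walk would."""
    return sorted(name for oid, name in MIB.items() if oid[:len(prefix)] == prefix)

print(walk((1, 3, 6, 1, 2, 1, 1)))  # the standard system group
print(walk((1, 3, 6, 1, 4, 1)))     # vendor modules grafted under enterprises
```

Grafting a vendor module is simply adding objects under the vendor's assigned branch, which is why proprietary extensions do not disturb the standard object definitions.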
Chapter 12, Storage Network Security, states that SNMPv2 uses a community-based
security model, whereas SNMPv3 uses a user-based security model. More accurately
stated, SNMPv2 combined with a community-based security model is called SNMPv2c,
and SNMPv2 combined with a user-based security model is called SNMPv3. The fact
that the same communication protocol (SNMPv2) is used by SNMPv2c and SNMPv3 is
commonly misunderstood. SNMPv2 typically operates over UDP, but TCP is also
supported. However, the RFC that defines a mapping for SNMP over TCP is still in the
experimental state. Another common misconception is the notion that SNMP is a management application. In actuality, SNMP is an enabling technology (a protocol) for management
applications. Most SNMP operations occur at the direction of a management application
layered on top of SNMP.
Management stations can read MIB object values by issuing the GetRequest Protocol Data
Unit (PDU) or one of its variants. Management stations can modify MIB object values
by issuing the SetRequest PDU. The SetRequest PDU can be used to configure device
operations, clear counters, and so on. Note that SNMP supports limited device configuration
functionality. Each GetRequest PDU, variant of the GetRequest PDU, and SetRequest PDU
elicits a Response PDU. Managed devices can also initiate communication. When an event
transpires or a threshold is reached, a managed device can reactively send notification to the
management station via the Trap PDU. A Trap PDU does not elicit a Response PDU
(unacknowledged). Alternately, reactive notification can be sent via the InformRequest
PDU. Each InformRequest PDU elicits a Response PDU (acknowledged).
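The PDU exchanges above can be sketched with a toy agent. This is not a real SNMP stack; the class, its dictionary MIB, and the tuple return values are illustrative assumptions that show which PDUs elicit a Response and which do not.

```python
# Toy agent illustrating the SNMP PDU exchanges (not a real SNMP stack).
class Agent:
    def __init__(self):
        self.mib = {"ifAdminStatus": "up", "ifInErrors": 42}

    def get_request(self, name):           # GetRequest elicits a Response
        return ("Response", self.mib.get(name))

    def set_request(self, name, value):    # SetRequest elicits a Response
        self.mib[name] = value             # e.g., configure or clear a counter
        return ("Response", value)

    def trap(self, event):                 # Trap: unacknowledged notification
        return ("Trap", event)

    def inform_request(self, event):       # InformRequest: acknowledged
        return ("InformRequest", event), ("Response", event)

agent = Agent()
print(agent.get_request("ifInErrors"))     # ('Response', 42)
print(agent.set_request("ifInErrors", 0))  # clear the counter
print(agent.trap("linkDown"))              # no Response follows a Trap
```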
Worthy of mention is a class of MIB modules used for Remote Network Monitoring
(RMON). In the past, network administrators had to deploy physical devices to remotely
monitor the performance of a network. To manage those specialized devices, a MIB module
(the RMON MIB) was developed. Over time, the functionality of remote network
monitoring devices was integrated into the actual network devices, thus eliminating the
need for physically separate monitoring devices. The RMON MIB was also integrated into
the self-monitoring network devices. As new classes of network devices came to market,
self-monitoring functionality adapted. New MIB modules were defined to manage the new
devices, and the RMON MIB module was also augmented. Today, several variants of the
RMON MIB are defined and in widespread use.
Another management framework, called Web-Based Enterprise Management (WBEM),
was developed by the Distributed Management Task Force (DMTF). WBEM seeks to unify
the management of all types of information systems including storage arrays, servers,
routers, switches, protocol gateways, firewalls, transaction load balancers, and even
applications. WBEM aspires to go beyond the limits of SNMP. One example is WBEM's
support for robust provisioning and configuration. WBEM uses the Common Information
Model (CIM) as its object model. Like the MIB object model, CIM supports vendor
extensions without compromising interoperability. However, CIM goes beyond the MIB
object model by supporting the notion of relationships between objects. CIM also supports
methods (operations) that can be invoked remotely. CIM enables a wide variety of disparate
management applications to share management data in a common format. WBEM is
modular so that any modeling language and communication protocol can be used. The
Extensible Markup Language (XML) is the most commonly used modeling language to
implement CIM, which is defined in the xmlCIM Encoding specification. HTTP is the most
commonly used communication protocol, which is defined in the CIM Operations Over
HTTP specification. The CIM-XML specification brings together the xmlCIM Encoding
specification and the CIM Operations Over HTTP specification. WBEM supports discovery
via SLP, which is generally preferred over the direct probing behavior of SNMP-based
management stations. In the WBEM framework, an agent is called a CIM provider or a CIM
server, and a management station is called a CIM client.
Web Services is a technology suite produced by the Organization for the Advancement of
Structured Information Standards (OASIS). The Web Services architecture is built upon the
Simple Object Access Protocol (SOAP), the Universal Description, Discovery and
Integration (UDDI) protocol, the Web Service Definition Language (WSDL), and XML.
The DMTF began work in 2005 to map the Web Services for Management (WS-Management)
specication and the Web Services Distributed Management (WSDM) specication onto
WBEM. Upon completion, WS-Management and WSDM will be able to manage resources
modeled with CIM.
The first CIM-based storage management technology demonstration occurred in October
1999 at an industry trade show called Storage Networking World. Various working groups
within the Storage Networking Industry Association (SNIA) continued work on CIM-based
management projects. Simultaneously, a group of 16 SNIA member companies developed
a WBEM-based storage management specification called Bluefin. The goal of Bluefin was
to unify the storage networking industry on a single management interface. The group of
16 submitted Bluefin to SNIA in mid-2002. SNIA subsequently created the Storage
Management Initiative (SMI) to streamline the management projects of the various SNIA
working groups and to incorporate those efforts with Bluefin to produce a single storage
management standard that could be broadly adopted.
The resultant standard is called the Storage Management Initiative Specification (SMI-S),
which leverages WBEM to provide a high degree of interoperability between heterogeneous
devices. Storage and networking vendors have developed object models for storage devices
and storage networking devices, and work continues to extend those models. Additionally,
object models for storage services (such as backup/restore, snapshots, clones, and volume
management) are being developed. SNIA continues to champion SMI-S today in the hopes
of fostering widespread adoption of the new standard. Indeed, most storage and storage
networking vendors have already adopted SMI-S to some extent. ANSI has also adopted
SMI-S v1.0.2 via the INCITS 388-2004 specification. SNIA also conducts conformance
testing to ensure vendor compliance.
Another common function supported on storage and storage networking devices is call
home. Call home is similar to SNMP traps in that they both reactively notify support
personnel when certain events transpire. Unlike SNMP traps, call home is usually invoked
only in response to hardware or software failures. That said, some devices allow
administrators to decide which events generate a call home. Many devices support call
home over modem lines, which explains the name of the feature. Call home functionality
is not standardized, so the communication protocols and message formats are usually
proprietary (unlike SNMP traps). That said, some devices support call home via the Simple
Mail Transfer Protocol (SMTP), which operates over TCP/IP in a standardized manner as
defined in IETF RFC 2821.
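A call-home notification carried over SMTP is, at bottom, an automatically composed e-mail message. The sketch below builds one with Python's standard email module; the addresses, subject, and event name are illustrative assumptions, since call-home message formats are vendor proprietary.

```python
from email.message import EmailMessage

# Illustrative call-home notification destined for SMTP transport.
msg = EmailMessage()
msg["From"] = "callhome@sw1.example.com"
msg["To"] = "support@vendor.example.com"
msg["Subject"] = "Call home: power supply failure"
msg.set_content("Device sw1 reports event PS1_FAIL; hardware failure detected.")

print(msg["Subject"])
# Sending would use smtplib (not executed here; hostname is illustrative):
# import smtplib
# with smtplib.SMTP("mail.example.com") as s:
#     s.send_message(msg)
```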
Similar to SNMP traps and call home, the Syslog protocol can be used to log details about
device operation. The Syslog protocol was originally developed by the University of California
at Berkeley for use on UNIX systems. The popularity of the Syslog protocol precipitated
its adoption on many other operating systems and even on networking devices. Though
Syslog was never standardized, IETF RFC 3164 documents the behavior of many Syslog
implementations in the hopes of improving interoperability. Like SNMP traps, Syslog
messages are not acknowledged. Another useful management tool is the accounting
function of the AAA model discussed in Chapter 12, Storage Network Security. Accounting
data can be logged centrally on an AAA server or locally on the managed device. Today,
most devices support some level of accounting.
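An RFC 3164-style Syslog message can be sketched in a few lines. The PRI value really is computed as facility * 8 + severity; the hostname, tag, and event text below are illustrative.

```python
# Sketch of an RFC 3164-style syslog message.
FACILITY_LOCAL7 = 23
SEVERITY_WARNING = 4

def syslog_message(facility, severity, timestamp, host, tag, content):
    pri = facility * 8 + severity      # the <PRI> value that leads the message
    return f"<{pri}>{timestamp} {host} {tag}: {content}"

msg = syslog_message(FACILITY_LOCAL7, SEVERITY_WARNING,
                     "Oct 11 22:14:15", "fcswitch1",
                     "PORT", "fc1/1 link down")
print(msg)  # <188>Oct 11 22:14:15 fcswitch1 PORT: fc1/1 link down
# Like SNMP traps, delivery is unacknowledged (traditionally UDP port 514).
```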
IP operates over every major OSI Layer 2 protocol including FC. The ANSI T11 Fibre
Channel Link Encapsulation (FC-LE) specification defines a mapping for IP and ARP onto
FC. However, the FC-LE specification fails to adequately define all necessary aspects of IP
operation over FC. So, the IETF filled in the gaps by producing RFC 2625. Together, the two
specifications provide sufficient guidance for IP over FC (IPFC) to be implemented reliably.
IPFC originally was envisioned as a server-oriented transport for use in clustered server
environments, localized grid computing environments, tiered application environments,
High Performance Computing (HPC) environments, and so on. However, IPFC can also be
used for management purposes. Any given device (server or otherwise) that is attached to
a FC fabric is inevitably also attached to an IP network. If a device loses its primary IP
connection (perhaps because of a hardware failure), IPFC can be used to access the device
over the FC fabric. This back door approach ensures that management stations can access
devices continuously even in the presence of isolated communication failures. However,
very few administrators are willing to compromise the reliability of SCSI operations by
introducing a second ULP into their FC-SANs. Moreover, modern server platforms are
commonly configured with dual Ethernet NICs for improved redundancy on the LAN.
Consequently, IPFC is rarely used for management access. In fact, IPFC is rarely used
for any purpose. Low latency Ethernet switches and InfiniBand switches are generally
preferred for high performance server-to-server communication. Thus, IPFC is currently
relegated to niche applications.
FC Management
The ANSI T11 Fibre Channel Generic Services (FC-GS) specification series defines several
services that augment the functionality of FC-SANs. Among these is an in-band
consistently across disparate operating systems and server platforms. Note that the same
information can be gathered from the FC fabric via the Performance Server, the HBA
Management Server, and the FCNS.
The FC-GS specification series also defines a notification service called the Event Service.
Fabric attached devices can register with the Event Server to receive notification of events.
Currently, a limited number of events are supported. Like the FC Management Service, the
Event Service is distributed.
The FC-LS specification defines various Extended Link Service (ELS) commands that can
be issued by a fabric device to ascertain state information about devices, connections,
Exchanges, and sequences. Examples include Read Exchange Status Block (RES), Read
Sequence Status Block (RSS), Read Connection Status (RCS), Read Link Error Status
Block (RLS), and Read Port Status Block (RPS). These commands are issued directly from
one device to another.
SCSI Management
SCSI enclosures provide electrical power, thermal cooling, and other support for the
operation of SCSI target devices. Two in-band techniques for management of SCSI
enclosures warrant mention. The first is the SCSI Accessed Fault-Tolerant Enclosures
(SAF-TE) specification. The most recent SAF-TE specification was produced by the nStor
Corporation and the Intel Corporation in 1997. SAF-TE provides a method of monitoring
fault-tolerant SCSI enclosures using the following six SCSI commands:
INQUIRY
READ BUFFER
REQUEST SENSE
SEND DIAGNOSTIC
TEST UNIT READY
WRITE BUFFER
All six of these commands are defined in the ANSI T10 SCSI Primary Commands (SPC)
specification series. SAF-TE is a proprietary specification that is published in an open
manner. SAF-TE is not a de jure standard. That said, the goal of SAF-TE is to provide a
nonproprietary method for heterogeneous SCSI controllers to monitor heterogeneous
storage enclosures. SAF-TE is implemented as a SCSI processor inside the enclosure. The
SAF-TE SCSI processor supports target functionality. The SCSI controller (initiator)
periodically polls the SAF-TE target using the aforementioned commands to detect
changes in enclosure status such as temperature and voltage levels. The SAF-TE target can
also assert indicators (for example, lights and audible alarms) to indicate the status of
enclosure components such as fans, power supplies, and hot-swap bays.
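The polling model can be sketched in a few lines of Python. In this hypothetical example, the six opcodes are real values from the ANSI T10 SPC specification, but the transport callback and the CDB field values are illustrative placeholders for a platform-specific SCSI pass-through interface:

```python
# Opcodes for the six SPC commands used by SAF-TE (values per ANSI T10 SPC).
SAFTE_OPCODES = {
    "TEST UNIT READY": 0x00,
    "REQUEST SENSE":   0x03,
    "INQUIRY":         0x12,
    "SEND DIAGNOSTIC": 0x1D,
    "WRITE BUFFER":    0x3B,
    "READ BUFFER":     0x3C,
}

def poll_enclosure(send_cdb):
    """Poll a SAF-TE target once. `send_cdb` is a hypothetical transport
    callback that accepts a CDB (bytes) and returns the response data."""
    # Confirm the SAF-TE processor is responsive; TEST UNIT READY is a
    # 6-byte CDB of all zeros (opcode 0x00).
    send_cdb(bytes(6))
    # Read enclosure status (temperature, voltage, slot states) via
    # READ BUFFER; the mode/offset/length values here are illustrative only.
    cdb = bytes([SAFTE_OPCODES["READ BUFFER"], 0x01, 0, 0, 0, 0, 0, 0, 64, 0])
    return send_cdb(cdb)

# Stand-in target used only to demonstrate the call pattern.
def fake_target(cdb):
    return bytes(64) if cdb and cdb[0] == SAFTE_OPCODES["READ BUFFER"] else b""

status = poll_enclosure(fake_target)
```

In a real deployment, the callback would wrap an operating-system pass-through mechanism rather than a simulated target.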
The second in-band technique is called SCSI Enclosure Services (SES). SES is defined in
the ANSI T10 SES specification series. SES provides a method of monitoring and
managing the components of a SCSI enclosure. SES is conceptually similar to SAF-TE,
and SES has the same goal as SAF-TE. However, SES is a standard, and ANSI development
of SES is ongoing. Like SAF-TE, SES is implemented as a SCSI target. SES operation is
similar to SAF-TE, but SCSI controllers (initiators) use only two commands to access SES:
SEND DIAGNOSTIC
RECEIVE DIAGNOSTIC RESULTS
Both of these commands are defined in the ANSI T10 SPC specification series.
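As an illustration of what an initiator does with the returned data, the following Python sketch decodes the element status codes defined by SES (the low nibble of the first byte of each 4-byte status descriptor). The example page bytes are fabricated for demonstration; a real implementation would obtain the Enclosure Status diagnostic page (page code 0x02) via RECEIVE DIAGNOSTIC RESULTS and use the Configuration page to map descriptors to element types:

```python
# SES element status codes (low nibble of the first byte of each
# 4-byte status descriptor), per the ANSI T10 SES specification.
STATUS_CODES = {
    0x0: "Unsupported",
    0x1: "OK",
    0x2: "Critical",
    0x3: "Noncritical",
    0x4: "Unrecoverable",
    0x5: "Not Installed",
    0x6: "Unknown",
    0x7: "Not Available",
}

def decode_status_descriptors(data):
    """Decode a sequence of 4-byte SES element status descriptors."""
    elements = []
    for i in range(0, len(data) - 3, 4):
        code = data[i] & 0x0F
        elements.append(STATUS_CODES.get(code, "Reserved"))
    return elements

# Example: one healthy element followed by one missing element.
page = bytes([0x01, 0, 0, 0, 0x05, 0, 0, 0])
print(decode_status_descriptors(page))  # ['OK', 'Not Installed']
```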
Summary
This chapter provides an introduction to management concepts by discussing the
components of a management system and the role of standards in the realm of management.
Certain categories of management applications are then described. The two predominant
IP-based management frameworks, ISMF and WBEM, are reviewed and followed by a
brief description of SMI-S, call home, Syslog, Accounting, and IPFC. The FC Management
Service and its constituent parts are explained. Concluding the FC section is a description
of the HBA API, the FC Event Service, and management-related ELS commands. Finally,
two SCSI in-band management techniques, SAF-TE and SES, are examined.
Review Questions
1 List two aspects of management that are currently undergoing rapid innovation.
2 List the five major components of the ISMF.
3 WBEM usually operates over which communication protocol?
4 List the constituent parts of the FC Management Service.
5 List two in-band techniques for management of SCSI enclosures.
CHAPTER
14
facilitate traffic capture. This method is known as in-line traffic capture. The disadvantages
of this approach include the following:
The suspect device must be taken offline temporarily to connect the protocol decoder
and again to subsequently disconnect the protocol decoder.
The act of inserting another device into the data stream can introduce additional
problems or mask the original problem.
A one-to-one relationship exists between the suspect device and the protocol decoder,
so only one device can be decoded at any point in time unless multiple protocol
decoders are available. Note that protocol decoders tend to be very expensive.
The advantages of this approach include the following:
All types of frames can be captured, including low-level primitives that are normally
terminated by the physically adjacent device.
OSI Layer 1 issues related to faulty cabling and connectors can be detected.
To mitigate the drawbacks of in-line traffic capture, Cisco Systems developed an alternate
approach (out-of-line). The Switch Port Analyzer (SPAN) feature was introduced on the
Catalyst family of Ethernet switches in the mid-1990s. The MDS9000 family of switches
also supports SPAN. SPAN is also known as port mirroring and port monitoring. SPAN
transparently copies frames from one or more ports to a specified port called a SPAN
Destination (SD) port. In most cases, the SD port can be any port in the switch. A protocol
decoder is attached to the SD port. The disadvantages of this approach include the following:
Low-level primitives that are normally terminated by the device physically adjacent
to the suspect device cannot be captured.
OSI Layer 1 issues related to faulty cabling and connectors cannot be detected.
The advantages of this approach include the following:
Multiple SPAN sessions can be configured and activated simultaneously. Thus, the
one-to-one relationship between the suspect device and the protocol decoder is
removed. This reduces the total number of protocol decoders required to troubleshoot
large networks.
SPAN trafc can be forwarded between switches via the Remote SPAN (RSPAN)
feature. RSPAN further reduces the total number of protocol decoders required to
troubleshoot large networks.
After the protocol decoder is connected to the switch, an administrator can congure
and activate SPAN/RSPAN sessions remotely.
No new devices are introduced into the original data stream, so no additional
problems are created.
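As a concrete illustration, a SPAN session on an MDS 9000 switch is configured along these lines. This is a hedged sketch, not a definitive reference: exact syntax varies by SAN-OS/NX-OS release, and the interface numbers are hypothetical.

```
switch# configure terminal
switch(config)# interface fc1/16
switch(config-if)# switchport mode SD
switch(config-if)# no shutdown
switch(config-if)# exit
switch(config)# span session 1
switch(config-span)# source interface fc1/1 rx
switch(config-span)# destination interface fc1/16
```

The SD port (fc1/16 here) is where the protocol decoder attaches; the session copies frames received on fc1/1 to that port.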
Another approach is to use a signal splitter such as a tap or Y-cable, but this approach
has its own set of drawbacks that precludes widespread adoption. So, signal splitters are
used to meet niche requirements, and the in-line and out-of-line approaches are used to
meet mainstream requirements.
be used to capture and decode control traffic such as FLOGI requests and RSCNs. Captured
frames can be decoded in real time, saved to a file for future decoding, or encapsulated in
TCP/IP and forwarded in real time to a PC that has Ethereal installed.
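At its core, protocol decoding is structured interpretation of captured bytes. The following Python sketch (purely illustrative, not how Ethereal is implemented) decodes the Ethernet II header of a captured frame:

```python
import struct

def decode_ethernet(frame):
    """Decode an Ethernet II header: 6-byte destination MAC,
    6-byte source MAC, and a 2-byte EtherType, in network byte order."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    mac = lambda b: ":".join(f"{octet:02x}" for octet in b)
    return {"dst": mac(dst), "src": mac(src), "ethertype": hex(ethertype)}

# A fabricated broadcast frame carrying an IP packet (EtherType 0x0800).
frame = (bytes([0xFF] * 6)
         + bytes([0x00, 0x11, 0x22, 0x33, 0x44, 0x55])
         + bytes([0x08, 0x00])
         + b"payload")
decoded = decode_ethernet(frame)
print(decoded["ethertype"])  # 0x800
```

A full decoder simply repeats this process layer by layer, handing the payload of each header to the parser for the protocol its type field identifies.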
A fifth option is to integrate a hardware-based capture and decode product into a switch.
That is the case with the Cisco Systems Network Analysis Module (NAM). The NAM is
available for the Catalyst family of switches. The NAM can capture and decode frames
from multiple ports simultaneously. The NAM provides access to the decoded frames via
an integrated web server.
Summary
This chapter introduces the concepts of traffic capture and protocol decoding and explains
why they are needed. Various methods for capturing traffic are discussed and followed by
descriptions of the various types of protocol decoders. Screenshots of decoded frames are
included to help readers understand the value of protocol decoding. The concept of traffic
analysis is then presented and followed by descriptions of the various types of traffic
analyzers.
Review Questions
1 Why are protocol decoders necessary?
2 List the three methods of traffic capture.
3 Name the most popular free GUI-based protocol decoder.
4 What is the purpose of traffic analysis?
5 What is the highest performance solution available for traffic analysis?
PART
IV
Appendixes
Appendix A
Appendix B
Appendix C
APPENDIX
Common Information Model (CIM) Operations over HyperText Transfer Protocol (HTTP)
DSP0200
Version 1.2
12/9/2004
Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and
Physical Layer Specifications
802.3-2002
3/8/2002
Media Access Control (MAC) Parameters, Physical Layers and Management Parameters
for 10Gbps Operation
802.3ae-2002
8/30/2002
Physical Layer and Management Parameters for 10Gbps Operation, Type 10GBASE-CX4
802.3ak-2004
3/1/2004
Resilient Packet Ring (RPR) Access Method and Physical Layer Specifications
802.17-2004
9/24/2004
Internet Protocol
RFC 791
A TCP/IP Tutorial
RFC 1180
IP Authentication Header
RFC 2402
Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers
RFC 2474
Small Computer System Interface protocol over the Internet (iSCSI) Requirements and
Design Considerations
RFC 3347
Internet Protocol Small Computer System Interface (iSCSI) Cyclic Redundancy Check
(CRC)/Checksum Considerations
RFC 3385
User-based Security Model (USM) for version 3 of the Simple Network Management
Protocol (SNMPv3)
RFC 3414
Version 2 of the Protocol Operations for the Simple Network Management Protocol
(SNMP)
RFC 3416
String Profile for Internet Small Computer System Interface (iSCSI) Names
RFC 3722
Small Computer System Interface (SCSI) Command Ordering Considerations with iSCSI
RFC 3783
Finding Fibre Channel over TCP/IP (FCIP) Entities Using Service Location Protocol
version 2 (SLPv2)
RFC 3822
T11 Network Address Authority (NAA) Naming Format for iSCSI Node Names
RFC 3980
Finding Internet Small Computer System Interface (iSCSI) Targets and Name Servers by
Using Service Location Protocol version 2 (SLPv2)
RFC 4018
The Kerberos Version 5 Generic Security Service Application Program Interface (GSSAPI) Mechanism: Version 2
RFC 4121
Bootstrapping Clients using the Internet Small Computer System Interface (iSCSI)
Protocol
RFC 4173
The IPv4 Dynamic Host Configuration Protocol (DHCP) Option for the Internet Storage
Name Service
RFC 4174
Vendor Documents
IBM Enterprise Systems Connection (ESCON) Implementation Guide
SG24-4662-00
7/1996
APPENDIX
10-Gbps Ethernet
10GFC
10GigE
10-Gbps Ethernet
3PC
ACK Acknowledge
ACL Access Control List
ADISC Discover Address
AH Authentication Header
AHS Additional Header Segment
ANSI
API
AQM
ARC
ASN.1
ATA
ATAPI
ATM
B_Port Bridge Port
BA_ACC Basic Accept
BA_RJT Basic Reject
BB_SC Buffer-to-Buffer_State_Change
BBC Buffer-to-Buffer_Credit
BER
BF Build Fabric
BGP Border Gateway Protocol
BGP-4
CDB
Cyber Security
Control
CU Control Unit
CUP Control Unit Port
CWDM Coarse Wavelength Division Multiplexing
CWND Congestion Window
CWR
D_ID Destination ID
DA Directory Agent
DA Destination Address
DAT
DC Direct Current
DCI Direct-Coupled Interlock
DD Discovery Domain
DDS Discovery Domain Set
DEC
DF Don't Fragment
DF_CTL
Differentiated Services
Domain of Interpretation
DPT
DS Data Streaming
DS Digital Signal Services
DS-0 Digital Signal 0
DT Double Transition
E_Port Expansion Port
ECE ECN Echo
EE_Credit End-to-End_Credit
EMI Electro-Magnetic Interference
EN Enterprise Number
EOF End-of-Frame
EOR Electro-Optical Repeater
ESB Exchange Status Block
ESC
F_CTL Frame Control
F_Port Fabric Port
FCAL
FC Congestion Control
FC-FE
Fast Ethernet
FIN Final
FL Fabric Loop
FL_Port Fabric Loop Port
FLOGI Fabric Login
FT_1
Gigabit Ethernet
GFP-T
GHz
Gigahertz
GigE
Gigabit Ethernet
HIPPI
HPC
InfiniBand
IDE
IKE
Integrated Services
IO Input/Output
IPv6
iSER
ISL Inter-Switch Link
IVR Inter-VSAN Routing
LOGO Logout
LPSM Loop Port State Machine
LR Link Reset
LRR Link Reset Response
LS_ACC Link Service Accept
LS_RJT Link Service Reject
MAC
MIT
Multimode Single-Ended
MSL
N_Port Node Port
NAA Network Address Authority
NAS
NAS
NAT
NIC
NIR
NL Node Loop
NL_Port Node Loop Port
NOP-In No Operation In
NOP-Out No Operation Out
Optical Carrier
OS-Code
OSI
Operation Code
PC Personal Computer
PCB
PLOGI N_Port Login
PRLI Process Login
PS1 Pause 1
PS2 Pause 2
PSH Push
PSS Principal Switch Selection
PSTN Public Switched Telephone Network
PTP Point-To-Point
PVID Port VLAN Identifier
PWWN Port Worldwide Name
QAS Quick Arbitration and Selection
QoS Quality of Service
R_CTL Routing Control
R_RDY Receiver_Ready
R_T_TOV Receiver Transmitter Time-Out Value
R2T Ready to Transfer
RDI Request Domain_ID
RF1 Remote Fault 1
RF2 Remote Fault 2
RFC
RPR
RST Reset
RTO Retransmission Time-Out
RTT Round-Trip Time
SCSI
SD SPAN Destination
SDH Synchronous Digital Hierarchy
SDLC
SE Single-Ended
SEQ_CNT Sequence Count
SI Sequence Initiative
SMB
SMC
SMI
SNMPv3
SNR
Signal-to-Noise Ratio
SPI
Single Transition
ST
SPAN Trunk
SW_ACC
TE_Port
TPGT
TSecr Timestamp Echo Reply
TSopt Timestamps Option
TST Task Set Type
TSval Timestamp Value
TTL
Time to Live
TTT
U/L
Universal/Local
UA User Agent
UI Unnumbered Information
UNC
URG Urgent
URL
VC_RDY Virtual_Circuit_Ready
VE_Port
VF Virtual Fabric
VI Virtual Interface
VID VLAN ID
VISL
Windows Sockets
APPENDIX
Answer: IBM.
2 Who published the SMI-S?
Answer: Any two of the following are correct: increased throughput, improved
flexibility, higher scalability, LAN-free backups, server-free backups, advanced
volume management.
4 What is an IP-SAN?
Answer: ATA.
6 What is the latest version of the SCSI protocol?
Answer: SCSI-3.
7 What is the term for an adapter that processes TCP packets on behalf of the host CPU?
Answer: While Microsoft gave CIFS its name, the original protocol was invented by
IBM.
12 What standards body made NFS an industry standard?
Answer: Any one of the following is correct: laser tolerances can be relaxed,
uncooled lasers can be used, optical filters can be less discerning.
16 In what class of networks is IEEE 802.17?
Chapter 2
1 How many layers are specied in the OSI reference model?
Answer: Seven. Many popular networking technologies divide the seven OSI Layers
into sublayers, but these optional subdivisions are not specied within the OSI
reference model.
2 The data-link layer communicates with which OSI layer in a peer node?
7 Create a mnemonic device for the names and order of the OSI layers.
Answer: Any answer that helps the reader remember the names and the order of the
OSI Layers is correct. A popular mnemonic device is All People Seem To Need Data
Processing.
Chapter 3
1 List the functions provided by bit-level encoding schemes.
Answer: Star, linear, circular, tree, partial mesh and full mesh.
4 How can a protocol that is not designed to operate in a circular physical topology
Answer: Either of the following is correct: when a human learns of the service/device
name or address via some non-computerized method or when a device has been
manually configured.
6 What additional steps might be required after service/device discovery?
Answer: Name and/or address resolution depending on the contents of the discovery reply.
7 How many ULPs was the SPI designed to support? List them.
Answer: That ULPs will provide the required functionality that is missing from Ethernet.
13 What logical topology is supported by Ethernet?
Answer: Tree.
14 List three characteristics that contributed to the early success of TCP/IP.
Answer: SendTargets.
17 What two protocols are available for automated discovery of iSCSI target devices and
nodes?
Answer: SLP and iSNS.
18 Did FC evolve into a multi-protocol transport, or was it designed as such from the
beginning?
Answer: FC was designed to be multi-protocol from the beginning.
19 What is the ULP throughput rate of FCP on one Gbps FC?
Answer: Zoning.
Chapter 4
1 How does iSCSI complement the traditional storage over IP model?
Answer: Yes.
3 How does the FC network service model resemble the IP network service model?
Answer: Both models provide a robust set of services that are available to all
application protocols.
Answer: No.
8 Which address mode is most commonly used in modern iFCP deployments?
Chapter 5
1 What two components compose a SCSI logical unit?
Answer: In a network.
6 Is it possible for a network protocol to guarantee that frames or packets will never be
dropped?
Answer: No.
7 Why do most packet-switching technologies support bandwidth reservation schemes?
9 Which term describes tagging of frames transmitted on an ISL to indicate the VLAN
Answer: XENPAK.
11 What is the maximum operating range of the SPI?
Answer: 25m.
12 Does the SPI support in-order delivery?
Answer: 30km.
15 What is the BER of 1000BASE-T?
Answer: 1010.
16 What are the /C1/ and /C2/ ordered sets used for in 1000BASE-SX?
Answer: Three.
21 Can flow-based load balancing be implemented across an Ethernet port channel?
Answer: Yes.
22 Do copper-based and fiber-based Ethernet implementations follow the same link
initialization procedures?
Answer: No.
Answer: LC.
24 What is the maximum operating range of 2-Gbps FC on 62.5 micron MMF?
Answer: 150m.
25 How many 8B/10B control characters does 4-Gbps FC use?
Answer: One.
26 What information does the NAA field of an FC WWN provide?
Answer: The type of address contained within the Name field and the format of the
Name field.
27 What do FC attached SCSI initiators do following receipt of an RSCN frame?
layer?
Answer: Yes.
29 Which FC CoS is used by default in most FC-SANs?
Answer: CoS 3.
30 Does FC currently support automation of link aggregation?
Answer: No, but some proprietary methods have recently appeared in the market.
31 What determines the sequence of events during extended FC link initialization?
Chapter 6
1 Is IP a routed protocol or a routing protocol?
Answer: Routed.
2 What IP term is equivalent to buffer?
Answer: Queue.
3 What is the Ethertype of IP?
Answer: 0x0800.
4 Does Ethernet padding affect an IP packet?
Answer: No.
Answer: A frame format denition, the Link Control Protocol (LCP) and a suite of
Network Control Protocols (NCPs).
6 Which of the IPCP negotiated options affect FCIP and iFCP?
Answer: None.
7 Are DNS names analogous to SAM port names?
Answer: No.
8 What is the IP equivalent of a SAM port identier?
Answer: IP address.
9 What is the granularity of a CIDR prex?
Answer: 56.93.16.0.
11 What is a non-authoritative DNS reply?
Answer: A reply that is sent to a DNS client by a DNS server that does not have
authority over the domain referenced in the reply.
12 Why is NAT required for communication over the Internet between devices using
Answer: It enables the receiving device to determine the format of the IP header.
16 What is the maximum length of the Data field in an IP packet?
Answer: Yes.
18 How many flow-control mechanisms does IP support?
Answer: Four.
Answer: No.
20 In which layer of the OSI Reference Model does ICMP reside?
header?
Answer: The Type field indicates the category of message, while the Code field
identifies the specific message.
Chapter 7
1 Is reliable delivery required by all applications?
Answer: No.
2 Does UDP provide segmentation/reassembly services?
Answer: No.
3 What is the purpose of the Destination Port field in the UDP packet header?
Answer: It enables the receiving host to properly route incoming packets to the
appropriate Session Layer protocol.
4 Does UDP provide notication of dropped packets?
Answer: Yes.
9 What is the purpose of the TCP ISN?
Answer: To synchronize the data byte counter in the responding host's TCB with the
data byte counter in the initiating host's TCB.
Answer: FC-2 performs segmentation, but the ULP controls the process.
11 Which two fields in the PLOGI ELS facilitate negotiation of end-to-end delivery
Answer: The Exchange Error Policy is specified by the Exchange originator on a
per-Exchange basis via the Abort Sequence Condition bits of the F_CTL field of the FC
header of the first frame of each new Exchange.
13 After completing PLOGI with the switch, what does each initiator N_Port do?
Answer: Each initiator N_Port queries the FCNS and then performs PLOGI with each
target N_Port discovered via the FCNS.
Chapter 8
1 Name the three types of iSCSI names.
Answer: To promote initiator SCSI port persistence for the benefit of applications and
to facilitate target recognition of initiator SCSI ports in multi-path environments.
3 What key feature does iSNS support that SLP does not support?
Answer: RSCN.
4 In the context of iSCSI, what is the difference between immediate data and unsolicited
data?
Answer: Immediate data is sent in the SCSI Command PDU, whereas unsolicited
data accompanies the SCSI Command PDU in a separate Data-Out PDU.
5 What is the function of the iSCSI R2T PDU?
Answer: The iSCSI target signals its readiness to receive write data via the R2T PDU.
6 Can security parameters be re-negotiated during an active iSCSI session?
Answer: No.
7 Why must iSCSI use its own CRC-based digests?
Answer: Because the TCP checksum cannot detect all bit errors.
8 Does iSCSI support stateful recovery of active tasks in the presence of TCP
connection failures?
Answer: Yes.
9 Does FCP natively support the exchange of all required operating parameters to
Answer: No.
11 How many bytes long is the FCP_DATA IU header?
Answer: Zero. The receiver uses only the FC Header to identify an FCP_DATA IU,
and ULP data is directly encapsulated in the Data field of the FC frame.
12 What is the name of the FC-4 link service defined by the FCP-3 specification?
Answer: The type of non-FC network to which the FC-BB device connects.
16 How many FC Entities may be paired with an FCIP Entity?
Answer: One.
17 Does the FCIP addressing scheme map to the SAM addressing scheme?
Answer: No.
18 Why must the FC frame delimiters be encapsulated for transport across FC backbone
networks?
Answer: Because they serve several purposes rather than merely delimiting the start
and end of each FC frame.
19 Which field in the FSF may be set to zero for the purpose of discovery?
Chapter 9
1 What is the primary function of all flow-control mechanisms?
tail-drop?
Answer: FIFO.
4 Which specification defines traffic classes, class groupings, and class-priority
be transmitted?
Answer: The Congestion Window (CWND).
8 What is the primary flow-control mechanism employed by iSCSI?
Answer: No.
Chapter 10
1 What are the two basic steps that every switching and routing protocol performs?
Answer: No.
4 Is RSTP backward-compatible with STP?
Answer: Yes.
5 Does EIGRP support load balancing across unequal-cost paths?
Answer: Yes.
6 How many levels in the topological hierarchy does OSPF support?
Answer: Two.
7 How do BGP-4 peers communicate?
Chapter 11
1 What is the primary goal of load balancing?
Answer: No.
5 Does iSCSI require any third-party software to perform load balancing at the
Transport Layer?
Answer: No.
6 Does FCP support Transport Layer load balancing?
Answer: No.
7 Does FCIP support Exchange-based load balancing across TCP connections within an
FCIP link?
Answer: No.
8 Name the two techniques commonly used by end nodes to load-balance across DMP
paths.
Answer: LUN-to-path mapping and SCSI command distribution.
Chapter 12
1 List the five primary protection services for in-flight data.
Answer: SSH.
3 What type of security model does Ethernet implement?
Answer: No.
6 Is iSCSI authentication mandatory?
Answer: No.
7 Which security architecture was leveraged as the basis for the FC-SP architecture?
Answer: IPsec.
Chapter 13
1 List two aspects of management that are currently undergoing rapid innovation.
Answer: The SMI, the MIB, the SNMP, a security model and management
applications.
3 WBEM usually operates over which communication protocol?
Answer: HTTP.
4 List the constituent parts of the FC Management Service.
Chapter 14
1 Why are protocol decoders necessary?
Answer: Because the tools integrated into server, storage and network devices are
sometimes insufficient to resolve communication problems.
2 List the three methods of traffic capture.
Answer: Ethereal.
4 What is the purpose of traffic analysis?
GLOSSARY
A
advanced technology attachment (ATA). The block-level protocol historically
used in laptops, desktops, and low-end server platforms. See also direct attached
storage (DAS).
anti-replay protection. A security service that detects the arrival of duplicate
B
back-pressure. Any mechanism that slows transmission by creating congestion
block-level protocol. A protocol used to store and retrieve data via storage
device constructs.
block multiplexer. One of two classes of mainframe channel protocols.
bridge. A device that operates at the OSI data-link layer to connect two physical
network segments. A bridge can connect segments that use the same
communication protocol (transparent bridge) or dissimilar communication
protocols (translational bridge). A bridge can forward or filter frames.
byte multiplexer. One of two classes of mainframe channel protocols.
C
carrier sense multiple access with collision detection (CSMA/CD). The
used to verify the integrity of a frame or packet. Some checksum algorithms can
detect multi-bit errors; others can detect only single-bit errors.
class of service (CoS). (1) A synonym for QoS. (2) A field in the 802.1Q
header used to mark Ethernet frames for QoS handling. (3) The mode of delivery
in Fibre Channel networks.
classful network. An IP network address that complies with the addressing
packets on the Internet using network addresses that do not comply with the
addressing definitions in RFC 791.
coarse wavelength division multiplexing (CWDM). A simplified implementa-
(CDB).
command/response protocol. A protocol used to control devices. The devices
can operate as peers, or as masters or slaves.
that all PDUs associated with a single SCSI command must traverse a single TCP
connection.
crosstalk. Signal interference on a conductor caused by another signal on a
nearby conductor.
cut-through switching. A method of switching in which the forwarding of each
frame begins while the frame is still being received. As soon as sufficient header
information is received to make a forwarding decision, the switch makes a
forwarding decision and begins transmitting the received bits on the appropriate
egress port. This method was popular during the early days of Ethernet switching
but is no longer used except in high performance computing (HPC) environments.
cyclic redundancy check (CRC). A mathematical calculation capable of
D
data confidentiality. A security service that prevents exposure of application
Droop occurs when the round-trip time of a link is greater than the time required
to transmit sufficient data to fill the receiver's link-level buffers.
dynamic multi-pathing (DMP). A generic term describing the ability of end
E
electromagnetic interference (EMI). The degradation of a signal caused by
Corporation, Intel, and Xerox. Ethernet II evolved from Ethernet, which was
originally created by Bob Metcalfe while working at Xerox PARC. Ethernet II is
sometimes called DIX, DIX Ethernet, or Ethernet DIX. DIX stands for Digital,
Intel, and Xerox.
EtherType. The system used by Ethernet and Ethernet II to assign a unique
exchange recipient will handle frame and sequence errors. The exchange
originator specifies the error policy in the first frame of each new exchange.
exchange status block (ESB). A small area in memory used to maintain the
F
fabric. The term used in Fibre Channel environments to describe a switched
network.
fabric manager. A SAN management application that provides the ability to
17 clock signals and 16 bits of configuration information that is used for autonegotiation of Ethernet link parameters in copper-based implementations.
FC-SAN. Acronym for Fibre Channel (FC) storage area network (SAN). A SAN
built on FC for the purpose of transporting SCSI or SBCCS traffic.
Fibre Channel Connection (FICON). A proprietary channel architecture based
constructs.
File Transfer Protocol (FTP). A file-level protocol used for rudimentary access
magnetic substance. Data blocks on a floppy disk are accessed randomly, but
capacity, throughput, and reliability are very low relative to hard disks. Floppy
disks are classified as removable media.
floppy disk drive. See floppy drive.
floppy drive. A device used to read from and write to floppy disks.
fragmentation. The process by which an intermediate network device
fragments an in-transit frame into multiple frames for the purpose of MTU
matching. Fragmentation occurs when the MTU of an intermediate network link
is smaller than the MTU of the transmitter's directly connected link.
GH
gigabit interface converter (GBIC). A small hardware device used to adapt an
that contains compiled software code (such as BIOS or firmware) might behave a
certain way regardless of configuration parameters applied by the device
administrator. Such behavior is said to be hardcoded into the device because the
behavior cannot be changed.
hard disk. A storage medium and drive mechanism enclosed as a single unit.
The medium itself is made of aluminum alloy platters covered with a magnetic
substance and anchored at the center to a spindle. Data blocks on a hard disk are
accessed randomly. Hard disks provide high capacity, throughput, and reliability
relative to floppy disks. Hard disks are classified as permanent media.
hard disk drive. See hard disk.
hard drive. See hard disk.
hierarchical storage management (HSM). A method of automating data
network. An HBA is installed in the PCI bus, Sbus, or other expansion bus of
a host.
access text files across a network. HTTP is now often used for read-only access
to files of all types and also can be used for writing to files.
I
in-fiber optical amplifier (IOA). An optical fiber that has been doped with a rare
earth element. Certain wavelengths excite the rare earth element, which in turn
amplifies other wavelengths.
initiator-target nexus (I_T Nexus). A SCSI term referring to the relationship
numbers such as parameters, ports, and protocols for use on the global Internet.
Internet attached storage (IAS). See network attached storage (NAS).
JK
jitter. Variance in the amount of delay experienced by a frame or packet during
transit.
just-a-bunch-of-disks (JBOD). A chassis containing disks. JBOD chassis do
not support the advanced functionality typically supported by storage subsystems.
keepalives. Frames or packets sent between two communicating devices during
L
lambda. A Greek letter. The symbol for lambda is used to represent wavelength.
The term lambda is often used to refer to a signal operating at a specific wavelength.
LAN-free backup. A technique for backing up data that confines the data path to a SAN.
line rate. The theoretical maximum rate at which OSI Layer 2 frames can be
forwarded onto a link. Line rate varies from one networking technology to
another. For example, the line rate of IEEE 802.3 10BaseT is different than the line
rate of IEEE 802.5 16-Mbps Token Ring.
link aggregation. The bundling of two or more physical links in a manner that
blocks on a storage medium such as disk or tape. LBA constitutes the lowest level
of addressing used in storage and does not represent a virtualized address scheme.
logical unit number (LUN). The identification of a logical partition in SCSI
environments.
M
magnetic drum. One of the earliest mediums for storing computer commands
and data. A magnetic drum is a cylindrical metal surface covered with a magnetic
substance.
magnetic tape. A long, thin strip of celluloid-like material covered with a
management system.
maximum transmission unit (MTU). The maximum payload size supported by
a given interface on a given segment of an OSI Layer 2 network. MTU does not
include header or trailer bytes. MTU is used to indicate the maximum data payload
to ULPs.
mean time between failures (MTBF). A metric for the reliability of a repairable
device expressed as the mean average time between failures of that device. MTBF
is calculated across large numbers of devices within a class.
mean time to failure (MTTF). A metric for the reliability of non-repairable
devices expressed as the mean average time between the rst use of devices within
a class and the rst failure of devices within the same class. MTTF is calculated
across large numbers of devices within a class.
mirror set. A set of partitions on physically separate disks to which data blocks
Examples include the SCSI Parallel Interface (SPI), IEEE 802.3 10Base2, and
IEEE 802.3 10Base5.
multi-source agreement (MSA). A voluntary agreement among multiple
N
network address authority (NAA). Any standards body that denes an address
format.
network architecture. The totality of specifications and recommendations
required to implement a network.
network attached storage (NAS). A class of storage devices characterized by
OP
ones complement. A mathematical system used to represent negative numbers
in computer systems.
ordered set
functions. An ordered set is usually 2 or 4 bytes long. Each byte has a specific
meaning, and the bytes must be transmitted in a specific order.
organizationally unique identifier (OUI). The IEEE-RAC assigned 24-bit
value that is used as the first three bytes of an EUI-48 or EUI-64 address. The OUI
identifies the organization that administers the remaining bits of the EUI-48 or
EUI-64 address.
parallel ATA (PATA). A series of specifications for ATA operation on parallel bus
interfaces.
parallel SCSI. A series of specifications for SCSI operation on parallel bus
interfaces.
path maximum transmission unit (PMTU). The smallest MTU along a given
QR
quality of service (QoS). The level of delivery service guaranteed by a network
the digitally encoded signal, regenerates the signal, and retransmits the signal on
a different link than whence it was received.
resilient packet ring (RPR). A class of data-link layer metropolitan area
transport technologies that combine characteristics of TDM-based optical rings
with LAN technologies.
router. A device that operates at the OSI network layer to connect two data-link
layer networks. A router can forward or filter packets. Routers typically employ
advanced intelligence for processing packets.
routing. The combined act of making forwarding decisions based on OSI Layer 3
S
SAM identifier. An identifier assigned to a SCSI port or SCSI logical unit. SAM
SCSI port through which the logical unit is accessed. The identier assigned to a
logical unit can change. A SAM logical unit identier is commonly called a SCSI
logical unit number (LUN).
SAM name. An identifier assigned to a SCSI device or SCSI port that is globally
unique within the context of the SCSI Transport Protocol implemented by the
SCSI device. SCSI names are used to positively identify a SCSI device or SCSI
port. SCSI names never change.
SAM port identifier. An identifier that is unique within the context of the SCSI
service delivery subsystem to which the port is connected. The identifier assigned
to a SCSI port can change.
SAN manager. See fabric manager.
SCSI application client. The SCSI protocol implemented in an initiator.
SCSI application protocol. A generic term encompassing both the SCSI
application client and the SCSI logical unit.
SCSI asynchronous mode. A mode of communication in which each request
data into multiple smaller chunks that can be framed for transmission without
exceeding the MTU of the egress port in the NIC, HBA, or other network interface.
semiconductor optical amplifier (SOA). A device that converts an optical
signal to an electrical signal, amplifies it, and then converts it back to optical.
sequence qualifier. A sequence identifier that is guaranteed to be unique during
the R_A_TOV and is composed of the S_ID, D_ID, OX_ID, RX_ID, and SEQ_ID
fields of the header of a Fibre Channel frame. The sequence qualifier is used for
error recovery.
sequence status block (SSB). A small area in memory used to maintain the
bus interfaces.
serial attached SCSI (SAS). A new series of specifications for SCSI operation
node or the ULP. The dropped frame or packet may be counted in an error log.
single-byte command code set (SBCCS). The block-multiplexer protocol
that provides the ability to discover, inventory, and monitor disk and tape
resources. Automated storage provisioning and reclamation functions are also
supported by some SRM applications. Some SRM applications also support
limited SAN management functionality.
of each frame begins after the entire frame has been received. As soon as
sufficient header information is received to make a forwarding decision, the
switch makes a forwarding decision but does not begin transmitting until the rest
of the frame is received. This method is currently the de facto standard for
Ethernet switching in non-HPC environments.
stripe set. A set of partitions on physically separate disks to which data and
optional parity blocks are written. One block is written to one partition, then
another block is written to another partition, and so on until one block is written
to each partition. The cycle then repeats until all blocks are written. The set of
partitions is presented to applications as a single logical partition.
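The round-robin block placement described above can be sketched in a few lines of Python (an illustrative toy, not an actual volume manager):

```python
def stripe(blocks, partitions):
    """Distribute data blocks across a stripe set round-robin:
    block 0 to partition 0, block 1 to partition 1, and so on,
    cycling back to partition 0 until every block is placed."""
    stripes = [[] for _ in range(partitions)]
    for i, block in enumerate(blocks):
        stripes[i % partitions].append(block)
    return stripes

# Six blocks striped across three partitions
print(stripe(list(range(6)), 3))  # [[0, 3], [1, 4], [2, 5]]
```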
switch. A multiport bridge that employs a mechanism for internal port inter-
connection. A switch has the ability to simultaneously forward frames at line rate
across multiple ports. A switch typically employs advanced intelligence for
processing frames.
switch internal link service (SW_ILS). Special Fibre Channel procedures and
frame payloads that provide advanced low-level networking functions and operate
internal to a fabric (that is, between switches).
switching. The combined act of making forwarding decisions based on OSI
Layer 2 information and forwarding frames.
synchronous digital hierarchy (SDH). A physical layer wide area transport
which the states of the source and destination disk images are never more than one
write request out of sync at any point in time. Both the source and destination disk
devices must successfully complete the write request before SCSI status is
returned to the initiator.
T
TCP offload engine (TOE). An intelligent hardware interface to a network. A
TOE is capable of processing TCP and IP packets on behalf of the host CPU.
A TOE is installed in the PCI bus, Sbus, or other expansion bus of a host.
TCP port number. A 16-bit identifier assigned by IANA to each well-known
session layer protocol that operates over TCP. Each port number that IANA
assigns may be used by TCP or UDP. Unassigned port numbers may be used by
vendors to identify their own proprietary session layer protocols. A range of port
numbers is reserved for use by clients to initiate TCP connections to servers.
TCP sliding window. The mechanism by which a host dynamically updates its
address of a host with the port number of a session layer protocol within the host.
third party copy (3PC). The predecessor to the SCSI-3 EXTENDED COPY command.
time division multiplexing (TDM). A technique for multiplexing signals onto a
shared medium in which the full bandwidth of the medium is allotted to each
source sequentially for a fixed and limited period of time.
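The slot-by-slot allotment can be illustrated with a small Python sketch (a hypothetical helper, not from the text), in which each source in turn gets the full medium for one fixed-length time slot:

```python
import itertools

def tdm_schedule(sources, slots):
    """Return the TDM transmission order: the full bandwidth of the
    shared medium is allotted to each source sequentially, one fixed
    time slot at a time, repeating in round-robin order."""
    cycle = itertools.cycle(sources)
    return [next(cycle) for _ in range(slots)]

print(tdm_schedule(['A', 'B', 'C'], 6))  # ['A', 'B', 'C', 'A', 'B', 'C']
```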
top level domain (TLD). A domain within the Domain Name System (DNS)
type of service (ToS). A field in the IP header that was previously used to
specify Quality of Service (QoS) requirements on a per-packet basis. This field
was redefined in RFC 2474 as the Differentiated Services (DiffServ) field.
UVW
UDP port number. See TCP port number.
uniform naming convention (UNC). The resource identification mechanism
used by CIFS.
uniform resource locator (URL). The resource identification mechanism used
or a file-level entity such as an NFS mount point. A volume can also be a tape
cartridge or other removable media.
volume serial number (VSN). See volume tag.
volume tag. An identifier used to track physical units of removable storage.
WDM window. Wavelengths generally in the range of 1500–1600 nanometers.
well known address (WKA). A reserved address that is used for access to a
network service.
well known logical unit. A logical unit that supports specified functions, which
INDEX
Numerics
1Base5, 71
8B/10B encoding scheme, 135
8B1Q4 encoding scheme, 136
10/100 auto-sensing switches, 68
10GE, throughput, 70
10GFC, throughput, 86
64B/66B encoding scheme, 137
1000BASE-X Configuration ordered sets,
fields, 147
A
AAA, 386
Abort Sequence (ABTS) BLS, 295
ABTS BLS, 313–314
Accept format (SRR), fields, 317
ACK segments, 235
acknowledgements, SAM delivery mechanisms,
119–120
ACLs (access control lists), 390
address masking, 201
address resolution, FC, 162–166
addressing
DNS, 197
Ethernet, 137–138
EUI-64 format, 115–116
FC, 158–160
ANSI T11 WWN address fields, 156
FCAL, 87
FCIP, 331–333
FCP, 293
IPv4, 198
dotted decimal notation, 202
fields, 199
subnetting, 199
iSCSI, 245
I_T nexus, 248
IP address assignment, 251
ISIDs, 247–249
port names, 246
MAC-48 format, 114–115
NAAs, 114
SAM
element addressing scheme, 112
identiers, 110
LUNs, 111
names, 110
SPI, 128
ADISC (Discover Address), 161
adoption rate of FCP, 98
advanced tail-drop, 356
advanced volume management, 8
agents, 395
aliases, 246
ALL parameter (SendTargets
command), 80
all-networks broadcasts, 369
ANSI (American National Standards
Institute), 8
ANSI T11 subcommittee
Ethernet address formats, 156–159
Fibre Channel specifications, 50–51
ANSI X3 committee, 9
anti-replay protection, 385
APIs, HBA API, 401
application layer, 44
AQM, as IP flow control mechanism, 357
arbitration processes, SPI, 129–130
areas, 372
ARPANET model, 48
comparing with OSI reference model, 49
AS (autonomous system), 370
ASBRs (autonomous system boundary
routers), 373
ASF (ANSI T11 FC-BB) specification, 393
ASN.1 (Abstract Syntax Notation One), 397
assigning WKPs, 218
asymmetric ow control, 355
ATA (advanced technology attachment)
protocol, 12–13
authentication
FC-SP model, DH-CHAP, 391
iSCSI, 391
Kerberos, 387
RADIUS, 386
SRP protocol, 387
supplicants, 389
TACACS+, 387
B
B_Access functional model (FCIP), 329–331
B_Ports, 325
BA_ACC format of BLSs, 314
BA_RJT format of BLSs, 315
backbone architecture, FCIP, 325
back-pressure flow control mechanisms, 69
backup protocols
NDMP, 23
SCSI-3 EXTENDED COPY, 24
baud rate, 55
definition of, 85
BB_Credit (Buffer-to-Buffer) mechanism, 361
BB_Credit Recovery, 362
best path determination for IP routing
protocols, 379
best-effort service, 358
BGP-4, 373
BHS (Basic Header Segment) fields, 255–257
Data-In BHS, 270–271
Data-Out BHS, 268–269
Login Request BHS, 258–259
Login Response BHS, 260–261
R2T BHS, 271–273
Reject BHS, 281–282
Reason codes, 282
SCSI Command BHS, 262–263
SCSI Response BHS, 264, 267
SNACK Request BHS, 274–275
TMF Request BHS, 275–276
TMF Response BHS, 278–280
bi-directional authentication, 391
Bidirectional Write Residual
Underflow, 265
bit rates, Gbps FC, 85
block-level storage protocols
ATA, 12–13
SBCCS, 16–17
SCSI, 13, 15
block-oriented storage virtualization, 33
C
calculating ULP throughput on iSCSI, 76
call home feature, 399
capturing traffic, 405–406
carrier signal, 55
cascade topologies, 60
cascading, 18
CCWs (channel command words), 16
cells, 42
channel architectures, 16
channels, defining, 84
CIDR (Classless Inter-Domain Routing), 201
CIFS (common Internet file system), 20–22
CIM (Common Information Model), 398
Cisco Fabric Analyzer, protocol decoding, 407
Class N services, 173
classful addressing, 199
classifying routing protocols, 370
CmdSN (Command Sequence Number), 97
collapsed ring topologies, 59–60
colored GBICs, 29
commands
linked, 253
mode select, 294
move medium, 113
non-immediate, 255
read element status, 113
scsi mode, 294
SendTargets, 250
community strings, 388
comparing FC and IP network models, 99
concentrators, 67
conditioners, 358
configuration ordered sets, 136
connection allegiance, 293
connection initialization
for FC, 241–242
TCP, 234–235
connections, iSCSI stateful recovery, 292
connectivity
disruptions on FCIP, 101
IP, ubiquitous nature of, 74–75
connectors
Ethernet, 132–134
FC, 150–153
SPI, 126
continuously increasing mode, 236
control frames, 16
conversations, 378
core-edge designs, 87
core-only designs, 87
corrupt packets, SAM delivery mechanisms, 119
CoS (Class of Service), 173
Crocker, Stephen, 75
crosstalk, mitigating, 6
CSMA/CD (carrier sense multiple access with
collision detection), 68
CUs (disk control units), 16
CWDM (coarse wavelength division
multiplexing), 28–29
CWND (Congestion Window), 359
D
DAFS (direct access file system), 20–23
DAFS Implementers Forum, 23
daisy-chains, cascading, 18
DAT (direct access transports), 22
DAT Collaborative, 23
data backup protocols
NDMP, 23
SCSI-3 EXTENDED COPY, 24
data confidentiality, 385
data integrity service, 385
data link layer, 42
Data Link Layer technologies
comparing with Network Layer
technologies, 195
Ethernet, 195–196
PPP, 196
data management, 396
data mobility, 8
data movers, 24
data origin authentication service, 385
data transfer optimization
FCIP, 345–346
iSCSI, 253
DataDigest operational text key, 283
Data-In BHS, fields, 270–271
Data-Out BHS, fields, 268–269
DataPDUInOrder operational text key, 285
DataSequenceInOrder operational
text key, 285
DCI (direct-coupled interlock), 16
DD (discovery domain) membership, 83
debug commands, 405
DefaultTime2Retain operational text key, 285
DefaultTime2Wait operational text key, 285
defining storage networks, 9–11
delivery mechanisms
Ethernet, 144–145
FC, 173–176, 240–241
FCIP, 343–345
FCP, 320–324
precise delivery, 322–323
IP, 209–210
iSCSI
in-order command delivery, 291–292
PDU retransmission, 289–291
SAM
acknowledgements, 119–120
corrupt frames, 119
delivery failure notification, 118
duplicates, 118
flow control, 120
fragmentation, 121–122
guaranteed bandwidth, 121
guaranteed delivery, 120
guaranteed latency, 121
in-order delivery, 122–124
packet drops, 118
QoS, 120–121
SPI, 131
TCP, 232–234
UDP, 220
device discovery, 62
Ethernet, 73
FC, 89–90
IPS protocols
automated configuration, 81–84
manual configuration, 79
semi-manual configuration, 79–81
TCP/IP, 78–81
devices, IAS, 10
DH-CHAP (Diffie-Hellman CHAP), 391
DiffServ (Differentiated Services
Architecture), 358
directed broadcasts, 206, 369
director-class FC switches, 99
discovery sessions, iSCSI, 251
distance vector routing protocols, 370
RIP, 371
DNS (Domain Name Service), 197
IPv4 address assignment and resolution,
204–205
IPv4 name assignment and resolution, 203
domains, 160
dotted decimal notation, 202
Doyle, Jeff, 373
DPT (dynamic packet transport), 31
droop, 18
drops, causes of, 118
DS (data streaming), 16
DS-0 (digital signal 0), 26
duplicates, SAM delivery mechanisms, 118
DWDM (Dense Wave Division Multiplexing), 27
IOAs, 29
native interfaces, 28
protocol-aware interfaces, 28
transparent interfaces, 28
E
EBP (Exchange B_Access Parameters), 186
EBP Reject reason codes, 342
EBSI (Enhanced Backup Solutions
Initiative), 13
ECN (Explicit Congestion Notification), 358
EDFAs (erbium-doped fiber amplifiers), 29
EGPs (Exterior Gateway Protocols),
370, 373
EIGRP (Enhanced IGRP), 372
F
fabric managers, 396
Fabric Zone Server (FC Management
Service), 401
fabric-class FC switches, 99
FARP (Fibre Channel Address Resolution
Protocol), 161
FC (Fibre Channel), 84, 236
10GFC, throughput, 86
addressing, 156–160
FC, 361–362
FCIP, 363
FCP, 363
IP, 356
AQM, 357
ICMP Source-Quench
messages, 357
iSCSI, 360
network-level, 210
proactive versus reactive, 353
SAM, 120
TCP, 233, 359
flow-based Ethernet load balancing
algorithms, 378
FLP LCW, fields, 148–149
format
of IPv4 packets, 207–209
of IQN node names, 245
of ISIDs, 249
of UDP packets, 219–220
FQXID (Fully Qualified Exchange
Identifier), 99, 297
fragmentation
of IP packets, effect on
throughput, 75
versus segmentation, 122
frame-level load balancing, 122
frames, 41
Ethernet, 140
Ethernet II, 141
IEEE 802.1Q-2003 subheader format, 143
IEEE 802.3-2002, 140–141
IEEE 802-2001 subheader format, 142
FC, fields, 168–173
FC FLA/LS_ACC ELS, fields, 340–341
FSF, fields, 337, 339–340
OC-3 frame structure, 26
FSF (FCIP Special Frame) fields, 337–340
FSPF (Fabric Shortest Path First), 374, 405
load balancing, 380
path determination, 368
FTP (File Transfer Protocol), 388
FTT (FCIP Transit Time), 328
full feature phase (iSCSI), 252
full-feature phase of iSCSI sessions, 80
functional models (FCIP)
B_Access functional model, 329, 331
VE_Port functional model, 325–328
G
gateways, iFCP, 104
Gbps FC, 85
GE (gigabit Ethernet)
NIC teaming, 84
throughput, 70
GLBP (Gateway Load Balancing
Protocol), 379
gratings, 29
guaranteed bandwidth, SAM, 121
guaranteed delivery, SAM, 120
guaranteed latency, SAM, 121
H
hard zones (FC), 90, 392
hardware-based protocol decoding, 407
HBA (host bus adapter), 14
Management Server, 401
HDLC-like Framing, 197
HeaderDigest operational text key, 283
high availability, achieving on FC
networks, 98
host-based virtualization, 34–35
hot code load, 161
HSM (Hierarchical Storage Management), 396
Hunt Group addressing, 166
hybrid devices, 194
I
I_T (initiator-target) nexus, 248
IANA (Internet Assigned Numbers Authority),
WKP assignment, 218
IAS (Internet attached storage) devices, 10
IBM
Bus-and-Tag parallel channel
architecture, 17
channel architecture, 16
mainframe storage networking
ESCON, 17–18
FICON, 18–20
JK
jumbo frames, FCIP data transfer
optimization, 345
Kerberos, 387
L
LACP (Link Aggregation Control
Protocol), 145
lambda, 27
LAN-free backups, 8
Layer 1 (physical layer), 41
Layer 2 (data link layer), 42
Layer 3 (network layer), 42
Layer 3 switching, 206
Layer 4 (transport layer), 43
Layer 5 (session layer), 44
Layer 6 (presentation layer), 44
Layer 7 (application layer), 44
layers of OSI reference model, relativity to
Fibre Channel model layers, 50
LCW (Link Code Word), 148
leading connections, 257
leading login request, 257
lines, 27
link aggregation
Ethernet, 145
FC, 177
on SAM architecture, 125
SPI, 131
link initialization
Ethernet, 146, 149
FC, 177–180, 183–189
SPI, 132
linked command, 253
link-level flow control, 69
link-state routing protocols
LSAs, 370
OSPF, 372
LIRs (Local Internet Registries), 204
living protocols, 74
LLS_RJT ELS, fields, 309–310
load balancing
and load sharing, 377
end node load balancing, 382
FC-based, 380
IP-based, 378
best-path determination, 379
GLBP, 379
on Ethernet, 378
SAM architecture, 123
session-oriented
FCIP load balancing, 382
FCP load balancing, 382
iSCSI load balancing, 381
load sharing, 377
local broadcasts, 369
logical IP network boundaries, 205–206
login phase (iSCSI), 80, 251
parameters, 282–286
Login Request BHS, fields, 258–259
Login Response BHS, fields, 260–261
loop initialization, effect on FCAL
performance, 87
loop suppression, 60
on Ethernet switching protocols, 371
on IP routing protocols, 368
loop topologies, 60
LS_ACC ELS, fields, 312–313
LS_ACC Service Parameter Page, fields, 307–309
LSAs (link state advertisements), 370
LUN (Logical Unit Numbers), 66, 111
discovery process, 97
masking, 392
M
MAC-48 address format, 114–115
mainframe storage networking
ESCON, 17–18
FICON, 18–20
management protocols, 388, 395
management stations, 395
manual device/service discovery for IPS
protocols, 79
MaxBurstLength operational text key, 284
MaxConnections operational text key, 285
MaxOutstandingR2T operational text key, 284
MaxRecvDataSegmentLength operational
text key, 284
media access
Ethernet, 138
FC, 167
SPI, 129
media changer element addressing (SAM), 113
media transport element, 113
message synchronization schemes, 289
messages, ICMP, 212–213
Metcalfe, Robert, 67
mitigating crosstalk, 6
mode select command, 294
modeling languages, XML, 399
move medium command, 113
MSA (Multi-Source Agreement), 126
MSS (Maximum Segment Size) option, 227
MSTP (Multiple Spanning Tree Protocol), 371, 378
MTBF (mean time between failures), 13
multi-access security control, RBAC, 385
multi-byte option format (TCP), 226
MSS option, 227
Sack option, 229–230
Sack-Permitted option, 229
Selective Acknowledgement option, 229
Timestamps option, 231–232
Window Scale option, 227
multicasts, 369
multidrop bus, 65
multiport bridges, 72
N
N_Port (node port), 179
NAA (Network Address Authority), 114
node names, 246
NAM (Network Analysis Module), 408
name resolution
FC, 161
FCP, 293
iSCSI, 250–251
Name Service Heartbeat, 83
names (SAM), 110
NAS filers, iSCSI-to-FCP protocol
conversion, 95
native mode (FICON), 19
NBNS (NetBIOS Name Service), 79
NDMP (Network Data Management
Protocol), 23
network boundaries, 205–206
Ethernet, 139–140
FC, 167–168
network implementations, 39
Network layer, 42
addressing, 197
comparing with Data Link layer
technologies, 195
ICMP, 211
messages, 212–213
interface and port initialization, 214
IPv4, 193194
address assignment and resolution,
204–205
addressing, 198–199
delivery mechanisms, 209–210
name assignment and resolution,
202–203
network boundaries, 205–206
network models, 39
network portals, 247
network specifications, 39
network-based storage virtualization, 95
networking devices, 194
network-level load balancing, 123
NFS, 22
NIC teaming, 84
niche technologies, FC PTP topology, 87
NL_Ports, 87
node names (iSCSI), 245–246
node-level load balancing, 123
non-immediate commands, 255, 292
NOP-In PDUs, iSCSI error handling,
290–291
normal arbitration (SPI), 129–130
normal name resolution, 250
normal sessions (iSCSI), 251
full feature phase, 252
Login phase, parameters, 282–286
termination, 252
NPIV (N_Port_ID Virtualization), 160
O
OASIS (Organization for the Advancement of
Structured Information Standards), 399
OFMarker operational text key, 286
OFMarkInt operational text key, 286
OpenSSH distribution, 388
operating ranges
for Ethernet, 132–134
for FC, 150–153
operation codes (iSCSI), 256
operational parameter negotiation stage of iSCSI
login phase, 252, 282–286
P
PAA (FC Port Analyzer Adapter), 407
packet switching, 43
packets, 43
fragmentation, 121–122
IPv4, 207–209
TCP, format, 223–224, 226
UDP, format, 219–220
parallel 10GFC implementations, 86
parallel bus interfaces, 15
parameters for iSCSI Login Phase, 282–286
parent domains, 203
PATA (parallel ATA), 12
path determination, 367
path vector protocols, 373
Pause Opcode (Pause Operation Code), 355–356
PDUs (protocol data units), 40. See also IUs
iSCSI
BHS, 255–264, 267–282
protocol-aware interfaces, 28
PTP (point-to-point) topologies, 60
public extension keys, 286
Q
QAS (Quick Arbitration and Selection),
129–130
QoS mechanisms
Ethernet, 355–356
FC, 362
FCIP, 364
FCP, 363
IP, 358
SAM, 120–121
TCP, 360
queuing, 195
management algorithms, 352
scheduling algorithms, 352
tail drop, 355
advanced tail-drop, 356
IP flow control mechanism, 357
R
R2T (Ready To Transfer) PDU, 360
R2T BHS, fields, 271, 273
RACLs (router ACLs), 390
RADIUS (Remote Access Dial-In User
Service), 386
RAID, storage virtualization, 35
RARP (reverse ARP), 73
RBAC (Role Based Access Control), 385
R-commands, 388
RDMA (remote direct memory access), 22
reactive flow control mechanisms, 353
read element status command, 113
read operations, 252
Reason codes
for EBP Rejects, 342
for FCP_RJT, 318
for FC SW_RJT frames, 188–189
reassembly, 124
REC ELS, fields, 310–311
Recovery Abort procedure, FCP, 321
S
Sack option (TCP), 229–230
Sack-Permitted option (TCP), 229
SAF-TE (SCSI Accessed Fault-Tolerant
Enclosures), 402
SAL (SCSI application layer), 46
SAM (SCSI-3 architecture model), 45
addressing
element addressing scheme, 112
LUNs, 111
media changer element addressing, 113
session layer, 44
session-oriented load balancing
FCIP load balancing, 382
FCP load balancing, 382
iSCSI load balancing, 381
sessions
FCIP, establishing, 342
FCP, establishing, 294, 304
iSCSI, connection allegiance, 293
SessionType operational text key, 283
SFF (Small Form Factor) committee, 12
SFTP (SSH FTP), 388
show commands, 405
signal splitters, 407
signaling
Ethernet, 135–137
FC, 154–155
SPI, 128
single-byte option format (TCP), 226
sliding windows (TCP), 359
slow start algorithm, 76
SLP (Service Location Protocol)
device/service discovery, automated
configuration, 81–84
scopes, 82
SMI (Structure of Management
Information), 397
SMIS (Storage Management Initiative
Specification), 399
SNACK Request BHS, fields, 274–275
SNAP (Subnetwork Access Protocol), 142
SNIA (Storage Networking Industry
Association), 399
SNMP (Simple Network Management
Protocol), 388
community strings, 388
versus ISMF, 397
SNMPv2, 397
SNMPv3, 398
SOAs (semiconductor optical amplifiers), 27
sockets, 223
soft zones (FC), 90, 392
software-based protocol decoding, 407
solicited data, 254
SONET/SDH, 25–27
EORs, 27
lines, 27
sections, 27
DAFS, 22–23
NFS, 22
storage virtualization, 32–37
block-oriented, 33
file-oriented, 33–34
network-based, 95
storage-enabled networks, 11
STP (Spanning Tree Protocol), 370
stripe sets, 33
STS-1 (synchronous transport signal
level 1), 26
subnet prefixes, 202
subnetting, 199–200
supplicants, 389
SW_ILS (Switch Internal Link Service), 393
SW_RSCNs (inter-switch state change
notifications), 90
switches, 42
auto-sensing, 68
path determination, 367
switching protocols, 370–371
FSPF, 374
loop suppression, 371
symmetric flow control, 355
Syslog protocol, 400
T
TACACS+, 387
tail drop, 355
advanced tail-drop, 356
IP flow control mechanism, 357
target discovery, 97
Target Failure, 265
TargetAddress operational text key, 283
TargetAlias operational text key, 284
TargetName operational text key, 283
TargetPortalGroupTag operational
text key, 284
targets, 14
task retry identification, 323
TCP, 221–223
connection initialization, 234–235
delivery mechanisms, 232–234
flow control, 233, 359
keep-alive mechanism, 222
U
UAs, DA discovery mechanisms, 81
UDP (User Datagram Protocol)
connection initialization, 221
delivery mechanisms, 220
fragmentation and reassembly, 221
packets, format, 219–220
ULP multiplexing, 218
ULP throughput, 56
of Ethernet, 71
of FC, 86
of iSCSI, calculating, 76
ULPs (upper-layer protocols), 217
unbound TCP connections, 105
unintelligent electrical interfaces, 12
Unix, R-commands, 388
unsolicited data, 253
Unzoned Name Server (FC Management
Service), 401
V
VACLs (VLAN ACLs), 390
VE_Port functional model (FCIP), 325–328
W
WBEM (Web-Based Enterprise
Management), 398
WDM window, 27
Web Services, 399
Window Scale option (TCP), 227
windowing, 75
WKAs (well-known addresses), 89
WKPs (well-known ports), assignment
of, 218
wrapping, 232
WRED (Weighted RED), 357
write operations, 253
WWN (Worldwide Names), 392
WWN assignment (FCIP), 331–333
X
XDF (extended distance facility), 18
XDR (external data representation), 22
XML (Extensible Markup Language), 399
XTACACS (Extended TACACS), 387
ciscopress.com
Cisco Press
3 STEPS TO LEARNING
STEP 1
STEP 2
STEP 3
First-Step
Fundamentals
Networking
Technology Guides
Cisco Press
Your first-step to
networking starts here
Are you new to the world of networking? Whether you are beginning your networking career
or simply need a better understanding of a specific technology to have more meaningful
discussions with networking experts, Cisco Press First-Step books are right for you.
No experience required
Includes clear and easily understood explanations
Makes learning easy
Check out each of these First-Step books that cover key networking topics
Computer
Networking
First-Step
ISBN: 1-58720-101-1
LAN Switching
First-Step
Network Security
First-Step
Voice over IP
First-Step
ISBN: 1-58720-099-6
ISBN: 1-58720-156-9
TCP/IP First-Step
Routing First-Step
ISBN: 1-58720-108-9
ISBN: 1-58720-122-4
Wireless
Networks
First-Step
ISBN: 1-58720-111-9
ISBN: 1-58720-100-3
Cisco Press
FUNDAMENTALS SERIES
ESSENTIAL EXPLANATIONS AND SOLUTIONS
1-57870-168-6
IP Addressing Fundamentals
ISBN: 1-58705-067-6
IP Routing Fundamentals
ISBN: 1-57870-071-X
Visit www.ciscopress.com/series for details about the Fundamentals series and a complete list of titles.
Cisco Press
1-58705-139-7
Cisco Press
Learning is serious business.
Invest wisely.
Cisco Press
1-58705-142-7
1-58720-046-5
Gain hands-on
experience with
Practical Studies
books
Practice testing
skills and build
confidence with
Flash Cards and
Exam Practice
Packs
1-58720-079-1
1-58720-083-X
Cisco Press
1-57870-041-8
ciscopress.com
Cisco Press
1-58720-121-6
Visit www.ciscopress.com/netbus for details about the Network Business series and a complete list
of titles.
Cisco Press
SAVE UP TO 30%
Become a member and save at ciscopress.com!
Go to http://www.ciscopress.com/safarienabled
Complete the brief registration form
Enter the coupon code found in the front of this
book before the Contents at a Glance page
If you have difficulty registering on Safari Bookshelf or accessing the online edition,
please e-mail [email protected].
SEARCH THOUSANDS
OF BOOKS FROM
LEADING PUBLISHERS
Safari Bookshelf is a searchable electronic reference library for IT
professionals that features more than 2,000 titles from technical
publishers, including Cisco Press.
Search the full text of thousands of technical books, including more than 70 Cisco Press
titles from authors such as Wendell Odom, Jeff Doyle, Bill Parkhurst, Sam Halabi, and
Karl Solie.
Read the books on My Bookshelf from cover to cover, or just flip to the information
you need.
With a customized library, you'll have access to your books when and where you need
them, and all you need is a user name and password.