DS8800 Performance
Monitoring and Tuning
Understand the performance and
features of the DS8800 architecture
Gero Schmidt
Bertrand Dufrasne
Jana Jamsek
Peter Kimmel
Hiroaki Matsuno
Flavio Morais
Lindsay Oxenham
Antonio Rainero
Denis Senin
ibm.com/redbooks
International Technical Support Organization
July 2012
SG24-8013-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page xv.
This edition applies to the IBM System Storage DS8700 with DS8000 Licensed Machine Code (LMC) level
6.6.2x.xxx and the IBM System Storage DS8800 with DS8000 Licensed Machine Code (LMC) level
7.6.2x.xxx.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
The team who wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
6.1.4 Output information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.1.5 Workload growth projection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.1.6 Disk Magic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.2 Disk Magic for System z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.2.1 Process the DMC file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.2.2 zSeries model to merge the two ESS-800s to a DS8300 . . . . . . . . . . . . . . . . . . 188
6.2.3 Disk Magic performance projection for the zSeries model . . . . . . . . . . . . . . . . . 195
6.2.4 Workload growth projection for zSeries model . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.3 Disk Magic for Open Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.3.1 Process the Tivoli Storage Productivity Center CSV output file . . . . . . . . . . . . . 199
6.3.2 Open Systems model to merge the two ESS-800s to a DS8300 . . . . . . . . . . . . 206
6.3.3 Disk Magic performance projection for an Open Systems model . . . . . . . . . . . . 212
6.3.4 Workload growth projection for an Open Systems model . . . . . . . . . . . . . . . . . . 213
6.4 Disk Magic SSD modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.4.1 SSD advisor example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.5 Disk Magic Easy Tier modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.5.1 Easy Tier prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.6 General configuration planning guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.7 Storage Tier Advisor Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.7.1 STAT output samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.6 HP-UX disk I/O architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
9.6.1 HP-UX High Performance File System (HFS). . . . . . . . . . . . . . . . . . . . . . . . . . . 382
9.6.2 HP-UX Journaled File System (JFS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
9.6.3 HP Logical Volume Manager (LVM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
9.6.4 Veritas Volume Manager (VxVM) for HP-UX . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
9.6.5 PV Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
9.6.6 Native multipathing in HP-UX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
9.6.7 Subsystem Device Driver (SDD) for HP-UX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.6.8 Veritas Dynamic MultiPathing (DMP) for HP-UX . . . . . . . . . . . . . . . . . . . . . . . . 389
9.6.9 Array Support Library (ASL) for HP-UX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.6.10 FC adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.7 HP-UX performance monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.7.1 HP-UX System Activity Report (SAR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.7.2 vxstat. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
9.7.3 HP Perfview/Measureware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
9.8 Verifying your system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
14.2.2 HyperPAV compared to dynamic PAV test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
14.2.3 PAV and large volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
14.3 Multiple Allegiance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
14.4 How PAV and Multiple Allegiance work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
14.4.1 Concurrent read operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
14.4.2 Concurrent write operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
14.5 I/O Priority Queuing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
14.6 I/O Priority Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
14.7 Logical volume sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
14.7.1 Selecting the volume size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
14.7.2 Larger volume compared to smaller volume performance . . . . . . . . . . . . . . . . 508
14.7.3 Planning the volume sizes of your configuration. . . . . . . . . . . . . . . . . . . . . . . . 510
14.8 FICON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
14.8.1 Extended Distance FICON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
14.8.2 High Performance FICON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
14.8.3 MIDAW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
14.9 z/OS planning and configuration guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
14.9.1 Channel configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
14.9.2 Considerations for mixed workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
14.10 DS8000 performance monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
14.11 RMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
14.11.1 I/O response time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
14.11.2 I/O response time components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
14.11.3 IOP/SAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
14.11.4 FICON host channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
14.11.5 FICON director . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
14.11.6 Processor complex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
14.11.7 Cache and NVS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
14.11.8 DS8000 FICON/Fibre port and host adapter. . . . . . . . . . . . . . . . . . . . . . . . . . 530
14.11.9 DS8000 extent pool and rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Chapter 15. IBM System Storage SAN Volume Controller attachment . . . . . . . . . . . 533
15.1 IBM System Storage SAN Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
15.1.1 SAN Volume Controller concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
15.1.2 SAN Volume Controller multipathing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
15.1.3 SAN Volume Controller Advanced Copy Services . . . . . . . . . . . . . . . . . . . . . . 538
15.2 SAN Volume Controller performance considerations . . . . . . . . . . . . . . . . . . . . . . . . 539
15.3 DS8000 performance considerations with SAN Volume Controller . . . . . . . . . . . . . 541
15.3.1 DS8000 array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
15.3.2 DS8000 rank format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
15.3.3 DS8000 extent pool implications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
15.3.4 DS8000 volume considerations with SAN Volume Controller. . . . . . . . . . . . . . 543
15.3.5 Volume assignment to SAN Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . 544
15.4 Performance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
15.4.1 Monitoring the SAN Volume Controller with TPC for Disk . . . . . . . . . . . . . . . . 545
15.4.2 TPC Reporter for Disk. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
15.4.3 The TPC Storage Tiering Reports and Storage Performance Optimizer . . . . . 546
15.5 Sharing the DS8000 between a server and the SAN Volume Controller . . . . . . . . . 547
15.5.1 Sharing the DS8000 between Open Systems servers and the IBM SAN Volume
Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
15.5.2 Sharing the DS8000 between System z servers and the IBM SAN Volume
Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
15.6 Advanced functions for the DS8000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
18.2 FlashCopy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
18.2.1 FlashCopy objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
18.2.2 FlashCopy performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
18.2.3 Performance planning for IBM FlashCopy SE . . . . . . . . . . . . . . . . . . . . . . . . . 594
18.3 Metro Mirror. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
18.3.1 Metro Mirror configuration considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
18.3.2 Metro Mirror performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
18.3.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
18.4 Global Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
18.4.1 Global Copy configuration considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
18.4.2 Global Copy performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
18.4.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
18.5 Global Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
18.5.1 Global Mirror performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
18.5.2 Global Mirror session parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
18.5.3 Avoid unbalanced configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
18.5.4 Growth within Global Mirror configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
18.6 z/OS Global Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
18.6.1 z/OS Global Mirror control dataset placement . . . . . . . . . . . . . . . . . . . . . . . . . 621
18.6.2 z/OS Global Mirror tuning parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
18.6.3 z/OS Global Mirror enhanced multiple reader. . . . . . . . . . . . . . . . . . . . . . . . . . 622
18.6.4 zGM enhanced multiple reader performance improvement . . . . . . . . . . . . . . . 622
18.6.5 XRC Performance Monitor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
18.7 Metro/Global Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
18.7.1 Metro/Global Mirror performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
18.7.2 z/OS Metro/Global Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
18.7.3 z/OS Metro/Global Mirror performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
18.8 Considerations for Easy Tier and remote replication . . . . . . . . . . . . . . . . . . . . . . . . 626
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX 5L™, AIX®, AS/400®, CICS®, Cognos®, DB2 Universal Database™, DB2®, developerWorks®, DS4000®, DS6000™, DS8000®, Easy Tier®, ECKD™, Enterprise Storage Server®, ESCON®, FICON®, FlashCopy®, GDPS®, Geographically Dispersed Parallel Sysplex™, GPFS™, HACMP™, i5/OS®, IBM®, IMS™, iSeries®, Lotus®, MVS™, Netcool®, OMEGAMON®, OS/390®, Parallel Sysplex®, Power Systems™, POWER6+™, POWER6®, POWER7®, PowerHA®, PowerPC®, PowerVM®, POWER®, ProtecTIER®, Rational®, Redbooks®, Redbooks (logo)®, Redpaper™, Resource Measurement Facility™, RMF™, Storwize®, Symphony™, Sysplex Timer®, System i®, System p®, System Storage®, System z9®, System z®, Tivoli Enterprise Console®, Tivoli®, XIV®, z/Architecture®, z/OS®, z/VM®, z/VSE®, z10™, z9®, zSeries®
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
ITIL is a registered trademark, and a registered community trademark of The Minister for the Cabinet Office,
and is registered in the U.S. Patent and Trademark Office.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows NT, Windows, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.
Snapshot, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other
countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
This IBM® Redbooks® publication provides guidance about how to configure, monitor, and
manage your IBM System Storage® DS8800 and DS8700 storage systems to achieve
optimum performance. It describes the DS8800 and DS8700 performance features and
characteristics, including IBM System Storage Easy Tier® and DS8000® I/O Priority
Manager. It also describes how they can be used with the various server platforms that attach
to the storage system. Then, in separate chapters, we detail specific performance
recommendations and discussions that apply for each server environment, as well as for
database and DS8000 Copy Services environments.
We also outline the various tools available for monitoring and measuring I/O performance for
different server environments, as well as describe how to monitor the performance of the
entire DS8000 storage system.
This book is intended for individuals who want to maximize the performance of their DS8800
and DS8700 storage systems and investigate the planning and monitoring tools that are
available.
The IBM System Storage DS8800 and DS8700 storage system features, as described in this
book, are available for the DS8700 with Licensed Machine Code (LMC) level 6.6.2x.xxx or
higher and the DS8800 with Licensed Machine Code (LMC) level 7.6.2x.xxx or higher.
For information about optimizing performance with the previous DS8000 models, DS8100
and DS8300, see the following IBM Redbooks publication: DS8000 Performance Monitoring
and Tuning, SG24-7146.
Performance: Any sample performance measurement data provided in this book is for
comparative purposes only. Remember that the data was collected in controlled laboratory
environments at a specific point in time by using the configurations, hardware, and
firmware levels available at that time. Current performance in real-world environments can
vary. Actual throughput or performance that any user experiences also varies depending
on considerations, such as the I/O access methods in the user’s job, the I/O configuration,
the storage configuration, and the workload processed. The data is intended only to help
illustrate how different hardware technologies behave in relation to each other. Contact
your IBM representative or IBM Business Partner if you have questions about the expected
performance capability of IBM products in your environment.
Gero Schmidt is an IT Specialist in the IBM Advanced Technical Support (ATS) technical
sales support organization in Germany. He joined IBM in 2001 working at the European
Storage Competence Center (ESCC) in Mainz, providing technical support for a broad range
of IBM storage systems (ESS, DS4000®, DS5000, DS6000™, DS8000, storage area
networks (SAN) Volume Controller (SVC), and XIV®) in Open Systems environments. His
primary focus is on IBM Enterprise drive storage solutions, storage system performance, and
IBM Power Systems™ with AIX® including PowerVM® and PowerHA®. He participated in the
product rollout and major release beta test programs of the IBM System Storage
Bertrand Dufrasne is an IBM Certified Consulting I/T Specialist and Project Leader for
System Storage disk products at the International Technical Support Organization, San Jose
Center. He has worked at IBM in various I/T areas. He has authored many IBM Redbooks
publications and has also developed and taught technical workshops. Before joining the
ITSO, he worked for IBM Global Services as an Application Architect. He holds a Masters
degree in Electrical Engineering.
Jana Jamsek is an IT Specialist for IBM Slovenia. She works in Storage Advanced Technical
Support for Europe as a specialist for IBM Storage Systems and the IBM i (i5/OS®) operating
system. Jana has eight years of experience in working with the IBM System i® platform and
its predecessor models, as well as eight years of experience in working with storage. She has
a Masters degree in Computer Science and a degree in Mathematics from the University of
Ljubljana in Slovenia.
Peter Kimmel is an IT Specialist and ATS team lead of the Enterprise Disk Solutions team at
the European Storage Competence Center (ESCC) in Mainz, Germany. He joined IBM
Storage in 1999 and since then worked with all the various Enterprise Storage Server® (ESS)
and System Storage DS8000 generations, with a focus on architecture and performance. He
was involved in the Early Shipment Programs (ESPs) of these early installs, and co-authored
several DS8000 IBM Redbooks publications. Peter holds a Diploma (MSc) degree in Physics
from the University of Kaiserslautern.
Hiroaki Matsuno is an IT Specialist in IBM Japan. He has three years of experience in IBM
storage system solutions working in the IBM ATS System Storage organization in Japan. His
areas of expertise include DS8000 Copy Services, SAN, and Real-time Compression
Appliance in Open Systems environments. He holds a Masters of Engineering from the
University of Tokyo, Japan.
Flavio Morais is a GTS Storage Specialist in Brazil. He has six years of experience in the
SAN/storage field. He holds a degree in Computer Engineering from Instituto de Ensino
Superior de Brasilia. His areas of expertise include DS8000 Planning, Copy Services, TPC
and Performance Troubleshooting. He has extensive experience solving performance
problems with Open Systems.
Antonio Rainero is a Certified IT Specialist working for the Integrated Technology Services
organization in IBM Italy. He joined IBM in 1998 and has more than 10 years of experience in
the delivery of storage services both for z/OS® and Open Systems clients. His areas of
expertise include storage subsystems implementation, performance analysis, SANs, storage
virtualization, disaster recovery, and high availability solutions. Antonio holds a degree in
Computer Science from University of Udine, Italy.
Andreas Bär, Werner Bauer, Lawrence Chiu, Nick Clayton, Thomas Edgerton, Lee La Frese,
Mark Funk, Kirill Gudkov, Hans-Paul Drumm, Rob Jackard, Chip Jarvis, Frank Krüger, Yang
Liu (Loren), Michael Lopez, Joshua Martin, Dennis Ng, Markus Oscheka, Brian Sherman,
Rick Ripberger, Louise Schillig, Günter Schmitt, Uwe Schweikhard, Jim Sedgwick,
Cheng-Chung Song, Markus Standau, Paulus Usong, Alexander Warmuth, Sonny Williams,
and Yan Xu
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Data continually moves from one component to another component within a storage server.
The objective of server design is to provide hardware with sufficient throughput to keep data flowing smoothly, without waits because a component is busy. When data stops flowing because a component is busy, a bottleneck forms. Obviously, it is desirable to minimize the frequency
and severity of bottlenecks.
The ideal storage server is one in which all components are used and bottlenecks are few.
This scenario is the case if the following conditions are met:
The machine is designed well, with all hardware components in balance. To provide this
balance over a range of workloads, a storage server must allow a range of hardware
component options.
The machine is sized well for the client workload. That is, where options exist, the right
quantities of each option are chosen.
The machine is set up well. That is, where options exist in hardware installation and logical
configuration, these options are chosen correctly.
Automatic rebalancing and tiering options can help achieve optimum performance even in an
environment of ever changing workload patterns. But they cannot replace a correct sizing of
the machine.
Throughput numbers are achieved in controlled tests that push as much data as possible
through the storage server as a whole, or perhaps through a single component. At the point of
maximum throughput, the system is so overloaded that response times are greatly extended.
Trying to achieve such throughput numbers in a normal business environment brings protests
from the users of the system, because response times are poor.
To assure yourself that the DS8800 offers the latest and fastest technology, look at the
performance numbers for the individual disks, adapters, and other components of the
DS8800, as well as for the total device. The DS8800 uses the most current technology
available. But, use a more rigorous approach when planning the DS8800 hardware
configuration to meet the requirements of a specific environment.
For additional information about this tool, see 6.1, “Disk Magic” on page 176.
The first method is spreading the workloads across components, which means that you try to
share the use of hardware components across all, or at least many, workloads. The more
hardware components are shared among multiple workloads, the more effectively the
hardware components are used, which reduces total cost of ownership (TCO). For example,
to attach multiple hosts, you can use the same host adapters for all hosts instead of acquiring
a separate set of host adapters for each host. However, the more that components are
shared, the more potential exists that one workload dominates the use of the component.
The second method is isolating workloads to specific hardware components, which means
that specific hardware components are used for one workload, and other hardware
components are used for different workloads. The downside of isolating workloads is that
certain components are unused when their workload is not demanding service. On the
upside, when that workload demands service, the component is available immediately. The
workload does not contend with other workloads for that resource.
Spreading the workload maximizes the usage and performance of the storage server as a
whole. Isolating a workload is a way to maximize the workload performance, making the
workload run as fast as possible. Automatic I/O prioritization can help avoid a situation in
which less-important workloads dominate the mission-critical workloads in shared
environments, and allow more shared environments.
For a detailed discussion, see 4.2, “Configuration principles for optimal performance” on
page 90.
Before these models, the first DS8000 generation offered the DS8100 (2-way) and DS8300
(4-way). The second generation offered the DS8100 Turbo and DS8300 Turbo models with
faster processors. The DS8300 and DS8100, however, do not share the current code level
options that exist for DS8800 and therefore are not covered in this book.
The DS8800 base frame houses the processor complexes, including system memory, up to
eight host adapters, and up to 240 disk modules. The first expansion frame houses up to
eight additional host adapters (for a total of 16) and up to 336 additional disk modules (for a
total of 576). A second or third DS8800 expansion frame houses up to 480 disk modules (for
a total of 1056 or 1536). There are no additional host adapters installed for the second or third
expansion frames. The maximum disk numbers for DS8800 are valid when using the 2.5 inch
small form factor (SFF) disks with 15K or 10K rpm. When using 3.5 inch nearline (7,200 rpm)
disks, the maximum numbers for these disks are half, due to their dimensions.
Table 1-1 provides an overview of the DS8000 models, including processor, memory, host
adapter, and disk specifications for each model. Each of the models comes with one Storage
Facility Image (SFI). All DS8800 Standard models can be upgraded nondisruptively from the
smallest 2-way model up to the largest multi-frame 4-way model.
Number of processor complexes: 2 | 2 | 2 | 2 | 2
Processor speed: 5.0 GHz | 5.0 GHz | 5.0 GHz | 4.7 GHz | 4.7 GHz
Disk drive interface technology: SAS 6 Gbps | SAS 6 Gbps | SAS 6 Gbps | FC 4 Gbps | FC 4 Gbps
Internal communication
The DS8000 comes with a high-bandwidth, fault-tolerant internal interconnection, which is
also used in the IBM Power Systems servers. It is called RIO-G (Remote I/O) and is used for cross-cluster communication.
Disk drives
The DS8800 offers a selection of industry-standard Serial Attached SCSI (SAS 2.0) disk drives. Most drive types (15,000 and 10,000 rpm) use the 6.35 cm (2.5-inch) small form factor (SFF), with drive capacities of 146 GB - 900 GB. The solid-state drives (SSDs) are 2.5-inch SAS 2.0 drives. The nearline drives (7,200 rpm) are 3.5-inch drives with a SAS interface.
With the current maximum number and type of drives, the storage system can scale to over
2 petabytes (PB) of total gross capacity. All drive types of the previous DS8000 generations,
such as DS8700, use 3.5 inch Fibre Channel (FC) disk drives.
Host adapters
The DS8800 offers enhanced connectivity with the availability of either eight-port or four-port
FC/FICON host adapters (DS8700: four ports per host bus adapter (HBA) only). The 8 Gb/s
Fibre Channel/FICON host adapters, which are offered in longwave and shortwave versions,
can also auto-negotiate to 4 Gb/s or 2 Gb/s link speeds. With this flexibility, you can benefit
from the higher performance, 8 Gb/s SAN-based solutions and also maintain compatibility
with existing infrastructures. In addition, you can configure the ports on the adapter with an
intermix of FCP and FICON, which can help protect your investment in Fibre adapters and
increase your ability to migrate to new servers. A DS8800 can support up to a maximum of 16
host adapters, which provide up to 128 FC/FICON ports.
With all these new components, the DS8800 is positioned at the top of the high-performance
category. The following hardware components contribute to the high performance of the
DS8000:
Redundant Array of Independent Disks (RAID)
Array across loops (AAL)
POWER6® symmetrical multiprocessor system (SMP) processor architecture
Switched PCIe 8-Gbps architecture.
AMP (Adaptive Multi-stream Prefetching) provides provably optimal sequential read performance and maximizes the sequential read throughput of all RAID arrays where it is used, and therefore of the system.
The IBM Subsystem Device Driver (SDD) multipathing software is provided with the DS8000 series at no additional charge. Fibre Channel (SCSI-FCP) attachment configurations are supported in the AIX, Hewlett-Packard UNIX (HP-UX), Linux, Microsoft Windows, and Oracle Solaris environments.
In addition, the DS8000 series supports the built-in multipath options of many distributed
operating systems.
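For hosts that use SDD, a quick way to verify that all expected paths to the DS8000 volumes are available is the SDD datapath query command set. The following is only a minimal sketch; the output format differs by operating system and SDD version:

# datapath query adapter     (lists the host Fibre Channel adapters known to SDD)
# datapath query device      (lists each SDD device together with the state of its paths)

If paths are missing or failed, check the SAN zoning, the host adapters, and the DS8000 host connection definitions before you investigate performance.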
Easy Tier determines the appropriate tier of storage based on data access requirements. It
then automatically and nondisruptively moves data, at the subvolume or sublogical unit
number (LUN) level, to the appropriate tier on the DS8000.
IBM announced Easy Tier with R5.1 (LMC 6.5.1) for the DS8700. IBM enhanced Easy Tier
with the R6.1 (LMC 7.6.10/6.6.10) and R6.2 (LMC 7.6.20/6.6.20) microcode versions that are
currently available for DS8800 and DS8700. Starting with DS8000 R6.2 LMC, Easy Tier
Automatic Mode provides automatic inter-tier or cross-tier storage performance and storage
economics management for up to three tiers. It also provides automatic intra-tier performance
management (auto-rebalance) in multi-tier (hybrid) or single-tier (homogeneous) extent pools.
IBM System Storage Easy Tier is designed to balance system resources to address
application performance objectives. It automates data placement across Enterprise-class
(Tier 1), nearline (Tier 2), and SSD (Tier 0) tiers, as well as among the ranks of the same tier
(auto-rebalance). The system can automatically and nondisruptively relocate data (at the extent level) across any two or three tiers of storage, and full volumes can be relocated manually. The
potential benefit is to align the performance of the system with the appropriate application
workloads. This enhancement can help clients to improve storage usage and address
performance requirements for multi-tier systems that do not yet deploy SSDs. For those
clients that already use some percentage of SSDs, Easy Tier analyzes the system and
migrates only those parts of the volumes, whose workload patterns benefit most from the
valuable SSD space, to the higher tier.
Easy Tier also provides a performance monitoring capability, regardless of whether the Easy
Tier license feature is activated. Easy Tier uses the monitoring process to determine what
data to move and when to move it when using automatic mode. The usage of thin
provisioning (extent space-efficient (ESE) volumes for Open Systems) is also possible with
Easy Tier.
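As a purely illustrative sketch, a thin-provisioned FB volume that Easy Tier can manage might be created with the DSCLI as follows. The -sam ese option, the extent pool ID P1, the capacity, and the volume ID 1000 are assumptions for this example; verify the exact mkfbvol syntax in the DSCLI reference for your microcode level.

dscli> mkfbvol -extpool P1 -cap 100 -sam ese -name et_thin_1000 1000

The volume initially consumes only the extents that are actually written to, and Easy Tier then places and migrates those extents like the extents of any standard volume.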
You can enable monitoring independently (with or without the Easy Tier license feature
activated) for information about the behavior and benefits that can be expected if automatic
mode is enabled. Data from the monitoring process is included in a summary report that you
can download to your Microsoft Windows system. Use the IBM System Storage DS8000
Storage Tier Advisor Tool (STAT) application to view the data when you point your browser to
that file.
To download the STAT monitoring tool, use the following web link:
http://www.ibm.com/support/docview.wss?uid=ssg1S4000982
IBM Easy Tier is described in detail in IBM System Storage DS8000 Easy Tier, REDP-4667.
Prerequisites
The following conditions must be met to enable Easy Tier:
The Easy Tier license feature is enabled (required for both manual and automatic mode,
except when monitoring is set to All Volumes).
For automatic mode to be active, the following conditions must be met:
– Easy Tier automatic monitoring is set to either All or Auto Mode.
– For Easy Tier to manage pools, the Easy Tier Auto Mode setting must be set to either
Tiered Pools or All Pools (in the Storage Image Properties panel in the DS GUI).
In Easy Tier, both I/O per second (IOPS) and bandwidth algorithms determine when to
migrate your data. This process can help you improve performance.
Using automatic mode, you can use high-performance storage tiers with a smaller cost.
Therefore, you invest a small portion of storage in the high-performance storage tier. You can
use automatic mode for relocation and tuning without intervention. Automatic mode can help
generate cost-savings while optimizing your storage performance.
Three-tier automatic mode is supported with the following Easy Tier functions:
Support for ESE volumes with the thin provisioning of your FB volumes
Support for a matrix of device or disk drive modules (DDMs) and adapter types
Enhanced monitoring of both bandwidth and IOPS limitations
Enhanced data demotion between tiers
Automatic mode auto-performance rebalancing (auto-rebalance), which applies to the
following situations:
– Redistribution within a tier after a new rank is added into a managed pool (adding new
capacity)
– Redistribution within a tier after extent pools are merged
– Redistribution within a tier after a rank is removed from a managed pool
– Redistribution when the workload is imbalanced on the ranks within a tier of a
managed pool (natural skew)
To help manage and improve performance, Easy Tier is designed to identify hot data at the
subvolume or sub-LUN (extent) level, based on ongoing performance monitoring. Then, it
automatically relocates that data to an appropriate storage device in an extent pool that is
managed by Easy Tier. Easy Tier uses an algorithm to assign heat values to each extent in a
storage device. These heat values determine which tier is best for the data, and migration
takes place automatically. Data movement is dynamic and transparent to the host server and
to applications that use the data.
By default, automatic mode is enabled (through the DSCLI and DS Storage Manager) for
heterogeneous pools, when the Easy Tier license feature is activated. You can temporarily
disable automatic mode.
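These monitoring and automatic mode controls can also be set with the DSCLI. The following sketch assumes the -etmonitor and -etautomode parameters of the chsi command and reuses the storage image ID that appears in the examples later in this chapter; confirm the parameter names and values in the DSCLI reference for your code level.

dscli> showsi IBM.2107-75TV181                    (displays the current Easy Tier monitor and automatic mode settings)
dscli> chsi -etmonitor all IBM.2107-75TV181       (monitor all volumes, even without the Easy Tier license feature)
dscli> chsi -etautomode tiered IBM.2107-75TV181   (let Easy Tier automatic mode manage multi-tier pools)
dscli> chsi -etautomode none IBM.2107-75TV181     (temporarily disable automatic mode)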
Auto-rebalance
Auto-rebalance is a function of Easy Tier automatic mode to balance the utilization within a
tier by relocating extents across ranks based on usage. Auto-rebalance is now enhanced to
support single-tier managed pools, as well as multi-tier hybrid pools. You can use the Storage
In any tier, placing highly active (hot) data on the same physical rank can cause the hot rank
or the associated DA to become a performance bottleneck. Likewise, over time, skew can
appear within a single tier that cannot be addressed by migrating data to a faster tier alone. It
requires some degree of workload rebalancing within the same tier. Auto-rebalance
addresses these issues within a tier in both hybrid (multi-tier) and homogeneous (single-tier)
pools. It also helps the system to respond in a more timely and appropriate manner to
overload situations, skew, and any under-utilization. These conditions can occur for the
following reasons:
Addition or removal of hardware
Migration of extents between tiers
Merger of extent pools
Changes in the underlying volume configuration
Variations in the workload
The latest version of Easy Tier provides support for auto-rebalancing even within
homogeneous, single-tier extent pools. If you set the Easy Tier Auto Mode control to manage
All Pools, Easy Tier also manages homogeneous extent pools with only a single tier and
performs intra-tier performance rebalancing. If Easy Tier is turned off, no volumes are
managed. If Easy Tier is turned on, it manages all supported volumes, either standard or ESE
volumes. Track space-efficient (TSE) volumes are not supported with Easy Tier and
auto-rebalancing.
Warm demotion
To avoid overloading higher-performance tiers in hybrid extent pools, and thus potentially
degrading overall pool performance, Easy Tier Automatic Mode monitors performance of the
ranks. It can trigger the movement of selected extents from the higher-performance tier to the lower-performance tier based on either predefined bandwidth or predefined IOPS overload thresholds. The nearline tier drives perform almost as well as SSDs and Enterprise hard disk drives (HDDs) for sequential (high-bandwidth) operations.
This automatic operation is rank-based, and the target rank is randomly selected from the
lower tier. Warm demotion has the highest priority so that overloaded ranks are relieved quickly. In this way, Easy Tier continuously ensures that the higher-performance tier does not suffer from saturation or overload conditions that might affect the overall performance in the extent pool.
Auto-rebalancing movement takes place within the same tier, whereas warm demotion takes place across tiers. Auto-rebalance can be initiated when the rank configuration changes or when the workload is not balanced across the ranks of the same tier. Warm demotion is initiated when an overloaded rank is detected.
Cold demotion occurs when Easy Tier detects any of the following scenarios:
Segments in a storage pool become inactive over time, while other data remains active.
This scenario is the most typical use for cold demotion, where inactive data is demoted to
the Nearline tier. This action frees up segments on the Enterprise tier before the segments
on the Nearline tier become hot, which helps the system to be more responsive to new,
hot data.
In addition to cold demote, which uses the capacity in the lowest tier, segments with
moderate bandwidth but low random IOPS requirements are selected for demotion to the
lower tier in an active storage pool. This demotion better utilizes the bandwidth in the
Nearline tier (expanded cold demote).
If all the segments in a storage pool become inactive simultaneously due to either a planned
or an unplanned outage, cold demotion is disabled. Disabling cold demotion assists the user
in scheduling extended outages or when experiencing outages without changing the data
placement.
Figure 1-1 illustrates all of the migration types supported by the latest Easy Tier
enhancements in a three-tier configuration. The auto-rebalance might also include additional
swap operations.
Figure 1-1 Easy Tier migration types in a three-tier configuration: auto-rebalance within the SSD (highest performance), Enterprise HDD (higher performance), and nearline HDD (lower performance) tiers, plus promote, swap, and warm demote operations between the tiers
In Easy Tier manual mode, you can dynamically relocate a logical volume between extent
pools or within an extent pool to change the extent allocation method of the volume or to
redistribute the volume across new ranks. This capability is referred to as dynamic volume
relocation. You can also merge two existing pools into one pool without affecting the data on
the logical volumes associated with the extent pools. In an older installation with many pools,
you can introduce the automatic mode of Easy Tier with automatic inter-tier and intra-tier
storage performance and storage economics management in multi-rank extent pools with one
or more tiers. Easy Tier manual mode also provides a rank depopulation option to remove a
rank from an extent pool with all the allocated extents on this rank automatically moved to the
other ranks in the pool.
The enhanced functions of Easy Tier manual mode provide additional capabilities. You can
use manual mode to relocate entire volumes from one pool to another pool. Upgrading to a
new disk drive technology, rearranging the storage space, or changing storage distribution for
a workload are typical operations that you can perform with volume relocations.
You can more easily manage configurations that deploy separate extent pools with different
storage tiers or performance characteristics. The storage administrator can easily and
dynamically move volumes to the appropriate extent pool. Therefore, the storage
administrator can meet storage performance or storage economics requirements for these
volumes transparently to the host and the application. Use manual mode to achieve these
operations and increase the options to manage your storage.
Volume migration
You can select which logical volumes to migrate, based on performance considerations or
storage management concerns:
Migrate volumes from one extent pool to another. You might want to migrate volumes to a
different extent pool with more suitable performance characteristics, such as different disk
drives or RAID ranks. For example, a volume that was configured to stripe data across a single RAID array can be changed to stripe data across multiple arrays for better performance.
Also, as different RAID configurations or drive technologies become available, you might
want to move a logical volume to a different extent pool with different characteristics. You
might also want to redistribute the available disk capacity between extent pools.
Change the extent allocation method that is assigned to a volume. You can relocate a
volume within the same extent pool but with a different extent allocation method (EAM).
For example, you might want to change the extent allocation method to help spread I/O
activity more evenly across ranks. If you configured logical volumes in an extent pool with
fewer ranks than now exist in the extent pool, you can use Easy Tier volume migration to
manually redistribute the volumes across new ranks using manual volume rebalance. If
you specify a different extent allocation method for a volume, the new extent allocation
method is effective immediately.
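As a sketch of such a volume migration with the DSCLI, the following example assumes the managefbvol command with the -action migstart parameter (manageckdvol is the CKD counterpart), a target extent pool P2, rotate extents as the EAM, and volume IDs 1000-1003; treat this syntax as an assumption to verify against the DSCLI reference for your code level.

dscli> managefbvol -action migstart -eam rotateexts -extpool P2 1000-1003
dscli> showfbvol 1000                             (the output shows the extent pool of the volume and whether a migration is still in progress)

The migration runs in the background and is transparent to the host; the volume stays online during the relocation.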
During extent relocation with manual volume rebalance, only one extent at a time is allocated rather than pre-allocating the full volume. Only a minimum amount of free capacity is
required in the extent pool, and only the required number of extents is relocated.
The auto-rebalance option is preferred wherever it can be enabled for automatic rebalancing of the workload across the ranks of the same storage tier. It relocates extents based on their actual workload pattern and provides ongoing performance optimization, even when workload patterns change over time or additional ranks are added. Manual volume rebalance, in contrast, redistributes the volume capacity without considering any performance characteristics or natural performance skew.
Manual volume rebalance is rejected in any managed or hybrid extent pool. It does not matter whether the hybrid pool is currently managed or non-managed: hybrid pools are always assumed to be created for Easy Tier automatic mode management. Manual volume rebalance is only available in non-managed homogeneous (single-tier) extent pools.
Volume rebalance can be achieved by initiating a manual volume migration into the same pool
by using rotate extents as the EAM. Use volume migration to manually rebalance the volume
capacity when a rank is added to a pool, when homogeneous extent pools are merged, or
when large volumes with an EAM of rotate volumes are deleted. Manual volume rebalance is
also referred to as capacity rebalance, because it balances the distribution of extents without
factoring in the actual extent usage or workload.
Use volume rebalance to relocate the smallest number of extents of a volume and restripe the
extents of that volume on all available ranks of the pool where it is located.
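A corresponding sketch for manual volume rebalance, using the same assumed managefbvol syntax: the volume is migrated into the extent pool where it already resides, with rotate extents as the EAM, so that its extents are restriped across all ranks of that pool.

dscli> managefbvol -action migstart -eam rotateexts -extpool P1 1100    (P1 is the pool in which volume 1100 is already located)

Because only the minimum number of extents is relocated, the operation needs only a small amount of free capacity in the pool.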
Figure 1-2 describes the Easy Tier monitor settings in conjunction with the installation of the
Easy Tier LIC feature.
For an introduction of the STAT features, see 6.7, “Storage Tier Advisor Tool” on page 231.
It is increasingly common to use one storage system to serve many categories of workloads
with different characteristics and requirements. The widespread use of virtualization and the
advent of cloud computing that facilitates consolidating applications into a shared storage
infrastructure are common practices. However, business-critical applications can suffer
performance degradation because of resource contention with less important applications.
Workloads are forced to compete for resources, such as disk storage capacity, bandwidth,
DAs, and ranks.
The I/O Priority Manager maintains statistics for the set of logical volumes in each
performance group that can be queried. If management is performed for the performance
policy, the I/O Priority Manager controls the I/O operations of all managed performance
groups to achieve the goals of the associated performance policies. The performance group
of a volume defaults to PG0, if none is specified. Table 1-2 lists the performance groups that
are predefined with their associated performance policies.
Table 1-2 DS8000 I/O Priority Manager: Performance group to performance policy mapping
Performance group | Performance policy | Priority | QoS target | Ceiling (maximum delay factor [%]) | Performance policy description
Each performance group comes with a predefined priority and QoS target. Because mainframe volumes and Open Systems volumes are in separate extent pools with different rank sets, they do not interfere with each other, except in rare cases of overloaded DAs.
For Open Systems, several performance groups (PG1 - PG5) share the same priority and QoS characteristics, so you can put applications into different groups for monitoring purposes without assigning them different QoS priorities.
If the I/O Priority Manager detects resource overload conditions, such as resource contention
that leads to insufficient response times for higher-priority volumes, it throttles the I/O for
those volumes. The volumes are in lower-priority performance groups. This method allows
the higher-performance group applications to run faster and again meet their QoS targets.
Important: Lower-priority I/O operations are delayed by I/O Priority Manager only if
contention exists on a resource that causes a deviation from normal I/O operation
performance. The I/O operations that are delayed are limited to operations that involve the
RAID arrays or DAs that experience contention.
Performance groups are assigned to a volume at the time of the volume creation, as shown in
Figure 1-3.
Figure 1-3 Creating new volumes with an assigned performance group and priority
You can also assign performance groups to existing volumes by using the DSCLI chfbvol and
chckdvol commands. With the DS Storage Manager GUI, at any time, you can reassign
volumes online to other performance groups with a lower or higher QoS priority if
performance targets are either not met or constantly exceeded and inhibiting other
applications.
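As an illustration, reassigning existing volumes to other performance groups with the DSCLI might look like the following sketch. The chfbvol and chckdvol commands are named above; the -perfgrp parameter and the volume ranges are assumptions to verify in the DSCLI reference.

dscli> chfbvol -perfgrp pg3 1000-100F             (fixed block volumes 1000-100F now belong to performance group PG3)
dscli> chckdvol -perfgrp pg19 7000-700F           (CKD volumes 7000-700F now belong to performance group PG19)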
Modes of operation
I/O Priority Manager can operate in the following modes:
Disabled: I/O Priority Manager does not monitor any resources and does not alter any I/O response times.
Monitor: I/O Priority Manager monitors the resources and maintains statistics, but it does not alter any I/O response times.
Manage: I/O Priority Manager monitors the resources and, when contention is detected, delays lower-priority I/O operations so that higher-priority volumes can meet their QoS targets.
In both monitor and manage modes, I/O Priority Manager can send Simple Network
Management Protocol (SNMP) traps to alert the user when certain resources detect a
saturation event.
An interesting option of I/O Priority Manager is its ability to monitor the performance of the entire storage system. For instance, on a storage system where all the volumes are still in their default performance group PG0 (which is not managed by I/O Priority Manager), a regular performance report of the machine can be obtained. See Example 1-1.
Example 1-1 Monitoring default performance group PG0 for one entire month, in one-day intervals
dscli> lsperfgrprpt -start 32d -stop 1d -interval 1d pg0
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
==============================================================================================================
2011-10-01/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8617 489.376 0.554 0 43 0 0 0 0
2011-10-02/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11204 564.409 2.627 0 37 0 0 0 0
2011-10-03/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 21737 871.813 5.562 0 27 0 0 0 0
2011-10-04/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 21469 865.803 5.633 0 32 0 0 0 0
2011-10-05/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23189 1027.818 5.413 0 54 0 0 0 0
2011-10-06/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 20915 915.315 5.799 0 52 0 0 0 0
2011-10-07/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 18481 788.450 6.690 0 41 0 0 0 0
2011-10-08/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19185 799.205 6.310 0 43 0 0 0 0
2011-10-09/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19943 817.699 6.069 0 41 0 0 0 0
2011-10-10/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 20752 793.538 5.822 0 49 0 0 0 0
2011-10-11/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23634 654.019 4.934 0 97 0 0 0 0
2011-10-12/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23136 545.550 4.961 0 145 0 0 0 0
2011-10-13/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19981 505.037 5.772 0 92 0 0 0 0
2011-10-14/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5962 176.957 5.302 0 93 0 0 0 0
2011-10-15/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2286 131.120 0.169 0 135 0 0 0 0
2011-10-16/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2287 131.130 0.169 0 135 0 0 0 0
2011-10-17/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2287 131.137 0.169 0 135 0 0 0 0
2011-10-18/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 10219 585.908 0.265 0 207 0 0 0 0
2011-10-19/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 22347 1281.260 0.162 0 490 0 0 0 0
2011-10-20/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 13327 764.097 0.146 0 507 0 0 0 0
2011-10-21/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 9353 536.250 0.151 0 458 0 0 0 0
2011-10-22/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 9944 570.158 0.127 0 495 0 0 0 0
2011-10-23/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11753 673.871 0.147 0 421 0 0 0 0
2011-10-24/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11525 660.757 0.140 0 363 0 0 0 0
2011-10-25/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5022 288.004 0.213 0 136 0 0 0 0
2011-10-26/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5550 318.230 0.092 0 155 0 0 0 0
2011-10-27/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8732 461.987 0.313 0 148 0 0 0 0
2011-10-28/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 13404 613.771 1.434 0 64 0 0 0 0
2011-10-29/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25797 689.529 0.926 0 51 0 0 0 0
2011-10-30/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25560 725.174 1.039 0 49 0 0 0 0
2011-10-31/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25786 725.305 1.013 0 49 0 0 0 0
If all volumes are in their default performance group, which is PG0, a report like Example 1-1
on page 20 is possible. However, as soon as we want to start I/O Priority Manager QoS
management and throttling of the lower-priority volumes, we need to put the volumes into
non-default performance groups. We use PG1 - PG15 for Open Systems volumes and
PG19 - PG31 for CKD volumes.
Monitoring is then possible on the performance-group level, as shown in Example 1-2, the
RAID-rank level, as shown in Example 1-3, or the DA pair level, as shown in Example 1-4 on
page 22. Figure 1-4 shows how to obtain reports by using the DS Storage Manager GUI.
Example 1-2 Showing reports for a certain performance group PG28 for a certain time frame
dscli> lsperfgrprpt -start 3h -stop 2h pg28
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
============================================================================================================
2011-11-01/14:10:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 204 11.719 1.375 9 17 5 0 0 0
2011-11-01/14:15:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 401 23.020 1.273 9 18 5 0 0 0
2011-11-01/14:20:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 1287 73.847 1.156 9 28 5 0 0 0
2011-11-01/14:25:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 315 18.102 1.200 9 19 5 0 0 0
2011-11-01/14:30:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 261 15.013 1.241 9 22 5 0 0 0
2011-11-01/14:35:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 330 18.863 1.195 9 24 5 0 0 0
2011-11-01/14:40:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 535 30.670 1.210 9 19 5 0 0 0
2011-11-01/14:45:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 514 29.492 1.209 9 20 5 0 0 0
2011-11-01/14:50:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 172 9.895 1.191 9 21 5 0 0 0
2011-11-01/14:55:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 91 5.262 0.138 9 299 5 0 0 0
2011-11-01/15:00:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 0 0.000 0.000 9 0 5 0 0 0
2011-11-01/15:05:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 0 0.000 0.000 9 0 5 0 0 0
Figure 1-4 Getting I/O Priority Manager class reports and resource reports by using the GUI
The lsperfrescrpt command displays performance statistics for individual ranks, as shown
in Example 1-3 on page 21:
The first three columns show the average number of IOPS, throughput, and average
response time, in milliseconds, for all I/Os on that rank.
The %Hutl column shows the percentage of time that the rank utilization was high enough
(over 33%) to warrant workload control.
The %hlpT column shows the average percentage of time that I/Os were helped on this
rank for all performance groups. This column shows the percentage of time where lower
priority I/Os were delayed to help higher priority I/Os.
The %dlyT column specifies the average percentage of time that I/Os were delayed for all
performance groups on this rank.
The %impt column specifies, on average, the length of the delay.
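To illustrate how these columns relate to each other, the following Python sketch (a hypothetical helper, not part of the DSCLI) classifies one report interval for a rank based on the column meanings described above; the wording of the returned summaries is our own.

def classify_interval(hutl_pct, hlp_pct, dly_pct, impt_pct):
    """Summarize one lsperfrescrpt interval for a rank (illustration only).

    hutl_pct : %Hutl - percentage of time the rank utilization was high enough
                       (over 33%) to warrant workload control
    hlp_pct  : %hlpT - percentage of time higher-priority I/Os were helped
    dly_pct  : %dlyT - percentage of time I/Os were delayed on this rank
    impt_pct : %impt - average length (impact) of the delay
    """
    if hlp_pct > 0 or dly_pct > 0:
        return (f"QoS management active: I/Os helped {hlp_pct}% and delayed "
                f"{dly_pct}% of the time, average delay impact {impt_pct}%")
    if hutl_pct > 0:
        return "Rank busy enough to warrant control, but no throttling occurred"
    return "Rank below the utilization threshold; no QoS action taken"

# Example: a saturated interval
print(classify_interval(hutl_pct=80, hlp_pct=12, dly_pct=12, impt_pct=25))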
The DS8000 prioritizes access to system resources to achieve the desired QoS of each
volume, based on its defined performance goal (high, medium, or low). I/O Priority Manager
constantly monitors and balances system resources to help applications meet their
performance targets automatically, without operator intervention, based on input from the
zWLM software.
The zWLM integration for the I/O Priority Manager is available with z/OS V1.11 and higher.
You must apply the following APARs: OA32298, OA34063, and OA34662.
You can run the I/O Priority Manager for z/OS in the same way as for Open Systems: a user
can control the performance results of the CKD volumes and assign them online to the
needed performance groups.
With z/OS and zWLM software support, the user assigns application priorities through the
Workload Manager. z/OS then assigns an “importance” value to each I/O, based on the
zWLM inputs. Then, based on the prior history of I/O response times for I/Os with the same
importance value, and based on the zWLM expectations for this response time, z/OS assigns
an “achievement” value to each I/O.
The importance and achievement values for each I/O are compared. The I/O becomes
associated with a performance policy, independently of the performance group or policy of the
volume. When saturation or resource contention occurs, I/O is then managed according to the
preassigned zWLM performance policy.
Together these features can help to consolidate various applications on a single DS8000
system. These features optimize overall performance through automated tiering in a simple
and cost-effective manner while sharing resources. The DS8000 can help address storage
consolidation requirements, which, in turn, helps to manage increasing amounts of data with
less effort and lower infrastructure costs.
DS8000 I/O Priority Manager, REDP-4760, describes the I/O Priority Manager in detail. Also,
the following IBM patent provides insight:
http://depatisnet.dpma.de/DepatisNet/depatisnet?action=pdf&docid=US000007228354B2&
switchToLang=en
Understanding the hardware components, the functions that are performed by each
component, and the technology can help you select the correct components to order and the
quantities of each component. However, do not focus too much on any one hardware
component. Instead, ensure that you balance components to work together effectively. The
ultimate criterion for storage server performance is the total throughput.
Storage unit
A storage unit consists of a single DS8000 system (including expansion frames). A storage
unit can consist of several frames: one base frame and up to three expansion frames for a
DS8800 system and up to four expansion frames for a DS8700 system. The storage unit ID is
the DS8000 base frame serial number, ending in 0 (for example, 75-06570).
Processor complex
A DS8800 processor complex is one POWER6+ p570 copper-based symmetric
multiprocessor (SMP) system unit. Each processor complex runs at 5.0 GHz, in either a
2-way or 4-way model.
On all DS8000 models, there are two processor complexes (servers), which are housed in the
base frame. These processor complexes form a redundant pair so that if either processor
complex fails, the surviving processor complex continues to run the workload.
In a DS8000, a server is effectively the software that uses a processor logical partition (a
processor LPAR) and that has access to the memory and processor resources available on a
processor complex. The DS8800 model 951 and the DS8700 model 941 are single-SFI
(Storage Facility Image) models and therefore have one storage LPAR that uses 100% of the
resources.
A Storage Facility Image has the following DS8000 resources dedicated to its use:
Processors
Cache and persistent memory
I/O enclosures
Disk enclosures
Figure 2-1 on page 27 shows an overview of the architecture of the DS8800 storage system.
The architecture of the previous system, DS8700, is similar to Figure 2-1 on page 27, with the
following exceptions:
Instead of POWER6+, the DS8700 runs POWER6 (with a slightly slower cycle speed).
The internal cabling is Fibre Channel (FC)/copper instead of optical cabling.
The DS8700 protocol to the disks is 2 Gbps. The DS8800 uses 8 Gbps up to the point of
the FC-to-SAS bridge.
The 4 Gbps host adapters of the DS8700 contained a PCI-X-to-PCIe interface, as
explained in 2.5, “Host adapters” on page 46.
On all DS8000 models, each processor complex has its own system memory. Within each
processor complex, the system memory is divided into these parts:
Memory used for the DS8000 control program
Cache
Persistent cache
The amount allocated as persistent memory scales according to the processor memory that
is selected.
Cache processing improves the performance of the I/O operations done by the host systems
that attach to the DS8000. Cache size, the efficient internal structure, and the caching
algorithms of the DS8000 all contribute to this improvement.
Read operations
These operations occur when a host sends a read request to the DS8000:
A cache hit occurs if the requested data resides in the cache. In this case, the I/O
operation does not disconnect from the channel/bus until the read is complete. A read hit
provides the highest performance.
A cache miss occurs if the data is not in the cache. The I/O operation logically disconnects
from the host. Other I/O operations occur over the same interface. A stage operation from
the disk subsystem occurs.
The data remains in the cache and persistent memory until it is destaged, at which point it is
flushed from cache. Destage operations of sequential write operations to RAID 5 arrays are
done in parallel mode, writing a stripe to all disks in the RAID set as a single operation. An
entire stripe of data is written across all the disks in the RAID array. The parity is generated
one time for all the data simultaneously and written to the parity disk. This approach reduces
the parity generation penalty associated with write operations to RAID 5 arrays. For RAID 6,
data is striped on a block level across a set of drives, similar to RAID 5 configurations. A
second set of parity is calculated and written across all the drives. This technique does not
apply to RAID 10 arrays, because no parity generation is required. Therefore, no penalty is
involved other than a double write when writing to RAID 10 arrays.
It is possible that the DS8000 cannot copy write data to the persistent cache because it is full,
which can occur if all data in the persistent cache waits for destage to disk. In this case,
instead of a fast write hit, the DS8000 sends a command to the host to retry the write
operation. Having full persistent cache is not a good situation, because it delays all write
operations. On the DS8000, the amount of persistent cache is sized according to the total
amount of system memory. The amount of persistent cache is designed so that the probability
is low of full persistent cache occurring in normal processing.
Cache management
The DS8000 system offers superior caching algorithms:
Sequential Prefetching in Adaptive Replacement Cache (SARC)
Adaptive Multi-stream Prefetching (AMP)
Intelligent Write Caching (IWC)
IBM Storage Development in partnership with IBM Research developed these algorithms.
The decision to copy an amount of data into the DS8000 cache can be triggered from two
policies: demand paging and prefetching.
Demand paging means that eight disk blocks (a 4K cache page) are brought in only on a
cache miss. Demand paging is always active for all volumes and ensures that I/O patterns
with locality find at least some recently used data in the cache.
Prefetching means that data is copied into the cache even before it is requested. To prefetch,
a prediction of likely future data access is required. Because effective, sophisticated
prediction schemes need extensive history of the page accesses, the algorithm uses
prefetching only for sequential workloads. Sequential access patterns are commonly found in
video-on-demand, database scans, copy, backup, and recovery. The goal of sequential
prefetching is to detect sequential access and effectively preload the cache with data to
minimize cache misses.
For prefetching, the cache management uses tracks. A track is a set of 128 disk blocks
(16 cache pages). To detect a sequential access pattern, counters are maintained with every
track to record if a track is accessed together with its predecessor. Sequential prefetching
becomes active only when these counters suggest a sequential access pattern. In this
manner, the DS8000 monitors application read patterns and dynamically determines whether
it is optimal to stage into cache:
Only the page requested
The page requested, plus remaining data on the disk track
An entire disk track or multiple disk tracks not yet requested
The decision of when and what to prefetch is made on a per-application basis (rather than a
system-wide basis) to be responsive to the data reference patterns of various applications
that can run concurrently.
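The following Python sketch is a simplified illustration of this per-track detection idea. The actual microcode implementation and its thresholds are not published; the SEQ_THRESHOLD value and the prefetch granularity of one track are assumptions for the example.

from collections import defaultdict

TRACK_BLOCKS = 128          # one track = 128 disk blocks = 16 cache pages
SEQ_THRESHOLD = 2           # assumed value, for illustration only

seq_counter = defaultdict(int)   # per-track sequentiality counters
accessed = set()                 # tracks that were read at least once

def on_read(block_number):
    """Record a read and decide whether to prefetch the next track."""
    track = block_number // TRACK_BLOCKS
    if (track - 1) in accessed:
        # the track is accessed together with its predecessor
        seq_counter[track] = seq_counter[track - 1] + 1
    accessed.add(track)
    if seq_counter[track] >= SEQ_THRESHOLD:
        prefetch(track + 1)      # stage data before it is requested

def prefetch(track):
    print(f"prefetching track {track}")

# A strictly sequential read pattern quickly triggers prefetching:
for blk in range(0, 5 * TRACK_BLOCKS, TRACK_BLOCKS):
    on_read(blk)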
With the System z integration of newer DS8000 codes, a host application, such as DB2, can
send cache hints to the storage system and manage the DS8000 prefetching, reducing the
number of I/O requests.
To decide which pages are flushed when the cache is full, sequential data and random
(non-sequential) data are separated into different lists, as illustrated in Figure 2-2 on page 30.
Figure 2-2 RANDOM and SEQ cache lists, each with an MRU head and an LRU bottom, and the desired size of the SEQ list
In Figure 2-2, a page that is brought into the cache by simple demand paging is added to the
Most Recently Used (MRU) head of the RANDOM list. With no further references to that
page, it moves down to the Least Recently Used (LRU) bottom of the list. A page that is
brought into the cache by a sequential access or by sequential prefetching is added to the
MRU head of the sequential (SEQ) list. It moves down that list as more sequential reads are
done. Additional rules control the management of pages between the lists so that the same
pages are not kept in memory twice.
To follow workload changes, the algorithm trades cache space between the RANDOM and
SEQ lists dynamically. Trading cache space allows the algorithm to prevent one-time
sequential requests from filling the entire cache with blocks of data with a low probability of
being read again. The algorithm maintains a desired size parameter for the SEQ list. The
desired size is continually adapted in response to the workload. Specifically, if the bottom
portion of the SEQ list is more valuable than the bottom portion of the RANDOM list, the
desired size of the SEQ list is increased. Otherwise, the desired size is decreased. The
constant adaptation strives to make optimal use of limited cache space and delivers greater
throughput and faster response times for a specific cache size.
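As a rough illustration of this two-list scheme, the following Python sketch keeps a RANDOM list and a SEQ list with MRU and LRU ends and a desired size for the SEQ list. The adaptation rule used here (grow or shrink the desired size on every miss) is deliberately crude and only stands in for the real SARC rule, which compares the value of the bottom portions of the two lists.

from collections import OrderedDict

class SarcLikeCache:
    """Simplified two-list cache in the spirit of SARC (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.random = OrderedDict()      # MRU at the end, LRU at the front
        self.seq = OrderedDict()
        self.seq_desired = capacity // 2 # desired size of the SEQ list

    def _evict(self):
        # Evict from the list that exceeds its desired share of the cache.
        victim = self.seq if len(self.seq) > self.seq_desired else self.random
        if victim:
            victim.popitem(last=False)   # drop the LRU (bottom) page

    def access(self, page, sequential):
        target = self.seq if sequential else self.random
        if page in target:               # hit: move the page to the MRU end
            target.move_to_end(page)
            return
        # Real SARC adapts seq_desired from hits in the bottom portions of the
        # lists; this sketch simply nudges it on every miss.
        self.seq_desired += 1 if sequential else -1
        self.seq_desired = max(1, min(self.capacity - 1, self.seq_desired))
        if len(self.random) + len(self.seq) >= self.capacity:
            self._evict()
        target[page] = True              # insert at the MRU end

cache = SarcLikeCache(capacity=8)
for p in range(20):
    cache.access(p, sequential=True)     # a one-time sequential scan
cache.access(100, sequential=False)      # a random touch
print(len(cache.seq), len(cache.random), cache.seq_desired)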
SARC performance
IBM simulated a comparison of cache management with and without the SARC algorithm.
The new algorithm, with no change in hardware, provided these results:
Effective cache space: 33% greater
Cache miss rate: 11% reduced
Peak throughput: 12.5% increased
Response time: 50% reduced
Figure 2-3 on page 31 shows the improvement in response time due to SARC.
The SEQ list is managed by the Adaptive Multi-stream Prefetching (AMP) technology, which
was developed by IBM research. AMP introduces an autonomic, workload-responsive,
self-optimizing prefetching technology that adapts both the amount of prefetch and the timing
of prefetch on a per-application basis in order to maximize the performance of the system.
The AMP algorithm solves two problems that plague most other prefetching algorithms:
Prefetch wastage occurs when prefetched data is evicted from the cache before it can be
used.
Cache pollution occurs when less useful data is prefetched instead of more useful data.
By choosing the prefetching parameters, AMP provides optimal sequential read performance
and maximizes the aggregate sequential read throughput of the system. The amount
prefetched for each stream is dynamically adapted according to the needs of the application
and the space available in the SEQ list. The timing of the prefetches is also continuously
adapted for each stream to avoid misses and, at the same time, to avoid any cache pollution.
AMP dramatically improves performance for common sequential and batch processing
workloads. It also provides performance synergy with DB2 by preventing table scans from
being I/O bound. It improves the performance of index scans and DB2 utilities, such as Copy
and Recover. Furthermore, AMP reduces the potential for array hot spots that result from
extreme sequential workload demands.
The CLOCK algorithm uses temporal ordering. It keeps a circular list of pages in memory,
with the “hand” pointing to the oldest page in the list. When a page needs to be inserted in the
cache, then an R (recency) bit is inspected at the “hand” location. If R is zero, the new page is
put in place of the page to which the “hand” points and R is set to 1. Otherwise, the R bit is
cleared (set to zero), the clock hand moves one step clockwise, and the process is repeated
until a page is replaced.
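The following Python sketch implements the CLOCK replacement step as described above; the buffer size and page identifiers are arbitrary.

class ClockCache:
    """Minimal CLOCK page replacement (illustration of the algorithm above)."""

    def __init__(self, nframes):
        self.pages = [None] * nframes     # circular list of page frames
        self.rbit = [0] * nframes         # recency bit per frame
        self.hand = 0                     # points to the oldest page

    def insert(self, page):
        # Advance the hand until a frame with R == 0 is found.
        while self.rbit[self.hand] == 1:
            self.rbit[self.hand] = 0      # clear R: the page gets a second chance
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page      # replace the page the hand points to
        self.rbit[self.hand] = 1          # newly inserted page starts with R = 1
        self.hand = (self.hand + 1) % len(self.pages)

    def touch(self, page):
        if page in self.pages:            # a hit sets the recency bit
            self.rbit[self.pages.index(page)] = 1

clock = ClockCache(nframes=4)
for p in ("A", "B", "C", "D"):
    clock.insert(p)
clock.touch("B")
clock.insert("E")     # the hand sweeps, clearing R bits, and replaces the oldest page, A
print(clock.pages)    # ['E', 'B', 'C', 'D']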
The CSCAN algorithm uses spatial ordering. The CSCAN algorithm is the circular variation of
the SCAN algorithm. The SCAN algorithm tries to minimize the disk head movement when
servicing read and write requests. It maintains a sorted list of pending requests along with the
position on the drive of the request. Requests are processed in the current direction of the
disk head, until it reaches the edge of the disk. At that point, the direction changes. In the
CSCAN algorithm, the requests are always served in the same direction. When the head
arrives at the outer edge of the disk, it returns to the beginning of the disk and services the
new requests in this direction only. This algorithm results in more equal performance for all
head positions.
The idea of IWC is to maintain a sorted list of write groups, as in the CSCAN algorithm. The
smallest and the highest write groups are joined, forming a circular queue. The addition is to
maintain a recency bit for each write group, as in the CLOCK algorithm. A write group is
always inserted in its correct sorted position, and the recency bit is set to 0 at the beginning.
When a write hit occurs, the recency bit is set to 1. The destage operation proceeds as
follows: a destage pointer is maintained that scans the circular list looking for destage
victims. This algorithm only allows destaging of write groups whose recency bit is 0. Write
groups with a recency bit of 1 are skipped, and their recency bit is reset to 0. This method
gives an “extra life” to those write groups that were hit since the last time the destage pointer
visited them.
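The following Python sketch illustrates this combination of CSCAN ordering and CLOCK-style recency bits; identifying write groups by a single sorted position value is a simplification for the example.

import bisect

class IwcList:
    """Simplified IWC destage list: CSCAN ordering plus recency bits (sketch)."""

    def __init__(self):
        self.groups = []        # write-group positions, kept sorted (CSCAN order)
        self.recency = {}       # recency bit per write group
        self.pointer = 0        # destage pointer into the circular list

    def write(self, pos):
        if pos in self.recency:
            self.recency[pos] = 1              # write hit: set the recency bit
        else:
            bisect.insort(self.groups, pos)    # insert at the sorted position
            self.recency[pos] = 0              # new group starts with recency 0

    def next_destage(self):
        # Scan circularly; only groups with recency 0 may be destaged.
        for _ in range(len(self.groups)):
            pos = self.groups[self.pointer % len(self.groups)]
            if self.recency[pos] == 0:
                self.groups.remove(pos)        # destage victim found
                del self.recency[pos]
                return pos
            self.recency[pos] = 0              # skip it, but use up its extra life
            self.pointer += 1
        return None

iwc = IwcList()
for pos in (40, 10, 30, 20):
    iwc.write(pos)
iwc.write(10)                                  # write hit on group 10
print(iwc.next_destage())                      # 10 is skipped; 20 is destaged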
In the DS8000 implementation, an IWC list is maintained for each rank. The dynamically
adapted size of each IWC list is based on the workload intensity on each rank. The rate of
destage is proportional to the portion of nonvolatile storage (NVS) occupied by an IWC list.
The NVS is shared across all ranks in a cluster. Furthermore, destages are smoothed out so
that write bursts are not translated into destage bursts.
IWC has better or comparable peak throughput to the best of CSCAN and CLOCK across a
wide gamut of write-cache sizes and workload configurations. In addition, even at lower
throughputs, IWC has lower average response times than CSCAN and CLOCK. The
random-write parts of workload profiles benefit from the IWC algorithm. The costs for the
destages are minimized, and the number of possible write-miss IOPS greatly improves
compared to a system not utilizing IWC.
The IWC algorithm can be applied to storage systems, servers, and their operating systems.
The DS8000 implementation is the first for a storage system. Because IBM is getting patents
for this algorithm and the other advanced cache algorithms, it is unlikely that a competitive
system uses them.
For SAN Volume Controller (SVC) attachments, consider the SVC node cache in this
calculation, which might lead to a slightly smaller DS8000 cache. However, most installations
come with a minimum of 128 GB of DS8000 cache. Using solid-state drives (SSDs) in the
DS8000 does not change the prior values typically. SSDs are beneficial with cache-unfriendly
workload profiles, because SSDs generally reduce the cost of cache misses.
Most storage servers support a mix of workloads. These general rules can work well, but
many times, they do not. Use a general rule only if you have no other information on which to
base your selection.
When coming from an existing disk storage server environment and you intend to consolidate
this environment into DS8000s, follow these recommendations:
Choose a cache size for the DS8000 series that has a similar ratio between cache size
and disk storage to that of the configuration that you currently use.
When you consolidate multiple disk storage servers, configure the sum of all cache from
the source disk storage servers for the target DS8000 processor memory or cache size.
For example, consider replacing four DS8300s, each with 21 TB and 64 GB cache, with a
single DS8800. The ratio between cache size and disk storage for each DS8300 is 0.3% (64
GB/21 TB). The new DS8800 is configured with 110 TB to consolidate the four 21 TB
DS8300s, plus provide capacity for growth. This DS8800 requires 330 GB of cache to keep
the original cache-to-disk storage ratio. Round up to the next available memory size, which is
384 GB for this DS8800 configuration. When using an SVC in front, round down to 256 GB of
DS8800 cache.
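The following Python sketch reproduces this sizing arithmetic. The list of orderable cache sizes is an assumption for the example, so verify the memory features available for your configuration.

def cache_for_consolidation(source_caches_gb, source_capacities_tb, target_capacity_tb,
                            available_sizes_gb=(32, 64, 128, 256, 384)):
    """Keep the source cache-to-disk ratio on the consolidated system.

    available_sizes_gb is an assumption for this example; check the memory
    features that are orderable for your DS8800 configuration.
    """
    ratio = sum(source_caches_gb) / sum(source_capacities_tb)   # GB of cache per TB
    required_gb = ratio * target_capacity_tb
    # Round up to the next orderable cache size.
    for size in sorted(available_sizes_gb):
        if size >= required_gb:
            return required_gb, size
    return required_gb, max(available_sizes_gb)

# Four DS8300s, each with 64 GB cache and 21 TB, consolidated onto a 110 TB DS8800:
required, configured = cache_for_consolidation([64] * 4, [21] * 4, 110)
# About 335 GB (the text rounds the ratio to 0.3%, giving 330 GB); configure 384 GB.
print(f"required {required:.0f} GB, configure {configured} GB")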
The cache size is not an isolated factor when estimating the overall DS8000 performance.
Consider it with the DS8000 model, the capacity and speed of the disk drives, and the
number and type of host adapters. Larger cache sizes mean that more reads are satisfied
from the cache, which reduces the load on device adapters (DAs) and disk drive modules
(DDMs) associated with reading data from disk. To see the effects of different amounts of
cache on the performance of the DS8000, run a Disk Magic model. See 6.1, “Disk Magic” on
page 176.
DAs are called DA pairs, because DAs are always installed in quantities of two (one DA is
attached to each processor complex). The members of a DA pair are split across two I/O
enclosures for redundancy. The number of installed disk devices determines the number of
required DAs. In any I/O enclosure, the number of individual DAs installed can be zero, one,
or two.
The first DS8000 models, DS8100 and DS8300, attached the I/O bays through RIO-G loops.
The approximately 2 GB/s throughput of a RIO-G port limited the maximum sequential
throughput of the older models. The newer architecture (Figure 2-1 on page 27) is
not limited in this way, because all I/O data traffic occurs directly between each I/O bay and
both processor complexes.
DAs are installed in pairs, because each processor complex requires its own DA to connect to
each disk enclosure for redundancy. DAs in a pair are installed in separate I/O enclosures to
eliminate the I/O enclosure as a single point of failure.
Each DA performs the RAID logic and frees up the processors from this task. The actual
throughput and performance of a DA is not only determined by the port speed and hardware
that are used, but also by the firmware efficiency. Figure 2-5 shows the detailed cabling
between the DAs and the 24-drive Gigapacks.
Figure 2-5 Device adapter (processor, flash, SRAM) to Gigapack enclosure cabling: 8 Gbps FC links from the DA SFP ports to the enclosure ASICs (FC-to-SAS bridges), 6 Gbps SAS to the 24 SAS drives, and redundant AC/DC power supplies
The ASICs provide the FC-to-SAS bridging function from the external SFP connectors to
each of the ports on the SAS disk drives. The processor is the controlling element in the
system.
Performance is enhanced because both DAs connect to the switched FC subsystem back
end, as shown in Figure 2-6. Each DA port can concurrently send and receive data.
The two switched point-to-point connections to each drive, which also connect both DAs to
each switch, have these characteristics:
There is no arbitration competition and interference between one drive and all the other
drives, because there is no hardware in common for all the drives in the FC-AL loop. This
approach leads to an increased bandwidth with the full 8 Gbps FC speed to the back end
where the FC-to-SAS conversion is made. This approach uses the full SAS 2.0 speed for
each drive.
This architecture doubles the bandwidth over conventional FC-AL implementations due to
two simultaneous operations from each DA to allow for two concurrent read operations
and two concurrent write operations at the same time.
In addition to superior performance, this setup offers improved reliability, availability, and
serviceability (RAS) over conventional FC-AL. The failure of a drive is detected and
reported by the switch. The switch ports distinguish between intermittent failures and
permanent failures. The ports understand intermittent failures, which are recoverable, and
collect data for predictive failure statistics. If one of the switches fails, a disk enclosure
service processor detects the failing switch and reports the failure using the other loop. All
drives can still connect through the remaining switch.
Figure 2-6 High availability and increased bandwidth connect both DAs to two logical loops
The DS8800 now supports two types of high-density storage enclosure: the 2.5 inch SFF
enclosure and the new 3.5 inch large form factor (LFF) enclosure. The new high-density and
lower-cost LFF storage enclosure accepts 3.5 inch drives, offering 12 drive slots. The
previously introduced SFF enclosure offers twenty-four 2.5 inch drive slots. The front of the
LFF enclosure differs from the front of the SFF enclosure, with its 12 drives slotting
horizontally rather than vertically.
DDMs are added in increments of 16 (except for SSDs or nearline disks, which can also be
ordered with a granularity of eight). For each group of 16 disks, eight are installed in the first
enclosure and eight are installed in the next adjacent enclosure. These 16 disks form two
array sites of eight DDMs each, from which RAID arrays are built during the logical
configuration process. For each array site, four disks are from the first (or third, fifth, and so
on) enclosure, and four disks are from the second (or fourth, sixth, and so on) enclosure.
All disks within a disk enclosure pair must be the same capacity and rotation speed. A disk
enclosure pair that contains less than 48 DDMs must also contain dummy carriers called
fillers. These fillers are used to maintain airflow.
SSDs can be ordered in increments of eight drives (= one array-site/rank = half drive set).
However, we suggest that you order them in increments of 16 drives (= two array-sites/ranks
= one drive set). Then, you can create a balanced configuration across the two processor
complexes, especially with the high-performance capabilities of SSDs. In general, an uneven
number of ranks of similar type drives (especially SSDs) can cause an imbalance in
resources, such as cache or processor usage. Use a balanced configuration with an even
number of ranks and extent pools, for instance, one even and one odd extent pool, and each
with one SSD rank. This balanced configuration enables Easy Tier automatic SSD/hard disk
drive (HDD) cross-tier performance optimization on both processor complexes. It distributes
the overall workload evenly across all resources. With only one SSD rank and multiple HDD
ranks of the same drive type, you can only assign the SSD rank to one extent pool. And, you
can run Easy Tier automated SSD/HDD cross-tier performance optimization on one
processor complex only.
For nearline HDDs, however, especially with only three array sites per 3.5 inch disk enclosure
pair on a DS8800 system (in contrast to six array sites in a 2.5 inch disk enclosure pair), an
uneven number of ranks is not as critical as it is with SSDs, because nearline HDDs generally
show lower performance characteristics.
By putting half of each array on one loop and half of each array on another loop, there are
more data paths into each array. This design provides a performance benefit, particularly in
situations where a large amount of I/O goes to one array, such as sequential processing and
array rebuilds.
The 15K and 10K rpm disks are also available as encrypted drives or Full Disk Encryption
(FDE). FDE drives have essentially the same performance as the non-encrypted drives.
Additional drive types are constantly in evaluation and added to this list when available.
These disks provide a range of options to meet the capacity and performance requirements of
various workloads, and to introduce automated multi-tiering.
Figure 2-7 shows a comparison between the DS8700 15K rpm drives (blue) to the current
15K and 10K rpm drives in the DS8800 for one RAID-5 array, for both random and sequential
workloads.
Figure 2-7 Comparison 2.5-inch SAS (DS8800) versus 3.5-inch FC drives (DS8700) for one RAID-5 array
For random reads, which are the most prevalent type of reads in a workload profile, consider
these comparisons:
When comparing 4K random reads, the 2.5 inch 10K SAS drives deliver only 10% fewer
IOPS than the 3.5 inch 15K FC drives.
The 2.5 inch 15K SAS drives deliver around 10% more IOPS than the 3.5 inch 15K FC
drives.
For sequential loads, the DS8800 2.5 inch 10K SAS drives perform equivalently or better than
the DS8700 3.5 inch 15K FC drives. The 2.5 inch SFF drives consume half as much power as
the 3.5 inch HDDs and save on floor space.
The 10K rpm Enterprise-SAS HDDs with larger capacities are often combined with SSDs to
replace installations with 15K rpm FC small-capacity drives. This combination decreases the
footprint. Ensure that you analyze the workload and size the system.
Another difference between these drive types is the RAID rebuild time after a drive failure.
This rebuild time grows with larger capacity drives. Therefore, RAID 6 is used for the
large-capacity nearline drives to prevent a second disk failure during rebuild from causing a
loss of data. See 3.1.2, “RAID 6 overview” on page 53.
Easy Tier in automatic mode considers these differences and measures the accumulated
response times for each extent. It also differentiates the workload profile between random
and sequential I/O and between reads and writes, and it considers the block size to determine
whether to move an extent upward to SSDs. Extents with large-block workload profiles are not
moved up to SSDs, because HDDs can handle these I/O requests efficiently.
Figure 2-8 shows that for small-block (4K) 100% random-read loads, one SSD rank can
deliver over 40,000 IOPS. Writes, or a read+write mixture, show smaller results, but the
results are still good when compared to one HDD rank (as shown in Figure 2-7 on page 39).
We also see that the DS8800 performs better when used with SSDs than the DS8700 due to
the enhanced DAs.
Figure 2-8 SSD comparison DS8800 (2.5 inch) versus DS8700 (3.5 inch) for random and sequential with one RAID 5 array
Sample data: The sample performance measurement data shown in Figure 2-8 is for
comparative purposes only. We collected the data in a controlled laboratory environment at
a specific point in time by using the configurations, hardware, and firmware levels available
at that time. Current performance in real-world environments varies. The data is intended
to help illustrate how different hardware technologies behave in relation to each other.
Contact your IBM representative or IBM Business Partner if you have questions about the
expected performance capability of IBM products in your environment.
Sequential loads, as shown in Figure 2-8, are slightly faster than the HDD sequential
throughputs that are shown in Figure 2-7 on page 39. However, the difference is not large
enough to justify using SSDs for large-block sequential loads.
SSDs are available as single-level cell (SLC) and multi-level cell (MLC), which refers to
whether we use two or four voltage states per flash cell (NAND transistor). MLC SSDs allow
higher capacities, by storing 2 bits per cell, but fewer writes per flash cell. SSD technology is
evolving at a fast pace. In the past, we used SLC SSDs for enterprise-class applications.
Over-provisioning techniques are used for failing cells, and data in worn-out cells is copied
proactively. There are several algorithms for wear-leveling across the cells. The algorithms
include allocating a rarely used block for the next block to use or moving data internally to
less-used cells to enhance lifetime. Error detection and correction mechanisms are used, and
bad blocks are flagged.
SSDs are mature and can be used in critical production environments. Their high
performance for small-block/random workloads makes them financially viable for part of a
hybrid HDD/SSD pool mix.
DS8800
A disk enclosure (Gigapack) pair of the DS8800 holds 24 drives when using the 2.5 inch SFF
drives, or 12 drives for the LFF HDDs. Each disk enclosure is installed in a specific physical
location within a DS8000 frame and in a specific sequence. Disk enclosure pairs are installed
in most cases bottom-up. The fastest drives of a configuration (usually a solid-state drive
(SSD) drive set) are installed first. Then, the Enterprise HDDs are installed. Then, the
nearline HDDs are installed. Figure 2-9 shows an exception with the 2-way models.
Figure 2-9 shows how each disk enclosure is associated with a certain DA pair and position in
the DA loop. When ordering more drives, you can order more DA pairs for the DS8800 up to a
maximum of eight DA pairs.
Figure 2-9 DS8800 A-Frame DA pair installation order for 2-way and 4-way Standard models
The C and D interfaces of the DAs, which are not used yet in the current DS8800 model, can
potentially be used to attach more disk enclosures.
DS8700
The DS8700 models can connect up to four expansion frames. The base frame and the fourth
expansion frame hold 128 DDMs. The first, second, and third expansion frames hold 256
DDMs. In total, 1024 (3.5 inch) drives can be installed in a DS8700.
Figure 2-11 on page 44 shows which disk enclosures are associated with which DA pair. With
more than 512 disks, each DA pair handles more than 64 DDMs (eight ranks). For usual
workload profiles, this number is not a problem. It is rare to overload DAs. The performance of
many storage systems is determined by the number of disk drives, as long as there are no HA
bottlenecks.
2.4.9 The Business Class model for DS8800 only: Feature Code 1250
There are two models of the DS8800 2-way. For typical environments, the Enterprise Class or
Standard model scales up to 144 disks in the A frame. For simpler environments without high
performance requirements, the Business Class cabling model (FC1250) scales up to 240
drives with only two DA pairs. Standard cabling needs four DA pairs and the 4-way model to
fill all 240 drive slots of the A frame.
Figure 2-12 DS8800 2-way Business Class
The 2-way Business Class feature is designed for clients with lower performance
requirements. For example, a client might replace a DS6800 that was attached to a System z
host, but still need a DS8800 for attachment reasons. These clients can save on the number
of processors, HAs, and DAs, and still have 240 possible disk drives available. However, the
Business Class model is practically limited to the A frame only.
In your environment, perhaps your sequential throughput requirements are high, but your
capacity requirements are low. For instance, you might have capacity requirements for 256
disks only, but you still want the full sequential throughput potential of all DAs. For these
situations, IBM offers the Performance Accelerator feature (FC1980). This feature provides
one new DA pair for each 32 DDMs. This feature is offered for DS8700 941 models that have
one base frame and one expansion frame. For example, this feature provides six DA pairs
with 192 disk drives. By using 256 drives, you can use the maximum of eight DA pairs.
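The rule translates into a simple calculation, sketched here in Python as an illustrative helper (the function name is ours):

import math

def da_pairs_with_performance_accelerator(installed_ddms, max_da_pairs=8):
    """DA pairs provided by the Performance Accelerator feature (FC1980):
    one DA pair per 32 DDMs, up to the machine maximum (illustrative helper)."""
    return min(max_da_pairs, math.ceil(installed_ddms / 32))

print(da_pairs_with_performance_accelerator(192))   # 6 DA pairs
print(da_pairs_with_performance_accelerator(256))   # 8 DA pairs (maximum)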
Few clients needed this feature due to the high performance of the DAs without it.
You can see the high numbers that were achieved in Figure 2-13 and Figure 2-14; it is difficult
to overload one DA now. The internal architecture is improved between the DS8800 and
earlier DS8000 models.
DA pair sequential throughput comparison, DS8800 versus DS8700 (MB/sec)
Sample data: The sample performance measurement data shown in Figure 2-13 and
Figure 2-14 is for comparative purposes only. We collected the data in a controlled
laboratory environment at a specific point in time by using the configurations, hardware,
and firmware levels available at that time. Current performance in real-world environments
varies. The data is intended to help illustrate how different hardware technologies behave
in relation to each other. Contact your IBM representative or IBM Business Partner if you
have questions about the expected performance capability of the IBM products in your
environment.
Each DS8800 FC adapter offers 4 Gbps or 8 Gbps FC ports. The cable connector that is
required to attach to this adapter is an LC type. Each of the ports on a DS8800 HA can also
independently be either FCP or FICON. The type of the port can be changed through the DS
Storage Manager GUI or by using the DS8000 command-line interface (DSCLI) commands.
A port cannot be both FICON and FCP simultaneously, but it can be changed as required.
The front end with the 8 Gbps ports scales up to 128 ports for a DS8800, using the 8-port
HAs, which results in a theoretical aggregated host I/O bandwidth of 128 x 8 Gbps.
The 8 Gbps adapter ports can each negotiate to 8, 4, or 2 Gbps speeds (not 1 Gbps). For
attachments to 1 Gbps hosts, use a switch in between.
The 8-port HAs offer essentially the same total maximum throughput when taking loads of all
its ports together as the 4-port HAs of the DS8800. Therefore, the 8-port HAs are meant for
more attachment options, but not for more performance.
Compared to previous HA generations in DS8700 and DS8300, the DS8800 HAs have a
much higher possible total aggregated throughput over all ports, apart from the differences in
nominal port speed. More detailed HA measurements are in the IBM System Storage
DS8800 Performance Whitepaper, WP102025.
Automatic Port Queues is a mechanism the DS8800 uses to self-adjust the queue based on
the workload. The Automatic Port Queues mechanism allows higher port queue
oversubscription while maintaining a fair share for the servers and the accessed LUNs. A
port whose queue fills up goes into SCSI Queue Full mode, where it accepts no additional
commands to slow down the I/Os. By avoiding error recovery and the 30-second blocking
SCSI Queue Full recovery interval, the overall performance is better with Automatic Port
Queues.
After you determine your throughput workload requirements, you must choose the
appropriate number of connections to put between your open system hosts and the DS8000
to sustain this throughput. Use an appropriate number of HA cards to satisfy high throughput
demands. The number of host connections per host system is primarily determined by the
required bandwidth.
Host connections frequently go through various external connections between the server and
the DS8000. Therefore, you need enough host connections for each server so that if half of
the connections fail, processing can continue at the level before the failure. This
availability-oriented approach requires that each connection carry only half the data traffic
that it otherwise might carry. These multiple lightly loaded connections also help to minimize
the instances when spikes in activity might cause bottlenecks at the HA or port. A
multiple-path environment requires at least two connections. Four connections are typical,
and eight connections are not unusual.
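The following Python sketch turns this availability-oriented rule into a path count. The usable throughput assumed per 8 Gbps FC port (800 MBps) and the helper name are assumptions for the example.

import math

def host_connections(peak_mbps, port_usable_mbps=800, minimum=2):
    """Availability-oriented path count for one host (illustration only).

    port_usable_mbps is an assumed practical throughput per 8 Gbps FC port.
    Each connection is planned to carry only half of the traffic it could,
    so that losing half of the connections does not reduce throughput.
    """
    needed_at_peak = math.ceil(peak_mbps / port_usable_mbps)
    connections = max(minimum, 2 * needed_at_peak)
    return connections + (connections % 2)       # keep an even number for pairing

print(host_connections(500))    # light load: 2 connections
print(host_connections(1500))   # 4 connections
print(host_connections(3000))   # 8 connections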
In a System z environment, you need to select a SAN switch or director that also supports
FICON. An availability-oriented approach applies to the System z environments similar to the
Open Systems approach. Plan enough host connections for each server so that if half of the
connections fail, processing can continue at the level before the failure.
See 4.10.1, “I/O port planning considerations” on page 147 for additional guidelines.
For a list of drive combinations and RAID configurations, see 8.5.2, “Disk capacity,” in IBM
System Storage DS8000 Architecture and Implementation, SG24-8886.
Performance of the RAID 5 array returns to normal when the data reconstruction onto the
spare device completes. The time taken for sparing can vary, depending on the size of the
failed DDM and the workload on the array, the switched network, and the DA. The use of
arrays across loops (AAL) both speeds up rebuild time and decreases the impact of a rebuild.
Smart Rebuild, introduced with the DS8000 Licensed Machine Code (LMC) R6.2, further
reduces the risk of a second drive failure for RAID 5 ranks during rebuild by detecting a failing
drive early and copying the drive data to the spare drive in advance. If a RAID 5 array is
predicted to fail, a “rebuild” is initiated by copying off that failing drive to the spare drive before
it fails, decreasing the overall rebuild time. If the drive fails during the copy operation, the
rebuild continues from the parity information like a regular rebuild.
RAID 6 is best used in combination with large-capacity disk drives, for example, 2 TB and
3 TB nearline drives, because of the longer rebuild times and the increased risk of an
additional medium error occurring during the rebuild, in addition to the failed drive. In many
environments today, RAID 6 is also considered for 600 GB (and larger) Enterprise drives in
cases where reliability is favored over performance and the performance trade-off can be
accepted. However, with the Smart Rebuild capability that was introduced with the DS8000
LMC R6.2, the risk of a second drive failure for RAID 5 ranks is also further reduced.
Comparing RAID 6 to RAID 5 performance provides about the same results for reads. For
random writes, the throughput of a RAID 6 array is only two thirds of a RAID 5 array because
of the additional parity handling. Workload planning is therefore important before
implementing RAID 6, specifically for write-intensive applications, including Copy Services
targets and FlashCopy SE repositories, for which RAID 6 is generally not recommended. Yet,
when properly sized for the I/O demand, RAID 6 is a considerable reliability enhancement.
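The "two thirds" figure follows from the standard small-random-write penalties (four back-end operations per host write for RAID 5, six for RAID 6, two for RAID 10), as this Python sketch shows. Real results also depend on cache hits, stripe size, and workload, and the per-drive IOPS value is an assumption.

WRITE_PENALTY = {"RAID5": 4, "RAID6": 6, "RAID10": 2}   # back-end ops per random write

def random_write_capability(backend_iops, raid_level):
    """Host random-write IOPS an array can absorb for a given back-end capability."""
    return backend_iops / WRITE_PENALTY[raid_level]

backend = 8 * 175          # e.g. eight 15K rpm drives at ~175 IOPS each (assumption)
for level in ("RAID5", "RAID6", "RAID10"):
    print(level, round(random_write_capability(backend, level)))
# RAID 6 delivers 4/6, that is two thirds, of the RAID 5 random-write throughput.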
During the rebuild of the data on the new drive, the DA can still handle I/O requests of the
connected hosts to the affected array. A performance degradation can occur during the
reconstruction, because DAs and back-end resources are involved in the rebuild. Additionally,
any read requests for data on the failed drive require data to be read from the other drives in
the array, and then the DA reconstructs the data. Any later failure during the reconstruction
within the same array (a second drive failure, a second coincident medium error, or a drive
failure and a medium error) can be recovered without loss of data.
Performance of the RAID 6 array returns to normal when the data reconstruction, on the
spare device, completes. The rebuild time varies, depending on the size of the failed DDM
and the workload on the array and the DA. The completion time is comparable to a RAID 5
rebuild, but slower than rebuilding a RAID 10 array in a single drive failure.
RAID 10 is not as commonly used as RAID 5, mainly because more raw disk capacity is
required for every gigabyte of effective capacity. Typically, RAID 10 is used for workloads with
a high random write ratio.
While this data reconstruction occurs, the DA can still service read and write requests to the
array from the hosts. There might be degradation in performance while the sparing operation
is in progress, because DA and switched network resources are used to reconstruct the data.
Write operations are not affected. Performance of the RAID 10 array returns to normal when
the data reconstruction, onto the spare device, completes. The time taken for sparing can
vary, depending on the size of the failed DDM and the workload on the array and the DA.
In relation to RAID 5, RAID 10 sparing completion time is a little faster. Rebuilding a RAID 5
6+P configuration requires six reads plus one parity operation for each write. A RAID 10 3+3
configuration requires one read and one write (a direct copy).
Each enclosure places two Fibre Channel (FC) switches onto each loop. FC/SAS DDMs are
purchased in groups of 16 (drive set). Half of the new DDMs go into one disk enclosure, and
half of the new DDMs go into the other disk enclosure of the pair. The same setup applies to
SSD and NL SAS drives where we also have a half drive set purchase option with only eight
DDMs.
An array site consists of eight DDMs. Four DDMs are taken from one enclosure in the disk
enclosure pair, and four are taken from the other enclosure in the pair. Therefore, when a
RAID array is created on the array site, half of the array is on each disk enclosure.
One disk enclosure of the pair is on one FC switched loop, and the other disk enclosure of the
pair is on a second switched loop. The array, or AAL, is split across two loops.
AAL is used to increase performance. When the DA writes a stripe of data to a RAID 5 array,
it sends half of the write to each switched loop. By splitting the workload in this manner, each
loop is worked evenly. This setup aggregates the bandwidth of the two loops and improves
performance. If RAID 10 is used, two RAID 0 arrays are created. Each loop hosts one RAID 0
array. When servicing read I/O, half of the reads can be sent to each loop, again improving
performance by balancing workload across loops.
A minimum of one spare is created for each array site assigned to an array until the following
conditions are met:
Minimum of four spares per DA pair
Minimum of four spares for the largest capacity array site on the DA pair
Minimum of two spares of capacity and rpm greater than or equal to the fastest array site
of any capacity on the DA pair
Floating spares
The DS8000 implements a smart floating technique for spare DDMs. A floating spare is
defined this way. When a DDM fails and the data it contained is rebuilt onto a spare, then
when the disk is replaced, the replacement disk becomes the spare. The data is not migrated
to another DDM, such as the DDM in the original position that the failed DDM occupied.
The DS8000 microcode takes this idea one step further. It might choose to allow the hot
spare to remain where it is moved, but it can instead choose to migrate the spare to a more
optimum position. This migration can better balance the spares across the DA pairs, the
loops, and the disk enclosures. It might be preferable that a DDM that is currently in use as an
array member is converted to a spare. In this case, the data on that DDM is migrated in the
background onto an existing spare. This process does not fail the disk that is being migrated,
although it reduces the number of available spares in the DS8000 until the migration process
is complete.
The DS8000 uses this smart floating technique so that the larger or higher rpm DDMs are
allocated as spares. Allocating the larger or higher rpm DDMs as spares ensures that a spare
can provide at least the same capacity and performance as the replaced drive. If we rebuild
the contents of a 450 GB DDM onto a 600 GB DDM, one-fourth of the 600 GB DDM is
wasted, because that space is not needed. When the failed 450 GB DDM is replaced with a
new 450 GB DDM, the DS8000 microcode most likely migrates the data back onto the
recently replaced 450 GB DDM. When this process completes, the 450 GB DDM rejoins the
array and the 600 GB DDM becomes the spare again.
Another example is a 146 GB 15K rpm DDM that fails and is rebuilt onto a 600 GB 10K rpm
spare DDM. The data is now on a slower DDM, significant space is wasted, and the array has
a mix of rpm speeds, which is not desirable. When the failed disk is replaced, the replacement
is the same type as the failed 15K rpm disk. Again, a smart migration of the data is performed
after suitable spares are available.
Overconfiguration of spares
The DDM sparing policies support the overconfiguration of spares. This possibility might be of
interest to certain installations, because it allows the deferral of the repair of certain DDM
failures until a later repair action is required.
By using the DSCLI, it is possible to check for any failed drives by issuing the
lsddm -state not_normal command. See Example 3-1.
The definition of virtualization is the abstraction process from the physical disk drives to a
logical volume that is presented to hosts and servers in a way that they see it as though it
were a physical disk.
When talking about virtualization, we mean the process of preparing physical disk drives
(DDMs) to become an entity that can be used by an operating system, which means we are
talking about the creation of logical unit numbers (LUNs).
The DDMs are mounted in disk enclosures and connected in a switched FC topology, by
using a Fibre Channel Arbitrated Loop (FC-AL) protocol. The DDM physical installation differs
between DS8800 and DS8700:
For the DS8700, DDMs are mounted in 16 DDM enclosures. You can order disk drives in
groups of 8 or 16 drives of the same capacity and revolutions per minute (rpm). The
options for 8-drive sets apply for the 600 GB solid-state drives (SSDs) and 2 TB nearline
drives.
The DS8800 small form factor disks are mounted in 24 DDM enclosures. Disk drives can
be ordered in groups of eight or 16 drives of the same capacity and rpm. The option for
8-drive sets applies to the 300 GB SSDs and 3 TB nearline drives. The DS8800 now also
supports a new high-density and lower-cost large form factor (LFF) storage enclosure.
This enclosure accepts 3.5 inch drives, offering 12 drive slots. The previously introduced
small form factor (SFF) enclosure offers twenty-four 2.5 inch drive slots. The appearance
of the front of the LFF enclosure differs from the appearance of the front of the SFF
enclosure, with its 12 drives that slot horizontally rather than vertically.
The disk drives can be accessed by a pair of DAs. Each DA has four paths to the disk drives.
One device interface from each DA connects to a set of FC-AL devices so that either DA can
access any disk drive through two independent switched fabrics (the DAs and switches are
redundant).
Because of the switching design, each drive is in close reach of the DA, although certain
drives require a few more hops through the FC switch. Therefore, it is not really a loop, but a
switched FC-AL implementation that uses the FC-AL addressing schema, that is, Arbitrated
Loop Physical Addressing (AL-PA).
Figure 3-1 Array site
As you can see in Figure 3-1, array sites span loops. Four DDMs are taken from loop 1 and
another four DDMs from loop 2. Array sites are the building blocks that are used to define
arrays.
3.2.2 Arrays
An array is created from one array site. Forming an array means defining it as a specific RAID
type. The supported RAID types are RAID 5, RAID 6, and RAID 10 (see 3.1, “RAID levels and
spares” on page 52). For each array site, you can select a RAID type. The process of
selecting the RAID type for an array is also called defining an array.
Important: In the DS8000 implementation, one array is defined by using one array site.
On the right side in Figure 3-2, the terms, D1, D2, D3, and so on, represent the set of data
contained on one disk within a stripe on the array. If, for example, 1 GB of data is written, it is
distributed across all disks of the array.
Figure 3-2 Creation of an array (array site on the left; data stripes D1, D2, D3, and so on, with parity P, on the right)
So, an array is formed by using one array site, and while the array can be accessed by each
adapter of the DA pair, it is managed by one DA. Later in the configuration process, you
define the adapter and the server that manage this array.
3.2.3 Ranks
In the DS8000 virtualization hierarchy, there is another logical construct, a rank. When
defining a new rank, its name is chosen by the DS Storage Manager, for example, R1, R2,
and R3. You must add an array to a rank.
Important: In the DS8000 implementation, a rank is built by using only one array.
The available space on each rank is divided into extents. The extents are the building blocks
of the logical volumes. An extent is striped across all disks of an array, as shown in Figure 3-3
on page 60, and indicated by the small squares in Figure 3-4 on page 62.
IBM System z users or administrators typically do not deal with gigabytes or gibibytes, and
instead they think of storage in the original 3390 volume sizes. A 3390 Model 3 is three times
the size of a Model 1. A Model 1 has 1113 cylinders (about 0.94 GB). The extent size of a
CKD rank is one 3390 Model 1 or 1113 cylinders.
Figure 3-3 shows an example of an array that is formatted for FB data with 1 GiB extents (the
squares in the rank indicate that the extent is composed of several blocks from DDMs).
Figure 3-3 Creation of an FB rank with 1 GiB extents from a RAID array
It is still possible to define a CKD volume with a capacity that is an integral multiple of one
cylinder or a fixed block LUN with a capacity that is an integral multiple of 128 logical blocks
(64 KB). However, if the defined capacity is not an integral multiple of the capacity of one
extent, the unused capacity in the last extent is wasted. For example, you can define a one
cylinder CKD volume, but 1113 cylinders (one extent) are allocated and 1112 cylinders are
wasted.
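The following Python sketch shows this extent-rounding arithmetic for both CKD and FB volumes; the function names are ours.

import math

CKD_EXTENT_CYLS = 1113                   # one CKD extent = one 3390 Model 1 = 1113 cylinders
FB_EXTENT_BLOCKS = 1 * 1024**3 // 512    # one FB extent = 1 GiB = 2,097,152 blocks of 512 bytes

def ckd_allocation(requested_cyls):
    """Extents allocated and cylinders wasted for a CKD volume."""
    extents = math.ceil(requested_cyls / CKD_EXTENT_CYLS)
    wasted = extents * CKD_EXTENT_CYLS - requested_cyls
    return extents, wasted

def fb_allocation(requested_blocks):
    """Extents allocated and blocks wasted for a fixed block LUN."""
    extents = math.ceil(requested_blocks / FB_EXTENT_BLOCKS)
    wasted = extents * FB_EXTENT_BLOCKS - requested_blocks
    return extents, wasted

print(ckd_allocation(1))        # (1, 1112): one extent allocated, 1112 cylinders wasted
print(ckd_allocation(3339))     # (3, 0): a 3390 Model 3 fits exactly into three extents
print(fb_allocation(128))       # (1, 2097024): a 64 KB LUN still consumes one 1 GiB extent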
Encryption group
A DS8000 series can be ordered with encryption-capable disk drives. If you plan to use
encryption, you must define an encryption group before you create a rank. For more
information, see IBM System Storage DS8700 Disk Encryption Implementation and Usage
Guidelines, REDP-4500. Currently, the DS8000 series supports only one encryption group.
All ranks must be in this encryption group. The encryption group is an attribute of a rank. So,
your choice is to encrypt everything or nothing. You can switch on encryption later (by
creating an encryption group), but then all ranks must be deleted and re-created, which
means that your data is also deleted.
With Easy Tier, it is possible to mix ranks with different characteristics and features in
managed extent pools to achieve the best performance results. With R6.2, you can mix up to
three storage classes or storage tiers within the same extent pool that feature SSD,
Enterprise, and nearline-class disks.
Important: In general, do not mix ranks with different RAID levels or disk types (size and
RPMs) in the same extent pool if you are not implementing Easy Tier automatic
management of these pools. Easy Tier has algorithms for automatically managing
performance and data relocation across storage tiers and even rebalancing data within a
storage tier across ranks in multi-tier or single-tier extent pools, providing automatic
storage performance and storage economics management with the best balance of price,
performance, and energy costs.
There is no predefined affinity of ranks or arrays to a storage server. The affinity of the rank
(and its associated array) to a certain server is determined at the point that the rank is
assigned to an extent pool.
One or more ranks with the same extent type (FB or CKD) can be assigned to an extent pool.
One rank can be assigned to only one extent pool. There can be as many extent pools as
there are ranks.
With storage pool striping (the extent allocation method (EAM) rotate extents), you can create
logical volumes striped across multiple ranks. This approach typically enhances performance.
To benefit from storage pool striping (see “Rotate extents (storage pool striping) extent
allocation method” on page 68), multiple ranks in an extent pool are required.
Storage pool striping enhances performance, but it also increases the failure boundary. When
one rank is lost, for example, in the unlikely event that a whole RAID array failed due to
multiple failures at the same time, the data of this single rank is lost. Also, all volumes in the
pool that are allocated with the rotate extents option are exposed to data loss. To avoid
exposure to data loss for this event, consider mirroring your data to a remote DS8000.
When an extent pool is defined, the following attributes must be assigned to it:
Server affinity (or rank group)
Storage type (either FB or CKD)
Encryption group
Just like ranks, extent pools also belong to an encryption group. When defining an extent
pool, you must specify an encryption group. Encryption group 0 means no encryption.
Encryption group 1 means encryption. Currently, the DS8000 series supports only one
encryption group, and encryption is on for all extent pools or off for all extent pools.
The minimum reasonable number of extent pools on a DS8000 is two. One extent pool is
assigned to storage server 0 (rank group 0), and the other extent pool is assigned to storage
server 1 (rank group 1) so that both DS8000 storage servers are active. In an environment
where FB storage and CKD storage share the DS8000 storage system, four extent pools
provide one FB pool and one CKD pool for each storage server, to balance the capacity and
workload between the two storage servers.
Extent pools are expanded by adding more ranks to the pool. Ranks are organized in two
rank groups: Rank group 0 is controlled by storage server 0 (processor complex 0), and rank
group 1 is controlled by storage server 1 (processor complex 1).
Important: For the best performance, balance capacity between the two servers and
create at least two extent pools, with one extent pool per DS8000 storage server.
(Figure: two extent pools of 1 GiB FB extents, one assigned to storage server 0 and one to storage server 1)
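As a sketch (the pool names and rank IDs are hypothetical), such a balanced minimum configuration can be created with the DSCLI by defining one FB extent pool per rank group and assigning ranks to each pool:
dscli> mkextpool -rankgrp 0 -stgtype fb fb_pool_server0
dscli> mkextpool -rankgrp 1 -stgtype fb fb_pool_server1
dscli> chrank -extpool P0 R0
dscli> chrank -extpool P1 R1
The system assigns even-numbered pool IDs (P0, P2, and so on) to rank group 0 and odd-numbered pool IDs (P1, P3, and so on) to rank group 1.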
Dynamic extent pool merge allows one extent pool to be merged into another extent pool
while the logical volumes in both extent pools remain accessible to the host servers. Dynamic
extent pool merge can be used for the following scenarios:
Use dynamic extent pool merge for the consolidation of smaller extent pools of the same
storage type into a larger homogeneous extent pool that uses storage pool striping.
Creating a larger extent pool allows logical volumes to be distributed evenly over a larger
number of ranks, which improves overall performance by minimizing skew and reducing
the risk of a single rank that becomes a hot spot. In this case, a manual volume rebalance
needs to be initiated to restripe all existing volumes evenly across all available ranks in the
new pool. Newly created volumes in the merged extent pool allocate capacity across all ranks of the larger pool according to their extent allocation method.
Important: You can only apply dynamic extent pool merge among extent pools associated
with the same DS8000 storage server affinity (storage server 0 or storage server 1) or rank
group. All even-numbered extent pools (P0, P2, P4, and so on) belong to rank group 0 and
are serviced by storage server 0. All odd-numbered extent pools (P1, P3, P5, and so on)
belong to rank group 1 and are serviced by storage server 1 (unless one DS8000 storage
server failed or is quiesced with a failover to the alternate storage server).
Additionally, the dynamic extent pool merge is not supported in these situations:
If source and target pools have different storage types (FB and CKD)
If both extent pools contain track space-efficient (TSE) volumes
If there are TSE volumes on the SSD ranks
If you select an extent pool that contains volumes that are being migrated
If the combined extent pools have 2 PB or more of extent space-efficient (ESE) volume
logical capacity
For more information about dynamic extent pool merge, see IBM System Storage DS8000
Easy Tier, REDP-4667.
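For illustration only (the pool IDs are hypothetical, and you should verify the exact parameter order in the DSCLI reference for your LMC level), an extent pool merge is started with the chextpool command and its -merge parameter:
dscli> chextpool -merge P2 P0
Both pools in this sketch have even IDs and therefore belong to rank group 0 (storage server 0), which is a prerequisite for the merge.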
With the Easy Tier feature, you can easily add capacity and even single ranks to existing
extent pools without concern about performance.
Manual volume rebalance is a feature of the Easy Tier manual mode and is only available in
non-managed, single-tier (homogeneous) extent pools. It allows a balanced redistribution of
the extents of a volume across all ranks in the pool. This feature is not available in managed
or hybrid pools where Easy Tier is supposed to manage the placement of the extents
automatically on the ranks in the pool based on their actual workload pattern, available
storage tiers, and rank utilizations.
This feature is useful for redistributing extents after adding new ranks to a non-managed
extent pool. Also, this feature is useful after merging homogeneous extent pools to balance
the capacity and the workload of the volumes across all available ranks in a pool.
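A manual volume rebalance is requested as a volume migration back into the same non-managed, single-tier pool. The following line is a sketch with hypothetical IDs, assuming that the managefbvol command (manageckdvol for CKD volumes) at your DSCLI level supports the migstart action and an -eam parameter:
dscli> managefbvol -action migstart -extpool P0 -eam rotateexts 1000-1003
Afterward, showfbvol -rank (or showckdvol -rank) can be used to verify the new extent distribution across the ranks.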
For further information about manual volume rebalance, see IBM System Storage DS8000
Easy Tier, REDP-4667.
Auto-rebalance
With Easy Tier automatic mode enabled for single-tier or multi-tier extent pools, for the
DS8000 R6.2 LMC or higher, you can benefit from Easy Tier automated intra-tier
performance management (auto-rebalance), which relocates extents based on rank
utilization, and reduces skew and avoids rank hot spots. Easy Tier relocates subvolume data
on extent level based on actual workload pattern and rank utilization (workload rebalance)
rather than balance the capacity of a volume across all ranks in the pool (capacity rebalance
as achieved with manual volume rebalance).
When adding capacity to managed pools, Easy Tier automatic mode performance management (auto-rebalance) automatically populates the new ranks that are added to the pool by rebalancing the workload within a storage tier and relocating subvolume data. Starting with the DS8000 R6.2 LMC, auto-rebalance can be
enabled for hybrid and homogeneous extent pools.
For more information about auto-rebalance, see IBM System Storage DS8000 Easy Tier,
REDP-4667.
Important: There is currently no Copy Services support for logical volumes larger than 2 TiB (2 x 2^40 bytes). Do not create LUNs larger than 2 TiB if you want to use the DS8000 Copy Services for those LUNs, unless you use them as managed disks in an IBM SAN Volume Controller (SVC) with release 6.2 (or higher) and use the SVC Copy Services instead.
LUNs can be allocated in binary GiB (2^30 bytes), decimal GB (10^9 bytes), or 512-byte or 520-byte blocks. However, the physical capacity that is allocated for a LUN is always a multiple of 1 GiB, so it is a good idea to have LUN sizes that are a multiple of a gibibyte. If you define a
LUN with a LUN size that is not a multiple of 1 GiB, for example, 25.5 GiB, the LUN size is
25.5 GiB, but a capacity of 26 GiB is physically allocated, wasting 0.5 GiB of physical storage
capacity.
CKD volumes
A System z CKD volume is composed of one or more extents from one CKD extent pool. CKD
extents are the size of 3390 Model 1, which has 1113 cylinders. However, when you define a
System z CKD volume, you do not specify the number of 3390 Model 1 extents but the
number of cylinders that you want for the volume.
The maximum size for a CKD volume was originally 65,520 cylinders. Later, the LMC
5.4.0.xx.xx introduced the Extended Address Volume (EAV) that can grow up to 262668
cylinders (about 223 GB). Starting with microcode R6.2, this limit is raised to 1182006
cylinders (EAV II). Currently, it is possible to create CKD volumes up to 935 GB. For more
information about EAV volumes, see DS8000 Series: Architecture and Implementation,
SG24-8886.
Important: EAV volumes are supported by z/OS Version 1.10 or later. EAV II volumes are
supported by z/OS Version 1.12 or later.
If the number of specified cylinders is not an exact multiple of 1113 cylinders, part of the
space in the last allocated extent is wasted. For example, if you define 1114 or 3340
cylinders, 1112 cylinders are wasted. For maximum storage efficiency, consider allocating
volumes that are exact multiples of 1113 cylinders. In fact, consider multiples of 3339
cylinders for future compatibility.
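As a sketch with hypothetical IDs (assuming that a logical control unit for LSS X'60' is created first and that the -cap value of mkckdvol is given in cylinders), CKD volumes sized as exact multiples of 1113 cylinders avoid wasted extent space:
dscli> mklcu -qty 1 -id 60 -ss FF60
dscli> mkckdvol -extpool P2 -cap 3339 -eam rotateexts 6000-6003
Each of these volumes consumes exactly three 1113-cylinder extents, with no space wasted in the last extent.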
A CKD volume cannot span multiple extent pools, but a volume can have extents from
different ranks in the same extent pool. You can stripe a volume across all ranks in an extent
pool by using the rotate extents EAM to distribute the capacity of the volume. This EAM
distributes the workload evenly across all ranks in the pool, minimizes skew, and reduces the
risk of single extent pools that become a hot spot. For more information, see “Rotate extents
(storage pool striping) extent allocation method” on page 68.
IBM i LUNs
IBM i LUNs are also composed of fixed block 1 GiB extents. There are, however, special aspects with IBM i (System i) LUNs. LUNs created on a DS8000 are always RAID-protected, based on RAID 5, RAID 6, or RAID 10 arrays. However, you might want to report the LUN to IBM i as not RAID-protected so that IBM i performs its own mirroring. For this purpose, IBM i LUNs can be created with the unprotected attribute, in which case the DS8000 reports the LUN as not RAID-protected.
For the DS8800 with LMC 7.6.1.xx or the DS8700 with LMC 6.5.1.xx, thin-provisioned
volumes are supported. Two types of space-efficient volumes are available: extent
space-efficient (ESE) volumes and track space-efficient (TSE) volumes. These concepts are
described in detail in DS8000 Thin Provisioning, REDP-4554.
A space-efficient volume does not occupy all of its physical capacity at one time when it is
created. The space gets allocated when data is written to the volume. The sum of all the
virtual capacity of all space-efficient volumes can be larger than the available physical
capacity, which is known as over-provisioning or thin provisioning.
TSE volumes on the DS8000 require the IBM FlashCopy SE feature (licensing is required).
TSE volumes were introduced in the DS8000 family with the DS8000 LMC Release 3.0 (LMC
5.3x.xx). TSE volumes were designed to support the IBM FlashCopy SE function. The
allocation of real capacity for TSE logical volumes is from a repository volume, which is
created per extent pool. To create a repository volume, the amount of real capacity that is
required must be supplied. In addition, the repository also requires a virtual capacity to
provide a pool for the TSE logical volumes. The ratio of the virtual capacity to the real capacity
represents the storage over-commitment.
Extent space-efficient (ESE) volumes on the DS8000 require the IBM System Storage
DS8000 Thin Provisioning feature (licensing is required). The ESE volumes are
thin-provisioned volumes designated for standard host access. The virtual capacity pool for
ESE volumes is created per extent pool. There is no dedicated repository with physical
capacity required for ESE volumes as for TSE volumes. The available physical capacity pool
for allocation is the unused physical capacity in the extent pool. When an ESE logical volume
is created, the volume is not allocated physical data capacity. However, the DS8000 allocates
capacity for metadata that it uses to manage space allocation. Additional physical capacity is
allocated when writes to an unallocated address occur. Before the write, a new extent is
dynamically allocated, initialized, and eventually assigned to the volume (QuickInit).
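A hedged sketch (the pool and volume IDs are hypothetical): provided that the Thin Provisioning feature is licensed and virtual capacity is configured for the pool, an ESE volume is created by selecting the extent space-efficient storage allocation method:
dscli> mkfbvol -extpool P4 -cap 200 -sam ese 4000
The volume reports its full virtual capacity to the host, but physical extents are allocated only as data is written to it.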
The idea behind space-efficient volumes is to allocate physical storage at the time that it is
needed to satisfy temporary peak storage needs.
Important: No Copy Services support exists for logical volumes larger than 2 TiB (2 x 2^40 bytes). Thin-provisioned (ESE) volumes, as released with the DS8000 LMC R6.2, are fully supported by FlashCopy but are not yet supported for CKD volumes and IBM i volumes.
Thin-provisioned volumes are not fully supported with all DS8000 Copy Services or
advanced functions yet. These restrictions might change with future DS8000 LMC
releases, so check the related documentation for your DS8000 LMC release.
Tip: Serial Advanced Technology Attachment (SATA) and nearline SAS drives are generally not recommended for the space-efficient FlashCopy SE repository because of performance considerations.
The repository is an object within an extent pool. It is similar to a volume within the extent
pool. The repository has a physical size and a virtual size. The physical size of the repository
is the amount of space that is allocated in the extent pool. It is the physical space that is
available for all TSE logical volumes, in total, in this extent pool. The repository is striped
across all ranks within the extent pool. There is one repository only per extent pool.
Important: The size of the repository and the virtual space that it uses are part of the
extent pool definition. Each extent pool can have a TSE volume repository, but this physical
space cannot be shared across extent pools. Virtual space in an extent pool is used for
both TSE and ESE volumes. The repository is only used for TSE volumes for FlashCopy
SE. ESE volumes use available extents in the extent pool in a similar fashion as standard,
fully provisioned volumes as far as their specified extent allocation method, but extents are
only allocated as needed when data is written to the ESE volume.
The logical size of the repository is limited by the available virtual capacity for space-efficient
volumes. For example, a repository of 100 GB of reserved physical storage exists, and you
define a virtual capacity of 200 GB. In this case, you can define 10 TSE LUNs of 20 GB each.
So, the logical capacity can be larger than the physical capacity. You cannot fill all the
volumes with data, because the total physical capacity is limited by the repository size, which
is 100 GB in this example.
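This 100 GB / 200 GB example could be configured along the following lines (a sketch with hypothetical pool and volume IDs; the mksestg command creates the space-efficient repository and virtual capacity in an extent pool):
dscli> mksestg -extpool P3 -repcap 100 -vircap 200
dscli> mkfbvol -extpool P3 -cap 20 -sam tse 3100-3109
The ten 20 GB TSE volumes consume the 200 GB of virtual capacity, while their physical allocations are limited to the 100 GB repository.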
Important: In the current implementation of TSE volumes, it is not possible to expand the
physical size of the repository. Therefore, careful planning for the size of the repository is
required before it is used. If a repository needs to be expanded, all TSE volumes within this
extent pool must be deleted, and the repository deleted and re-created with the required
size.
Space allocation
Space for a space-efficient volume is allocated when a write occurs. More precisely, it is
allocated when a destage from the cache occurs and a new track or extent needs to be
allocated. The TSE allocation unit is a track (64 KB for Open Systems LUNs or 57 KB for CKD
volumes).
Because space is allocated in extents or tracks, the system needs to maintain tables that
indicate their mapping to the logical volumes, so there is a performance impact involved with
space-efficient volumes. The smaller the allocation unit (a track for TSE volumes), the larger
the tables and possible effect on performance.
Virtual space is created as part of the extent pool definition. This virtual space is mapped to
ESE volumes in the extent pool and TSE volumes in the repository, as needed. The virtual
capacity equals the total logical capacity of all ESE and TSE volumes. No physical storage
capacity (other than for the metadata) is allocated until write activity occurs to the ESE or TSE
volumes.
Tip: In the DS8000 LMC R6.2, ESE volumes are fully supported by Easy Tier. TSE
volumes are not supported by Easy Tier.
When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed.
Tip: Rotate extents and rotate volume EAMs determine the distribution of a volume
capacity and, thus, the volume workload distribution across the ranks in an extent pool.
Rotate extents (the default EAM) evenly distributes the capacity of a volume at a granular 1
GiB extent level across the ranks in a homogeneous extent pool. It is the preferred method
to reduce skew and minimize hot spots, improving overall performance.
You might want to consider the rotate volumes allocation method when you manage performance manually, because the workload of one volume goes to one rank. This method helps you identify performance bottlenecks more easily. However, by placing all volume data and workload on one rank, you
increase skew and the likelihood of limiting overall system performance with single ranks that
become a bottleneck.
In managed homogeneous extent pools with only a single storage tier, the initial extent
allocation for a new volume is the same as with rotate extents or storage pool striping. For a
volume, the appropriate DSCLI command, showfbvol or showckdvol, used with the -rank
option, allows the user to list the number of allocated extents of a volume on each associated
rank in the extent pool.
Starting with the DS8000 R6.1 LMC, the EAM attribute of any volume created or already in a
managed extent pool is changed to managed after Easy Tier automatic mode is enabled for
the pool. When enabling Easy Tier automatic mode for all extent pools, that is, hybrid and
homogeneous extent pools, all volumes immediately become managed by Easy Tier. And,
the EAM attribute of all volumes on a DS8000 system is changed to managed. When set to
managed, the EAM attribute setting for the volume is permanent. All previous volume EAM
attribute information, such as rotate extents or rotate volumes, is lost.
However, if you use, for example, Physical Partition striping in AIX already, double striping
probably does not improve performance any further.
If you decide to use storage pool striping, it is probably better to use this allocation method for
all volumes in the extent pool to keep the ranks equally allocated and utilized.
Tip: When configuring a new DS8000, do not generally mix volumes that use the rotate
extents EAM (storage pool striping) and volumes that use the rotate volumes EAM in the
same extent pool.
Striping a volume across multiple ranks also increases the failure boundary. If you have extent pools with many ranks and all volumes striped across these ranks, you lose the data of all volumes in the pool if one rank fails, for example, if two disk drives in the same RAID 5 rank fail at the same time.
If multiple EAM types are used in the same extent pool, use Easy Tier manual mode to change the EAM from rotate volumes to rotate extents and vice versa. Use volume migration (dynamic volume relocation) in non-managed, homogeneous extent pools for this purpose.
However, before switching the EAM of a volume, consider that you might need to change other volumes in the same extent pool before your volume can be distributed evenly across the ranks. For example, assume that you created many volumes with the rotate volumes EAM and only a few with rotate extents, and now you want to switch only one volume to rotate extents. The ranks might not have enough free extents available for Easy Tier to balance all the extents of that volume evenly over all ranks. In this case, you might have to proceed in multiple steps and switch every volume to the new EAM type rather than only one volume. Depending on your case, you might also consider moving volumes to another extent pool before reorganizing all volumes and extents in the extent pool.
In certain cases, for example, after merging extent pools, you must also plan to reorganize the extents in the new extent pool by using manual volume rebalance, so that the extents of the volumes are properly redistributed across the ranks in the pool.
For more information about Easy Tier features and modes, see IBM System Storage DS8000
Easy Tier, REDP-4667.
This construction method of using fixed extents to form a logical volume in the DS8000 series
allows flexibility in the management of the logical volumes. We can delete LUNs/CKD
volumes, resize LUNs/volumes, and reuse the extents of those LUNs to create other
LUNs/volumes, maybe of different sizes. One logical volume can be removed without
affecting the other logical volumes defined on the same extent pool.
After the logical volume is created and available for host access, it is placed in the normal
configuration state. If a volume deletion request is received, the logical volume is placed in the
deconfiguring configuration state until all capacity associated with the logical volume is
deallocated and the logical volume object is deleted.
The reconfiguring configuration state is associated with a volume expansion request. The
transposing configuration state is associated with an extent pool merge. The migrating,
migration paused, migration error, and migration canceled configuration states are associated
with a volume relocation request.
Important: Before you can expand a volume, you must remove any Copy Services
relationships that involve that volume.
Important: Dynamic volume relocation can be applied among extent pools associated with
the same DS8000 storage server affinity (storage server 0 or storage server 1) or rank
group only. All volumes in even-numbered extent pools (P0, P2, P4, and so on) belong to
rank group 0 and are serviced by storage server 0. All volumes in odd-numbered extent
pools (P1, P3, P5, and so on) belong to rank group 1 and are serviced by storage server 1
(unless one DS8000 storage server failed or was quiesced with a failover to the alternate
storage server).
Important: A volume migration with dynamic volume relocation back into the same extent pool (for example, manual volume rebalance for restriping purposes) is not supported in managed or hybrid extent pools. Hybrid pools are always supposed to be prepared for Easy Tier automatic management. In pools under the control of Easy Tier automatic mode, the volume placement is managed automatically by Easy Tier, which relocates extents across ranks and storage tiers to optimize storage performance and storage efficiency. However, it is always possible to migrate volumes across extent pools, regardless of whether those pools are managed, non-managed, or hybrid pools.
For additional information about this topic, see IBM System Storage DS8000 Easy Tier,
REDP-4667.
On the DS8000 series, there is no fixed binding between a rank and a logical subsystem. The
capacity of one or more ranks can be aggregated into an extent pool. The logical volumes
configured in that extent pool are not necessarily bound to a specific rank. Different logical
volumes on the same logical subsystem can even be configured in separate extent pools. The
available capacity of the storage facility can be flexibly allocated across logical subsystems
and logical volumes. You can define up to 255 LSSs on a DS8000 system.
For each LUN or CKD volume, you must select an LSS when creating the volume. The LSS is
part of the volume ID ‘abcd’ and reflected in the first two digits ‘ab’. The volume ID is in
hexadecimal notation and needs to be specified upon volume creation.
You can have up to 256 volumes in one LSS. There is, however, one restriction. Volumes are
created from extents of an extent pool. An extent pool, however, is associated with one
DS8000 storage server (also called a CEC or processor complex): server 0 or server 1. The
LSS number also reflects this affinity to one of these DS8000 storage servers. All
even-numbered LSSs (X'00', X'02', X'04', up to X'FE') are serviced by storage server 0 (rank
group 0). All odd-numbered LSSs (X'01', X'03', X'05', up to X'FD') are serviced by storage
server 1 (rank group 1). LSS X’FF’ is reserved.
All logical volumes in an LSS must be either CKD or FB. LSSs are further grouped into address groups of 16 LSSs. All LSSs within one address group must be of the same storage type,
either CKD or FB. The first digit of the LSS ID or volume ID specifies the address group. For
more information, see 3.2.10, “Address groups” on page 73.
System z users are familiar with a logical control unit (LCU). System z operating systems configure LCUs to create device addresses. There is a one-to-one relationship between an LCU and a CKD LSS.
For Open Systems, LSSs do not play an important role other than associating a volume with a
specific rank group and server affinity (storage server 0 or storage server 1) or grouping hosts
and applications together under selected LSSs for the DS8000 Copy Services relationships
and management.
Tip: Certain management actions in Metro Mirror, Global Mirror, or Global Copy operate at
the LSS level. For example, the freezing of pairs to preserve data consistency across all
pairs is at the LSS level. The option to put all or a set of volumes of a certain application in
one LSS can make the management of remote copy operations easier under certain
circumstances.
Important: LSSs for FB volumes are created automatically when the first FB logical
volume on the LSS is created, and deleted automatically when the last FB logical volume
on the LSS is deleted. CKD LSSs require user parameters to be specified and must be
created before the first CKD logical volume can be created on the LSS. They must be
deleted manually after the last CKD logical volume on the LSS is deleted.
All devices in an address group must be either CKD or FB. LSSs are grouped into address
groups of 16 LSSs. LSSs are numbered X'ab', where a is the address group. So all LSSs
within one address group have to be of the same type, CKD or FB. The first LSS defined in an
address group sets the type of that address group. For example, LSS X'10' to LSS X'1F' are
all in the same address group and therefore can all be used only for the same storage type,
either FB or CKD.
Figure 3-5 on page 74 shows the concept of volume IDs, LSSs, and address groups.
Figure 3-5 Volume IDs, logical subsystems, and address groups on the DS8000 storage systems
The volume ID X'gabb' in hexadecimal notation is composed of the address group X'g', and
the LSS ID X'ga', and the volume number X'bb' within the LSS. For example, LUN X'2101'
denotes the second (X'01') LUN in LSS X'21' of address group 2.
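For example (the extent pool and capacity are hypothetical), creating volumes with the IDs 2100-210F places 16 LUNs in LSS X'21' of address group 2. Because the LSS number is odd, the volumes are serviced by storage server 1 and must therefore be created in an odd-numbered (rank group 1) extent pool:
dscli> mkfbvol -extpool P1 -cap 50 2100-210F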
Host attachment
Host bus adapters (HBAs) are identified to the DS8000 in a host attachment or host
connection construct that specifies the HBA worldwide port names (WWPNs). A set of host
ports (host connections) can be associated through a port group attribute in the DSCLI that
allows a set of HBAs to be managed collectively. This group is called a host attachment within
the GUI.
Each host attachment can be associated with a volume group to define which LUNs that HBA
is allowed to access. Multiple host attachments can share the volume group. The host
attachment can also specify a port mask that controls which DS8000 I/O ports the HBA is
allowed to log in to. Whichever ports the HBA logs in to, it sees the same volume group that is
defined in the host attachment that is associated with this HBA.
When used in conjunction with Open Systems hosts, a host attachment object that identifies
the HBA is linked to a specific volume group. You must define the volume group by indicating
which FB logical volumes are to be placed in the volume group. Logical volumes can be
added to or removed from any volume group dynamically.
One host connection can be assigned to one volume group only. However, the same volume
group can be assigned to multiple host connections. An FB logical volume can be assigned to
one or more volume groups. Assigning a logical volume to different volume groups allows a LUN to be shared by hosts that are each configured with their own dedicated volume group and set of volumes (when the hosts do not share an identical set of volumes).
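A minimal sketch with hypothetical names, WWPNs, and IDs: a volume group is created with the FB volumes that the host is allowed to access, and each HBA is then defined as a host connection that references this volume group (the profile name and the returned volume group ID V8 are examples only):
dscli> mkvolgrp -type scsimask -volume 2100-210F aix_host1_vg
dscli> mkhostconnect -wwname 10000000C9112233 -profile "IBM pSeries - AIX" -volgrp V8 host1_fcs0
A second host connection for another HBA of the same host would reference the same volume group so that all paths see the same set of LUNs.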
Next, logical volumes are created within the extent pools (optionally striping the volumes) and assigned a logical volume number that determines the logical subsystem that they are associated with and indicates which server manages them. Space-efficient volumes can be created within the repository of the extent pool. Then, the LUNs can be assigned to one or more volume groups. Finally, the HBAs are configured into a host attachment that is associated with a volume group.
This virtualization concept provides for greater flexibility. Logical volumes can dynamically be
created, deleted, migrated, and resized. They can be grouped logically to simplify storage
management. Large LUNs and CKD volumes reduce the required total number of volumes,
which also contributes to a reduction of management efforts.
(Figure: DS8000 virtualization hierarchy from 1 GiB FB extents with RAID data, parity, and spare disks, assigned to storage server 0, up to volumes, LSSs, and address groups, for example, FB address group X'2x' and CKD address group X'3x' with 4096 device addresses each)
It can also help you fine-tune system performance from an extent pool perspective, for
example, sharing the resources of an extent pool evenly between application workloads or
isolating application workloads to dedicated extent pools. Data placement can help you when
planning for dedicated extent pools with different performance characteristics and storage
tiers without using Easy Tier automatic management. Plan your configuration carefully to
meet your performance goals by minimizing potential performance limitations that might be
introduced by single resources that become a bottleneck due to workload skew. For example,
use rotate extents as the default EAM to help reduce the risk of single ranks that become a
hot spot and limit the overall system performance due to workload skew.
If workload isolation is required in your environment, you can isolate workloads and I/O on the
rank and DA levels on the DS8000 systems, if required.
Important: It is important to understand that the greatest amount of isolation that you can
achieve within the DS8000 is at the array/rank level, not the physical drive level.
You can manually manage storage tiers related to different homogeneous extent pools of the same storage class and plan for appropriate extent pools for your specific performance needs. Remember that DS8800 and DS8700 systems offer the IBM System Storage Easy Tier feature, which can automate this cross-tier management for you.
In a RAID 5 6+P or 7+P array, the amount of capacity equal to one disk is used for parity
information. However, the parity information is not bound to a single disk. Instead, it is striped
across all the disks in the array, so all disks of the array are involved to service I/O requests
equally.
In a RAID 6 5+P+Q or 6+P+Q array, the amount of capacity equal to two disks is used for
parity information. As with RAID 5, the parity information is not bound to single disks, but
instead is striped across all the disks in the array. So, all disks of the array service I/O
requests equally. However, a RAID 6 array might have one less drive available than a RAID 5
array configuration. Nearline drives, for example, allow RAID 6 configurations only.
In a RAID 10 3+3 array, the available usable space is the capacity of only three disks. Two
disks of the array site are used as spare disks. When a LUN is created from the extents of this
array, the data is always mirrored across two disks in the array. Each write to the array must
be performed twice to two disks. There is no additional parity information in a RAID 10 array
configuration.
Important: The spares in the mirrored RAID 10 configuration act independently; they are
not mirrored spares.
In a RAID 10 4+4 array, the available usable space is the capacity of four disks. No disks of
the array site are used as spare disks. When a LUN is created from the extents of this array,
the data is always mirrored across two disks in the array. Each write to the array must be
performed twice to two disks.
Important: The stripe width for the RAID arrays differs in size with the number of active
disks that hold the data. Due to the different stripe widths that make up the extent from
each type of RAID array, it is generally not a preferred practice to intermix RAID array types
within the same extent pool, especially in homogeneous extent pool configurations that do
not use Easy Tier automatic management. With Easy Tier enabled, the benefit of
automatic storage performance and storage efficiency management combined with Easy
Tier micro-tiering capabilities typically outperforms the disadvantage of different RAID
arrays within the same pool.
Even in single-tier homogeneous extent pools, you can already benefit from Easy Tier
automatic mode (DSCLI chsi ETautomode=all). It manages the subvolume data placement
within the managed pool based on rank utilization and thus reduces workload skew and hot
spots (auto-rebalance).
In multi-tier hybrid extent pools, you can fully benefit from Easy Tier automatic mode (DSCLI
chsi ETautomode=all|tiered). It provides full automatic storage performance and storage
economics management by optimizing subvolume data placement in a managed extent pool
across different storage tiers and even across ranks within each storage tier (auto-rebalance).
Easy Tier automatic mode and hybrid extent pool configurations offer the most efficient way to
use different storage tiers. It optimizes storage performance and storage economics across
three drive tiers to manage more applications effectively and efficiently with a single DS8000
system at an optimum price versus the performance and footprint ratio.
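The setting referenced in the text maps to the DSCLI chsi command. A sketch, assuming that the storage image ID is replaced with your own and that the parameter spelling matches your DSCLI level:
dscli> chsi -etautomode all IBM.2107-75ABCD1
Setting -etautomode to tiered restricts automatic management to hybrid (multi-tier) pools, whereas all also manages homogeneous single-tier pools (auto-rebalance); none disables automatic mode.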
The data placement and extent distribution of a volume across the ranks in an extent pool can
always be displayed by using the DSCLI showfbvol -rank or showckdvol -rank command, as
shown in Example 3-2 on page 85.
Before configuring extent pools and volumes, be aware of the basic configuration principles
about workload sharing, isolation, and spreading, as described in 4.2, “Configuration
principles for optimal performance” on page 90.
The first example shown in Figure 3-7 on page 79 illustrates an extent pool with only one
rank, which is also referred to a single-rank extent pool. This approach is common if you plan
to use the SAN Volume Controller (SVC), for example, or if you plan a configuration that uses
the maximum isolation that you can achieve on the rank/extpool level. In this type of a
single-rank extent pool configuration, all volumes created are bound to a single rank. This
type of configuration requires careful logical configuration and performance planning,
because single ranks are likely to become a hot spot and might limit overall system
performance. It also requires the highest administration and management effort, because
workload skew typically varies over time. You might constantly monitor your system
performance and need to react to hot spots. It also considerably limits the benefits that a
DS8000 system can offer regarding its virtualization and Easy Tier automatic management
capabilities.
In these configurations, we highly advise that you use host-based striping methods to achieve
a balanced distribution of the data and the I/O workload across the ranks and back-end disk
resources. For example, you can use IBM AIX LVM or SVC to stripe the volume data across
ranks.
In Figure 3-7, we show the data placement of two volumes created in an extent pool with a
single rank. Volumes created in this extent pool always use extents from rank R6 and, thus,
are limited to the capacity and performance capability of this single rank. Without the use of
any host-based data and workload striping methods across multiple volumes from different
extent pools and ranks, this rank is likely to experience rank hot spots and performance
bottlenecks.
Also, in this example, one host can easily degrade the whole rank, depending on its I/O
workload, and affect multiple hosts that share volumes on the same rank if you have more
than one LUN allocated in this extent pool.
(Figure 3-7: single-tier, single-rank extent pool P2 with rank R6, containing LUN 6 of Host B and LUN 7 of Host C)
The second example, which is shown in Figure 3-8 on page 81, illustrates an extent pool with
multiple ranks of the same storage class or storage tier, which we refer to as a homogeneous
or single-tier extent pool. In general, an extent pool with multiple ranks also is called a
multi-rank extent pool.
Although in principle, both EAMs (rotate extents and rotate volumes) are available for
non-managed homogeneous extent pools, it is preferable to use the default allocation method
of rotate extents (storage pool striping). Use this EAM to distribute the data and thus the
workload evenly across all ranks in the extent pool and minimize the risk of workload skew
and a single rank that becomes a hot spot.
The use of the EAM rotate volumes still can isolate volumes to separate ranks, even in
multi-rank extent pools, wherever such configurations are required. This EAM minimizes the
configuration effort, because a set of volumes distributed across different ranks can be
created with a single command. Plan your configuration and performance needs to implement
host-level-based methods to balance the workload evenly across all volumes and ranks.
Figure 3-8 on page 81 is an example of storage pool striping for LUNs 1 - 4. It shows more
than one host and more than one LUN distributed across the ranks. In contrast to our
suggestion, we also show an example of LUN 5 being created with the rotate volumes EAM in
the same pool. The storage system tries to allocate the continuous space available for this
volume on a single rank (R1) until there is insufficient capacity left on this rank and then it
spills over to the next available rank (R2). All workload on this LUN is limited to these two
ranks. This approach considerably increases the workload skew across all ranks in the pool
and the likelihood that these two ranks might become a bottleneck for all volumes in the pool,
which reduces overall pool performance.
Enabling Easy Tier automatic mode for homogeneous, single-tier extent pools always is an
additional option to let the DS8000 system manage system performance in the pools based
on rank utilization (auto-rebalance). The EAM of all volumes in the pool becomes managed in
this case. With Easy Tier and its advanced micro-tiering capabilities that take different RAID
levels and drive characteristics into account for determining the rank utilization in managed
pools, even a mix of different drive characteristics and RAID levels of the same storage tier
might be an option for certain environments.
Multiple hosts with multiple LUNs, as shown in Figure 3-8 on page 81, share the resources
(resource sharing) in the extent pool, that is, ranks, DAs, and physical spindles. If one host or
LUN has a high workload, I/O contention can result and easily affect the other application
workloads in the pool, especially if all applications have their workload peaks at the same
time. Alternatively, applications can benefit from a much larger amount of disk spindles and
thus larger performance capabilities in a shared environment in contrast to workload isolation
and only dedicated resources. With resource sharing, we generally expect that not all
applications peak at the same time, so that each application typically benefits from the larger
amount of disk resources that it can use. The resource sharing and storage pool striping in
non-managed extent pools method is a good approach for most cases if no other
requirements, such as workload isolation or a specific quality of service requirements, dictate
another approach.
With Easy Tier and I/O Priority Manager, the DS8000 systems offer advanced features when
taking advantage of resource sharing to minimize administration efforts and reduce workload
skew and hot spots while benefitting from automatic storage performance, storage
economics, and workload priority management. The use of these features in the DS8000
environments is highly encouraged. These features generally help to provide excellent overall
system performance while ensuring quality of service (QoS) levels by prioritizing workloads in
shared environments at a minimum of administration effort and at an optimum
price-performance ratio.
(Figure 3-8: single-tier, multi-rank extent pool with ranks R1 - R4; LUNs 1 - 4 of Hosts A and B are striped across all ranks, and LUN 5 of Host C is allocated with the rotate volumes EAM)
When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed. This
situation is true no matter which EAM is specified at volume creation. The volume is under
control of Easy Tier. Easy Tier moves extents to the most appropriate storage tier and rank in
the pool based on performance aspects. Any specified EAM, such as rotate extents or rotate
volumes, is ignored.
In managed or hybrid extent pools, an initial EAM that is similar to rotate extents for new
volumes is used. The same situation applies if an existing volume is manually moved to a
managed or hybrid extent pool by using volume migration or dynamic volume relocation. In
hybrid or multi-tier extent pools (whether currently managed or non-managed by Easy Tier),
initial volume creation always starts on the ranks of the Enterprise tier first. The Enterprise tier
is also called the home tier. The extents of a new volume are distributed in a rotate extents or
storage pool striping fashion across all available ranks in this home tier in the extent pool, as
long as sufficient capacity is available. Only when all capacity on the home tier in an extent
pool is consumed does volume creation continue on the ranks of the Nearline tier. When all
capacity on the Enterprise tier and Nearline tier is exhausted, then volume creation continues
allocating extents on the SSD tier. The initial extent allocation in non-managed hybrid pools
differs from the extent allocation in single-tier extent pools with rotate extents (the extents of a
volume are not evenly distributed across all ranks in the pool because of the different
treatment of the different storage tiers). However, the attribute for the EAM of the volume is
shown as rotate extents, as long as the pool is not under Easy Tier automatic mode control.
After Easy Tier automatic mode is enabled for a hybrid pool, the EAM of all volumes in that
pool becomes managed.
The ratio of SSD capacity to hard disk drive (HDD) disk capacity in a hybrid pool depends on
the workload characteristics and skew. Ideally, there must be enough SSD capacity to hold
the active (hot) extents in the pool, but not more, to not waste expensive SSD capacity. For
new DS8000 orders, 3 - 5% of SSD capacity might be a reasonable percentage to plan with
hybrid pools in typical environments. This configuration can result in the movement of 50% of
the small and random I/O workload from Enterprise drives to SSDs. This configuration
provides a reasonable initial estimate if measurement data is not available to support
configuration planning. See “Drive Selection with Easy Tier” in IBM System Storage DS8800
and DS8700 Performance with Easy Tier 3rd Generation, WP102024, for additional
information.
The Storage Tier Advisor Tool (STAT) also provides guidance for SSD capacity planning
based on the existing workloads on a DS8000 with Easy Tier monitoring capabilities. For
additional information about the STAT, see 6.7, “Storage Tier Advisor Tool” on page 231.
Figure 3-9 shows a configuration of a managed 2-tier extent pool with an SSD and Enterprise
storage tier. All LUNs are managed by Easy Tier. Easy Tier automatically and dynamically
relocates subvolume data to the appropriate storage tier and rank based on their workload
patterns. Figure 3-9 shows multiple LUNs from different hosts allocated in the 2-tier pool with
hot data already migrated to SSDs. Initial volume creation in this pool always allocates
extents on the Enterprise tier first, as long as capacity is available, before Easy Tier
automatically starts promoting extents to the SSD tier.
Figure 3-9 Multi-tier extent pool with SSD and Enterprise ranks
Figure 3-10 Multi-tier extent pool with Enterprise and nearline ranks
Figure 3-11 on page 84 shows a configuration of a managed 3-tier extent pool with an SSD,
Enterprise, and Nearline storage tier. All LUNs are managed by Easy Tier. Easy Tier
automatically and dynamically relocates subvolume data to the appropriate storage tier and
rank based on their workload patterns. With more than one rank in the Enterprise storage tier,
Easy Tier also balances the workload and resource usage across the ranks within this
storage tier (auto-rebalance). Figure 3-11 on page 84 shows multiple LUNs from different
hosts allocated in the 3-tier pool with hot data promoted to the SSD tier and cold data
demoted to the Nearline tier. Initial volume creation in this pool always allocates extents on
the Enterprise tier first, as long as capacity is available, before Easy Tier automatically starts
promoting extents to the SSD tier or demoting extents to the Nearline tier.
We highly encourage the use of hybrid extent pool configurations under automated Easy Tier
management. It provides ease of use with minimum administration and performance
management efforts while optimizing the system performance, price, footprint, and energy
costs.
Figure 3-11 Multi-tier extent pool with SSD, Enterprise, and nearline ranks
The initial allocation of extents for a volume in a managed single-tier pool is similar to the
rotate extents EAM or storage pool striping. So, the extents are evenly distributed across all
ranks in the pool right after the volume creation. The initial allocation of volumes in hybrid
extent pools differs slightly. The extent allocation always begins in a rotate extents-like fashion
on the ranks of the Enterprise tier first, and then continues on the nearline ranks, and
eventually on the SSD ranks, if the capacity on the Enterprise and nearline ranks is not
sufficient.
After the initial extent allocation of a volume in the pool, the extents and their placement on
the different storage tiers and ranks are managed by Easy Tier. Easy Tier collects workload
statistics for each extent in the pool and creates migration plans to relocate the extents to the
appropriate storage tiers and ranks. The extents are promoted to higher tiers or demoted to
lower tiers based on their actual workload patterns. The data placement of a volume in a
managed pool is no longer static or determined by its initial extent allocation. The data
placement of the volume across the ranks in a managed extent pool is subject to change over
time to constantly optimize storage performance and storage economics in the pool. This
process is ongoing and always adapting to changing workload conditions. After Easy Tier
data collection and automatic mode are enabled, it might require up to 24 hours before the
first migration plan is created and being applied. For Easy Tier migration plan creation and
timings, see IBM System Storage DS8000 Easy Tier, REDP-4667.
The DSCLI showfbvol -rank or showckdvol -rank command can help to show the current
extent distribution of a volume across the ranks, as shown in Example 3-2 on page 85. In this example, the 100-extent volume 8601 is managed by Easy Tier (eam managed) and currently has extents on three ranks: R0, R18, and R22.
Example 3-2 Use the showfbvol -rank to show the volume to rank relationship in a multi-tier pool
dscli> showfbvol -rank 8601
Name -
ID 8601
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512
addrgrp 8
extpool P6
exts 100
captype DS
cap (2^30B) 100.0
cap (10^9B) -
cap (blocks) 209715200
volgrp V8
ranks 3
dbexts 0
sam Standard
repcapalloc -
eam managed
reqcap (blocks) 209715200
realextents 100
virtualextents 0
migrating 0
perfgrp PG0
migratingfrom -
resgrp RG0
==============Rank extents==============
rank extents
============
R0 16
R18 24
R22 60
The volume heat distribution (volume heat map), provided by the STAT, helps you to identify
hot, warm, and cold extents for each volume and its current distribution across the storage
tiers in the pool. For more information about the STAT, see 6.7, “Storage Tier Advisor Tool” on
page 231.
Easy Tier constantly adapts to changing workload conditions and relocates extents, so the
extent locations of a volume are subject to a constant change over time that depends on its
workload pattern. However, the number of relocations decreases and eventually becomes
marginal, as long as the workload pattern about the decision windows of Easy Tier remains
steady.
Figure 3-12 Data placement in a managed 3-tier extent pool with Easy Tier over time
Important: Before reading this chapter, familiarize yourself with the material covered in
Chapter 3, “Logical configuration concepts and terminology” on page 51.
This chapter introduces a step-by-step approach to configuring the DS8000 workload and
performance considerations:
Review the tiered storage concepts and Easy Tier
Understand the configuration principles for optimal performance:
– Workload isolation
– Workload resource-sharing
– Workload spreading
Analyze workload characteristics to determine isolation or resource-sharing
Plan allocation of the DS8000 disk and host connection capacity to identified workloads
Plan spreading volumes and host connections for the identified workloads
Plan array sites
Plan RAID arrays and ranks with RAID-level performance considerations
Plan extent pools with single-tier and multi-tier extent pool considerations
Plan address groups, logical subsystems (LSSs), volume IDs, and count key data (CKD)
Parallel Access Volumes (PAVs)
Plan I/O port IDs, host attachments, and volume groups
Implement and document the DS8000 logical configuration
Typically, an optimal design keeps the active operational data in Tier 0 and Tier 1 and uses
Tiers 2 and 3 for less active data, as shown in Figure 4-2 on page 89.
The benefits associated with a tiered storage approach mostly relate to cost. By introducing
SSD storage as tier 0, you might more efficiently address the highest performance needs
while reducing the Enterprise class storage, system footprint, and energy costs. A tiered
storage approach can provide the performance you need and save significant costs
associated with storage, because lower-tier storage is less expensive. Environmental savings,
such as energy, footprint, and cooling reductions, are possible. However, the overall
management effort increases when managing storage capacity and storage performance
needs across multiple storage classes.
With dramatically high I/O rates, low response times, and IOPS-energy-efficient characteristics, SSDs address the highest performance needs and also can potentially achieve significant savings in operational costs, although the current acquisition cost per GB is higher than for HDDs. To satisfy most workload characteristics, SSDs need to be used
efficiently in conjunction with HDDs in a well-balanced tiered storage architecture. It is critical
to choose the right mix of storage tiers and the right data placement to achieve optimal
storage performance and economics across all tiers at a low cost.
With the DS8000, you can easily implement tiered storage environments that use SSD,
Enterprise, and Nearline class storage tiers. Different storage tiers can be isolated to
separate extent pools and volume placement can be managed manually across extent pools
where required. Or, better and highly encouraged, volume placement can be managed
automatically on a subvolume level in hybrid extent pools by IBM System Storage Easy Tier
automatic mode with minimum management effort for the storage administrator. Easy Tier is
a no-cost feature on DS8800 and DS8700 systems (however, a license must be ordered). For
more information about Easy Tier, see 1.3.4, “Easy Tier” on page 10.
Although Easy Tier manual mode also helps you manage storage tiers across extent pools on
a volume level by providing essential features such as dynamic volume relocation/volume
migration, dynamic extent pool merge, and rank depopulation, consider Easy Tier automatic
mode and hybrid extent pools for managing tiered storage on the DS8000. The overall
management and performance monitoring effort increases considerably when manually
managing storage capacity and storage performance needs across multiple storage classes
and does not achieve the efficiency as provided with Easy Tier automatic mode data
relocation on subvolume level (extent level). With Easy Tier, client configurations show less
potential to waste SSD capacity than with volume-based tiering methods. Easy Tier is also easy to use: configure hybrid extent pools (mixed SSD/HDD storage pools) and turn Easy Tier on. It then provides automated data relocation across the storage tiers and ranks in the extent pool to optimize storage performance and storage economics. It also rebalances the workload across the ranks within each storage tier (auto-rebalance) based on rank utilization to minimize skew and hot spots. Furthermore, it constantly adapts to changing workload conditions.
In environments with homogeneous system configurations or isolated storage tiers that are
bound to different homogeneous extent pools, you can benefit from Easy Tier automatic
mode. With R6.2 of the DS8000 LMC, Easy Tier provides automatic intra-tier performance
management by rebalancing the workload across ranks (auto-rebalance) in homogeneous
single-tier pools based on rank utilization. Easy Tier automatically minimizes skew and rank
hot spots and helps to reduce the overall management effort for the storage administrator.
Depending on the particular storage requirements in your environment, with the DS8000
architecture, you can address a vast range of storage needs combined with ease of
management. On a single DS8000 system, you can perform these tasks:
Isolate workloads to selected extent pools (or down to selected ranks and DAs)
Share resources of other extent pools with different workloads
Use Easy Tier to automatically manage multi-tier extent pools with different storage tiers
(or homogeneous extent pools)
Adapt your logical configuration easily and dynamically at any time to changing
performance or capacity needs by migrating volumes across extent pools, merging extent
pools, or removing ranks from one extent pool (rank depopulation) and moving them to
another pool
Easy Tier helps you to consolidate more workloads onto a single DS8000 system by
automating storage performance and storage economics management across up to three
drive tiers. In addition, I/O Priority Manager, as described in 1.3.5, “I/O Priority Manager” on
page 17, can help you align workloads to quality of service (QoS) levels to prioritize separate
system workloads that compete for the same shared and possibly constrained storage
resources to meet their performance goals.
For many initial installations, an approach with two extent pools (with or without different
storage tiers) and enabled Easy Tier automatic management might be the simplest way to
start if you have FB or CKD storage only; otherwise, four extent pools are required. You can
plan for more extent pools based on your specific environment and storage needs, for
example, workload isolation for some pools, different resource sharing pools for different
departments or clients, or specific Copy Services considerations.
And, you can take advantage of the new DS8000 features offered by Easy Tier and I/O
Priority Manager. Both features pursue different goals and can be combined.
The DS8000 I/O Priority Manager provides a significant benefit for resource-sharing
workloads. It aligns QoS levels to separate workloads that compete for the same shared and
possibly constrained storage resources. I/O Priority Manager can prioritize access to these
system resources to achieve the desired QoS for the volume based on predefined
performance goals (high, medium, or low). I/O Priority Manager constantly monitors and
balances system resources to help applications meet their performance targets automatically,
without operator intervention. I/O Priority Manager only acts if resource contention is
detected.
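As a hedged sketch with a hypothetical volume ID, and assuming that your DSCLI level supports the -perfgrp parameter, a volume can be assigned to a different I/O Priority Manager performance group (and therefore QoS target) with the chfbvol command:
dscli> chfbvol -perfgrp PG1 2100
The performance group of a volume is visible as the perfgrp attribute in the showfbvol output (PG0 is the default).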
Isolation provides guaranteed availability of the hardware resources that are dedicated to the
isolated workload. It removes contention with other applications for those resources.
However, isolation limits the isolated workload to a subset of the total DS8000 hardware so
that its maximum potential performance might be reduced. Unless an application has an
entire DS8000 Storage Image dedicated to its use, there is potential for contention with other
applications for any hardware (such as cache and processor resources) that is not dedicated.
One preferred practice for isolation is to identify lower priority workloads with heavy I/O demands and to separate them from all of the more important workloads. You might be able
to isolate multiple lower priority workloads with heavy I/O demands to a single set of hardware
resources and still meet their lower service-level requirements, particularly if their peak I/O
demands are at different times. In addition, I/O Priority Manager can help to prioritize different
workloads, if required.
You can partition the DS8000 disk capacity for isolation at several levels:
Rank level: Certain ranks are dedicated to a workload. That is, volumes for one workload
are allocated on these ranks. The ranks can be a different disk type (capacity or speed), a
different RAID array type (RAID 5, RAID 6, or RAID 10, arrays with spares or arrays
without spares), or a different storage type (CKD or FB) than the disk types, RAID array
types, or storage types that are used by other workloads. Workloads that require different
storage types dictate rank, extent pool, and address group isolation. You might consider
workloads with heavy random activity for rank isolation, for example.
Extent pool level: Extent pools are logical constructs that represent a group of ranks
serviced by storage server 0 or storage server 1. You can isolate different workloads to
different extent pools, but you always need to be aware of the rank and DA pair
associations. While physical isolation on rank and DA level involves building appropriate
extent pools with a selected set of ranks or ranks from a specific DA pair, different extent
pools with a subset of ranks from different DA pairs typically share DAs. As long as the
workloads are purely disk bound and limited by the capability of the disk spindles rather
than the DA pair, it can be considered as an isolation level for the set of ranks and disk
spindles set aside in the pool. In combination with Easy Tier on a system with multiple extent pools of different storage tier combinations, which likely share DAs across the pools, you might also consider different extent pools as isolation levels for workloads with different tier combinations (1-tier, 2-tier, and 3-tier pools) and Easy Tier management modes (managed or non-managed). Be aware that workloads isolated to different extent pools might still share a DA as a physical resource, which can be a
potential limiting physical resource under certain extreme conditions. For example, a
condition might be high IOPS rates in combination with SSD ranks or high-bandwidth
utilizations in combination with sequential workloads.
DA level: All ranks on one or more DA pairs are dedicated to a workload. That is, only
volumes for this workload are allocated on the ranks that are associated with one or more
DAs. These ranks can be a different disk type (capacity or speed), RAID array type
(RAID 5, RAID 6, or RAID 10, arrays with spares or arrays without spares), or storage type
(CKD or FB) than the disk types, RAID types, or storage types that are used by other
workloads. You must consider workloads with heavy, large blocksize, and sequential
activity for DA-level isolation, because these workloads tend to consume all of the
available DA resources.
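At the rank and DA level, isolation is ultimately implemented by assigning only the selected ranks to dedicated extent pools. The following DSCLI sketch assumes that ranks R8, R9, R10, and R11 on one DA pair are reserved for an isolated workload; all pool and rank IDs are examples only:

# one dedicated pool per rank group so that the isolated workload uses both processor complexes
mkextpool -rankgrp 0 -stgtype fb isolated_pool_0
mkextpool -rankgrp 1 -stgtype fb isolated_pool_1
# assign only the ranks that are reserved for the isolated workload
chrank -extpool P4 R8
chrank -extpool P4 R10
chrank -extpool P5 R9
chrank -extpool P5 R11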
The DS8000 host connection subsetting for isolation can also be done at several levels:
I/O port level: Certain DS8000 I/O ports are dedicated to a workload. This subsetting is
common. Workloads that require Fibre Channel connection (FICON) and Fibre Channel
Protocol (FCP) must be isolated at the I/O port level, because each I/O port on the 4-port
or 8-port FCP/FICON-capable HA card can be configured to support only one of these
protocols. Although Open Systems host servers and remote mirroring links use the same
protocol (FCP), they are typically isolated to different I/O ports. You must also consider
workloads with heavy large block sequential activity for HA isolation, because they tend to
consume all of the I/O port resources that are available to them.
HA level: Certain HAs are dedicated to a workload. FICON and FCP workloads do not
necessarily require HA isolation, because separate I/O ports on the same 4-port/8-port
FCP/FICON-capable HA card can be configured to support each protocol (FICON or
FCP). However, it can be considered a preferred practice to separate FCP and FICON to
different HBAs if available. Furthermore, host connection requirements might dictate a
unique type of HA card (Long wave (LW) or Short wave (SW)) for a workload. Workloads
with heavy large block sequential activity must be considered for HA isolation, because
they tend to consume all of the I/O port resources that are available to them.
I/O enclosure level: Certain I/O enclosures are dedicated to a workload. This approach is
not generally necessary.
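As a sketch of I/O port isolation, the following DSCLI commands set the protocol of selected I/O ports for FICON and FCP workloads. The port IDs are examples only, and the valid topology values depend on the installed HA cards and DSCLI level:

# list the installed I/O ports with their locations and current topology settings
lsioport -l
# dedicate two ports to FICON for the System z workload
setioport -topology ficon I0010 I0110
# dedicate two ports to FCP for Open Systems hosts or remote mirroring links
setioport -topology scsi-fcp I0011 I0111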
Multiple resource-sharing workloads can have logical volumes on the same ranks and can
access the same DS8000 HAs or I/O ports. Resource-sharing allows a workload to access
more DS8000 hardware than can be dedicated to the workload, which provides greater
potential performance, but this hardware sharing can result in resource contention between
applications that impacts overall performance at times.
Easy Tier extent pools typically are shared by multiple workloads, because Easy Tier with its
automatic data relocation and performance optimization across multiple storage tiers
provides the most benefit for mixed workloads.
To better understand the resource-sharing principle for workloads on disk arrays, see 3.3.2,
“Extent pool considerations” on page 78.
You must allocate the DS8000 hardware resources to either an isolated workload or multiple
resource-sharing workloads in a balanced manner. That is, you must allocate either an
isolated workload or resource-sharing workloads to the DS8000 ranks that are assigned to
DAs and both processor complexes in a balanced manner. You must allocate either type of
workload to I/O ports that are spread across HAs and I/O enclosures in a balanced manner.
You must distribute volumes and host connections for either an isolated workload or a
resource-sharing workload in a balanced manner across all DS8000 hardware resources that
are allocated to that workload.
You must create volumes as evenly as possible across all ranks and DAs allocated to those
workloads.
One exception to the recommendation of spreading volumes might be when specific files or
datasets are never accessed simultaneously, such as multiple log files for the same
application where only one log file is in use at a time. In that case, you can place the volumes
required by these datasets or files on the same resources.
You must also configure host connections as evenly as possible across the I/O ports, HAs,
and I/O enclosures that are available to either an isolated or a resource-sharing workload.
Then, you can use host server multipathing software to optimize performance over multiple
host connections. For more information about multipathing software, see Chapter 8, “Host
attachment” on page 311.
Additionally, you must identify any workload that is so critical that its performance can never
be allowed to be negatively impacted by other workloads.
Next, define a balanced set of hardware resources that can be dedicated to any isolated
workloads, if required. Then, allocate the remaining DS8000 hardware for sharing among the
resource-sharing workloads. Carefully consider the appropriate resources and storage tiers
for Easy Tier and multi-tier extent pools in a balanced manner. Also, plan ahead for
appropriate I/O Priority Manager alignments of QoS levels to resource-sharing workloads
where needed.
The next step is planning extent pools and assigning volumes and host connections to all
workloads in a way that is balanced and spread - either across all dedicated resources (for
any isolated workload) or across all shared resources (for the multiple resource-sharing
workloads). For spreading workloads evenly across a set of ranks, consider multi-rank extent
pools that use storage pool striping, which we introduce in “Rotate extents (storage pool
striping) extent allocation method” on page 68 and discuss also in 4.8, “Planning extent pools”
on page 115.
Without an explicit need for workload isolation or any other requirement for multiple extent
pools, the simplest configuration is to start with two extent pools (with or without different
storage tiers) and a balanced distribution of the ranks and DAs, using resource-sharing
throughout the whole DS8000 system and Easy Tier automatic management, provided that
you have only FB or only CKD storage. Otherwise, four extent pools are required for a
reasonable minimum configuration, two for FB storage and two for CKD storage, with each
pair distributed across both DS8000 storage servers, as sketched in the example that follows.
In addition, you can plan to align your workloads to expected QoS levels with I/O Priority
Manager.
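The following DSCLI sketch shows such a minimum balanced configuration with both FB and CKD storage. The pool names and rank IDs are examples only; the ranks of each DA pair are split between the even and odd pools:

# two FB pools and two CKD pools, one of each per rank group (storage server)
mkextpool -rankgrp 0 -stgtype fb fb_pool_0
mkextpool -rankgrp 1 -stgtype fb fb_pool_1
mkextpool -rankgrp 0 -stgtype ckd ckd_pool_0
mkextpool -rankgrp 1 -stgtype ckd ckd_pool_1
# assign half of the ranks of each DA pair to the even pool and half to the odd pool
chrank -extpool P0 R0
chrank -extpool P1 R1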
The final step is the implementation of host-level striping (when appropriate) and multipathing
software, if desired. If you planned for Easy Tier, do not consider host-level striping, because
it dilutes the workload skew and is counterproductive to the Easy Tier optimization.
For example, the ratio of SSD capacity to HDD capacity in a hybrid pool depends on the
workload characteristics and skew. Ideally, there must be enough SSD capacity to hold the
hot (most frequently accessed) part of the workload.
The Storage Tier Advisor Tool (STAT) also can provide guidance for capacity planning of the
available storage tiers based on the existing workloads on a DS8000 system with Easy Tier
monitoring enabled. For additional information, see 6.7, “Storage Tier Advisor Tool” on
page 231.
You must also consider organizational and business considerations in determining which
workloads to isolate. Workload priority (the importance of a workload to the business) is a key
consideration. Application administrators typically request dedicated resources for high
priority workloads. For example, certain database online transaction processing (OLTP)
workloads might require dedicated resources to guarantee service levels.
The most important consideration is preventing lower priority workloads with heavy I/O
requirements from impacting higher priority workloads. Lower priority workloads with heavy
random activity need to be evaluated for rank isolation. Lower priority workloads with heavy,
large blocksize, sequential activity must be evaluated for DA and I/O port isolation.
Workloads that require different disk drive types (capacity and speed), different RAID types
(RAID 5, RAID 6, or RAID 10), or different storage types (CKD or FB) dictate isolation to
different DS8000 arrays, ranks, and extent pools. For more information about the performance
implications of various RAID types, see 4.7.1, “RAID-level performance considerations” on
page 103.
Workloads that use different I/O protocols (FCP or FICON) dictate isolation to different I/O
ports. However, workloads that use the same disk drive types, RAID type, storage type, and
I/O protocol need to be evaluated for separation or isolation requirements.
Workloads with heavy, continuous I/O access patterns must be considered for isolation to
prevent them from consuming all available DS8000 hardware resources and impacting the
performance of other types of workloads. Workloads with large blocksize and sequential
activity must be considered for separation from those workloads with small blocksize and
random activity.
Isolation of only a few workloads that are known to have high I/O demands can allow all the
remaining workloads (including the high priority workloads) to share hardware resources and
achieve acceptable levels of performance. More than one workload with high I/O demands
might be able to share the isolated DS8000 resources, depending on the service level
requirements and the times of peak activity.
Certain I/O workloads, files, or datasets can have heavy and continuous I/O access patterns
and are prime candidates for isolation.
You must consider workloads for all applications for which DS8000 storage is allocated,
including current workloads to be migrated from other installed storage subsystems and new
workloads that are planned for the DS8000. Also, consider projected growth for both current
and new workloads.
For existing applications, consider historical experience first. For example, is there an
application where certain datasets or files are known to have heavy, continuous I/O access
patterns? Is there a combination of multiple workloads that might result in unacceptable
performance if their peak I/O times occur simultaneously? Consider workload importance
(workloads of critical importance and workloads of lesser importance).
For existing applications, you can also use performance monitoring tools that are available for
the existing storage subsystems and server platforms to understand current application
workload characteristics:
Read/write ratio
Random/sequential ratio
Average transfer size (blocksize)
Peak workload (I/Os per second for random access and MB per second for sequential
access)
Peak workload periods (time of day and time of month)
Copy Services requirements (Point-in-Time Copy and Remote Mirroring)
Host connection utilization and throughput (FCP host connections and FICON)
Remote mirroring link utilization and throughput
Estimate the requirements for new application workloads and for current application workload
growth. You can obtain information about general workload characteristics in Chapter 5,
“Understanding your workload” on page 157.
As new applications are rolled out and current applications grow, you must monitor
performance and adjust projections and allocations. You can obtain more information in
Appendix A, “Performance management process” on page 631 and in Chapter 7, “Practical
performance management” on page 235.
You can use the Disk Magic modeling tool to model the current or projected workload and
estimate the required DS8000 hardware resources. We introduce Disk Magic in 6.1, “Disk
Magic” on page 176.
The Storage Tier Advisor Tool can also provide workload information and capacity planning
recommendations associated with a specific workload, which can help you reconsider the need
for isolation and the appropriate distribution of capacity across the storage tiers.
Choose the DS8000 resources to dedicate in a balanced manner. If ranks are planned for
workloads in multiples of two, half of the ranks can later be assigned to extent pools managed
by processor complex 0, and the other ranks can be assigned to extent pools managed by
processor complex 1. You must also note the DAs to be used. If I/O ports are allocated in
multiples of four, they can later be spread evenly across all I/O enclosures in a DS8000 frame
if four or more HA cards are installed. If I/O ports are allocated in multiples of two, they can
later be spread evenly across left and right I/O enclosures.
Review the DS8000 resources to share for balance. If ranks are planned for resource-sharing
workloads in multiples of two, half of the ranks can later be assigned to processor complex 0
extent pools, and the other ranks can be assigned to processor complex 1 extent pools. You
must also identify the DAs to use. If I/O ports are allocated for resource-sharing workloads in
multiples of four, they can later be spread evenly across all I/O enclosures in a DS8000 frame
if four or more HA cards are installed. If I/O ports are allocated in multiples of two, they can
later be spread evenly across left and right I/O enclosures.
Host connection: In this chapter, we use host connection in a general sense to represent
a connection between a host server (either z/OS or Open Systems) and the DS8000.
After the spreading plan is complete, use the DS8000 hardware resources that are identified
in the plan as input to order the DS8000 hardware.
However, there are host server performance considerations related to the number and size of
volumes. For example, for System z servers, the number of Parallel Access Volumes (PAVs)
that are needed can vary with volume size. For more information about PAVs, see 14.2,
“Parallel Access Volumes” on page 500. For System i servers, we suggest that you use a
volume size that is half the size of the disk drives used. There also can be Open Systems
host server or multipathing software considerations related to the number or the size of
volumes, so you must consider these factors in addition to workload requirements.
There are significant performance implications with the assignment of logical volumes to
ranks and DAs. The goal of the entire logical configuration planning process is to ensure that
volumes for each workload are on ranks and DAs that allow all workloads to meet
performance objectives.
Follow these steps for spreading volumes across allocated hardware for each isolated
workload, and then for each workload in a group of resource-sharing workloads:
1. Review the required number and the size of the logical volumes that are identified during
the workload analysis.
2. Review the number of ranks allocated to the workload (or group of resource-sharing
workloads) and the associated DA pairs.
3. Evaluate the use of multi-rank or multi-tier extent pools. Use rotate extents (storage pool
striping) as the EAM in homogeneous extent pools to spread workloads evenly across the
ranks in the pool, as shown in the example that follows.
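A minimal sketch of this step, assuming two homogeneous FB extent pools P0 and P1 and example volume IDs and sizes: half of the workload volumes are created in each pool with the rotate extents EAM so that the extents of every volume are striped across all ranks of its pool. Because rotate extents is the default EAM, the -eam parameter is shown only for clarity:

# volumes in LSS 10 from the rank group 0 pool
mkfbvol -extpool P0 -cap 100 -eam rotateexts -name prod_#h 1000-1003
# volumes in LSS 11 from the rank group 1 pool
mkfbvol -extpool P1 -cap 100 -eam rotateexts -name prod_#h 1100-1103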
There are significant performance implications from the assignment of host connections to
I/O ports, HAs, and I/O enclosures. The goal of the entire logical configuration planning
process is to ensure that host connections for each workload access I/O ports and HAs that
allow all workloads to meet the performance objectives.
Follow these steps for spreading host connections across allocated hardware for each
isolated workload, and then for each workload in a group of resource-sharing workloads:
1. Review the required number and type (SW, LW, FCP, or FICON) of host connections that
is identified in the workload analysis. You must use a minimum of two host connections to
different DS8000 HA cards to ensure availability. Some Open Systems hosts might impose
limits on the number of paths and volumes, for example, VMware ESX. In such cases,
consider not exceeding four paths per volume, which in general is a good approach for
performance and availability for other operating systems as well. The DS8800 front-end host
ports are 8 Gbps capable, so if the expected workload does not saturate the adapter and port
bandwidth with high sequential loads, you can share ports among many hosts.
2. Review the HAs that are allocated to the workload (or group of resource-sharing
workloads) and the associated I/O enclosures.
During the DS8000 installation, array sites are dynamically created and assigned to DA pairs.
Array site IDs (Sx) do not have any fixed or predetermined relationship to disk drive physical
locations or to the disk enclosure installation order. The relationship between array site IDs
and physical disk locations or DA assignments can differ between the DS8000s, even on the
DS8000s with the same number and type of disk drives.
After the DS8000 hardware is installed, you can use the output of the DS8000 command-line
interface (DSCLI) lsarraysite command to display and document array site information,
including disk drive type and DA pair. You must check the disk drive type and DA pair for each
array site to ensure that arrays, ranks, and ultimately volumes created from the array site are
created on the DS8000 hardware resources required for the isolated or resource-sharing
workloads.
The result of this step is the addition of specific array site IDs to the plan of workload
assignment to ranks.
Now, we look at an example of a DS8800 configuration to review the available disk drive
types, DA pairs, and array sites and to describe several isolation and spreading
considerations.
Figure 4-4 DS8800 configuration example with disk and I/O enclosures (schematic of DA adapter pairs 0 to 7 across the I/O enclosures)
Drives: In the schematic, the first two racks on the left (base frame and first expansion
frame) contain the I/O enclosures. Each I/O enclosure can contain up to two DAs, for a total
of eight DAs per rack and 16 DAs in a fully expanded configuration. The disk enclosures for
SSDs are also allocated in the first two racks on the left, but they are spread across all DAs.
All other positions are for HDDs only.
The individual hardware configuration always needs to match your configuration and isolation
plans first. In this example, you probably have SSD (if supplied) and HDD drives on each DA
pair. When isolating an important workload to a dedicated DA pair with SSD and HDD ranks,
you might use Easy Tier and multi-tier pools for the isolated workload (which can also be a set
of individual workloads that share these resources and are isolated from the other workloads
on the system). If only a subset of the ranks from a DA pair is set aside in this example for an
isolated workload, be aware that the remaining ranks on that DA pair still share the DA
adapters with the isolated workload.
Remember that the sequence of steps when creating the arrays and ranks finally determines
the numbering scheme of array IDs and rank IDs, because these IDs are chosen
automatically by the system during creation. The logical configuration does not depend on a
specific ID numbering scheme, but a specific ID numbering scheme might help you plan the
configuration and manage performance more easily.
Storage servers: Array sites, arrays, and ranks do not have a fixed or predetermined
relationship to any DS8000 processor complex (storage server) before they are finally
assigned to an extent pool and thus a rank group (rank group 0/1 is managed by processor
complex 0/1).
RAID 5 is one of the most commonly used levels of RAID protection, because it optimizes
cost-effective performance while emphasizing usable capacity through data striping. It
provides fault tolerance if one disk drive fails by using XOR parity for redundancy. Hot spots
within an array are avoided by distributing data and parity information across all of the drives
in the array. The capacity of one drive in the RAID array is lost for holding the parity
information. RAID 5 provides a good balance of performance and usable storage capacity.
RAID 6 provides a higher level of fault tolerance against disk failures than RAID 5, but it also
provides less usable capacity, because the capacity of two drives in the array is set aside
to hold the parity information. As with RAID 5, hot spots within an array are avoided by
distributing data and parity information across all of the drives in the array. Still, RAID 6 offers
more usable capacity than RAID 10 while providing an efficient method of data protection
against double disk errors, such as two drive failures, two coincident medium errors, or a drive
failure and a medium error during a rebuild. Because the likelihood of media errors increases
with the capacity of the physical disk drives, consider the use of RAID 6 with large-capacity
disk drives and higher data availability requirements, for example, where rebuilding the array
after a drive failure takes a long time. RAID 6 can also be used with smaller serial-attached
SCSI (SAS) drives when the primary concern is a higher level of data protection than is
provided by RAID 5.
RAID 10 optimizes high performance while maintaining fault tolerance for disk drive failures.
The data is striped across several disks, and the first set of disk drives is mirrored to an
identical set. RAID 10 can tolerate at least one, and in most cases, multiple disk failures as
long as the primary copy and the secondary copy of a mirrored disk pair do not fail at the
same time.
In addition to the considerations for data protection and capacity requirements, the question
typically arises about which RAID level performs better, RAID 5, RAID 6, or RAID 10. As with
most complex issues, the answer is that it depends. There are a number of workload
attributes that influence the relative performance of RAID 5, RAID 6, or RAID 10, including
the use of cache, the relative mix of read as compared to write operations, and whether data
is referenced randomly or sequentially.
Regarding read I/O operations, either random or sequential, there is generally no difference
between RAID 5, RAID 6, and RAID 10. When a DS8000 subsystem receives a read request
from a host system, it first checks whether the requested data is already in cache. If the data
is in cache (that is, a read cache hit), there is no need to read the data from disk, and the
RAID level on the arrays does not matter. For reads that must be satisfied from disk (that is,
the array or the back end), the performance of RAID 5, RAID 6, and RAID 10 is roughly equal,
because the requests are spread evenly across all disks in the array. In RAID 5 and RAID 6
arrays, data is striped across all disks, so I/Os are spread across all disks. In RAID 10, data is
striped and mirrored across two sets of disks, so half of the reads are processed by one set of
disks, and half of the reads are processed by the other set, reducing the utilization of
individual disks.
Regarding random write I/O operations, the different RAID levels vary considerably in their
performance characteristics. With RAID 10, each small block write operation at the disk back
end initiates two disk operations, one to each copy of the mirrored pair, whereas a random
small block write to a RAID 5 array typically requires four back-end disk operations and a
write to a RAID 6 array requires six (the RAID 5 and RAID 6 write penalty).
On modern disk systems, such as the DS8000, write operations are generally cached by the
storage subsystem and thus handled asynchronously with short write response times for the
attached host systems. So, any RAID 5 or RAID 6 write penalties are generally shielded from
the attached host systems in disk response time. Typically, a write request that is sent to the
DS8000 subsystem is written into storage server cache and persistent cache, and the I/O
operation is then acknowledged immediately to the host system as completed. As long as
there is room in these cache areas, the response time seen by the application is only the time
to get data into the cache, and it does not matter whether RAID 5, RAID 6, or RAID 10 is
used. However, if the host systems send data to the cache areas faster than the storage
server can destage the data to the arrays (that is, move it from cache to the physical disks),
the cache can occasionally fill up with no space for the next write request, and therefore, the
storage server signals the host system to retry the I/O write operation. In the time that it takes
the host system to retry the I/O write operation, the storage server likely can destage part of
the data, which provides free space in the cache and allows the I/O operation to complete on
the retry attempt.
When random small block write data is destaged from cache to disk, RAID 5 and RAID 6
arrays can experience a severe write penalty with four or six required back-end disk
operations. RAID 10 always requires only two disk operations per small block write request.
Because RAID 10 performs only half the disk operations of RAID 5, for random writes, a
RAID 10 destage completes faster and reduces the busy time of the disk subsystem. So, with
steady and heavy random write workloads, the back-end write operations to the ranks (the
physical disk drives) can become a limiting factor, so that only a RAID 10 configuration
(instead of additional RAID 5 or RAID 6 arrays) provides enough back-end disk performance
at the rank level to meet the workload performance requirements.
While RAID 10 clearly outperforms RAID 5 and RAID 6 in small-block random write
operations, RAID 5 and also RAID 6 show excellent performance in sequential write I/O
operations. With sequential write requests, all of the blocks required for the RAID 5 parity
calculation can be accumulated in cache, and thus the destage operation with parity
calculation can be performed as a full stripe write without the need for additional disk operations
to the array. So, with only one additional parity block for a full stripe write (for example, seven
data blocks plus one parity block for a 7+P RAID 5 array), RAID 5 requires fewer disk operations
at the back end than a RAID 10, which always requires twice the write operations due to data
mirroring. RAID 6 also benefits from sequential write patterns with most of the data blocks,
which are required for the double parity calculation, staying in cache and thus reducing the
amount of additional disk operations to the back end considerably. For sequential writes, a
RAID 5 destage completes faster and reduces the busy time of the disk subsystem.
Comparing RAID 5 to RAID 6, small block random read performance and sequential read
performance are roughly equal. Due to the higher write penalty, the RAID 6 small block
random write performance is noticeably lower than with RAID 5. Also, the maximum
sequential write throughput of a RAID 6 array is somewhat lower than that of a RAID 5 array.
RAID 10 is not as commonly used as RAID 5 for two key reasons. First, RAID 10 requires
more raw disk capacity for every GB of effective capacity. Second, when you consider a
standard workload with a typically high number of read operations and only a few write
operations, RAID 5 generally offers the best trade-off between overall performance and
usable capacity. In many cases, RAID 5 write performance is adequate, because disk
systems tend to operate at I/O rates below their maximum throughputs, and differences
between RAID 5 and RAID 10 are primarily observed at maximum throughput levels.
Consider RAID 10 for critical workloads with a high percentage of steady random write
requests, which can easily become rank limited. RAID 10 provides almost twice the random
write throughput of RAID 5 (because of the RAID 5 write penalty). The trade-off for better performance
with RAID 10 is about 40% less usable disk capacity. Larger drives can be used with RAID 10
to get the random write performance benefit while maintaining about the same usable
capacity as a RAID 5 array with the same number of disks.
Table 4-1 shows a short overview of the advantages and disadvantages for the RAID level
reliability, space efficiency, and random write performance.
Table 4-1 RAID-level comparison of reliability, space efficiency, and write penalty

RAID level   Reliability (number of erasures)   Space efficiency                      Write penalty (number of disk operations)
RAID 5       1                                  (N-1)/N, for example 87.5% for 7+P    4
RAID 6       2                                  (N-2)/N, for example 75% for 6+P+Q    6
RAID 10      1 (one drive per mirrored pair)    50%                                   2
In general, workloads that effectively use storage subsystem cache for reads and writes see
little difference between RAID 5 and RAID 10 configurations. For workloads that perform
better with RAID 5, the difference in RAID 5 performance over RAID 10 is typically small.
However, for workloads that perform better with RAID 10, the difference in RAID 10
performance over RAID 5 performance or RAID 6 performance can be significant.
RAID 5 tends to have a slight performance advantage for sequential writes. RAID 10
performs better for random writes. RAID 10 is generally considered to be the RAID type of
choice for business-critical workloads with many random write requests (typically more than
35% writes) and low response time requirements.
For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed
time, although RAID 5 and RAID 6 require more disk operations and therefore are more likely
to affect other disk activity on the same disk array.
You can select RAID types for each array site. So, you can select the RAID type based on the
specific performance requirements of the data for that site. The best way to compare the
performance of a specific workload that uses RAID 5, RAID 6, or RAID 10 is to run a Disk
Magic model. For additional information about the capabilities of this tool, see 6.1, “Disk
Magic” on page 176.
For workload planning purposes, it might be convenient to have a general idea of the I/O
performance that a single RAID array can provide. Figure 4-5 on page 108 and Figure 4-6 on
page 109 show measurement results1 for a single RAID array built from eight 3.5 inch large
form factor (LFF) 146 GB 15k FC disk drives when configured as RAID 5, RAID 6, or
RAID 10.
1 The measurements were done with IOmeter (http://www.iometer.org) on Microsoft Windows Server 2003
utilizing the entire available capacity on the array for I/O requests. The performance data that is contained here was
obtained in a controlled, isolated environment. Actual results that might be obtained in other operating environments
can vary. There is no guarantee that the same or similar results can be obtained elsewhere.
Figure 4-5 Single rank RAID-level comparison for a 4 KB random workload with 100% reads (response time in ms versus IOPS)
These numbers are not DS8000 specific, because they represent the limits that you can
expect from a simple set of eight physical disks that form a RAID array. As outlined in 2.4.5,
“SAS 2.5 inch drives compared to FC 3.5 inch drives” on page 39, you can expect slightly
higher maximum I/O rates with 2.5 inch small form factor (SFF) 146 GB 15k SAS drives as
used in DS8800 systems.
For small block random read workloads, there is no significant performance difference
between RAID 5, RAID 6, and RAID 10, as seen in Figure 4-5. Without considering read
cache hits, 1800 read IOPS with high back-end response times above 30 ms mark the upper
limit of the capabilities of a single array for random read access by using all of the available
capacity of that array.
Small-block random writes, however, reveal the performance differences between the various
RAID levels. For a typical 70:30 random small block workload with 70% reads (no read
cache hits) and 30% writes, as shown in Figure 4-6 on page 109, the different performance
characteristics between the RAID levels become evident. With an increasing number of
random writes, RAID 10 clearly outperforms RAID 5 and RAID 6. For a standard random
small block 70:30 workload, 1500 IOPS mark the upper limit of a RAID 10 array, 1100 IOPS
for a RAID 5 array, and 900 IOPS for a RAID 6 array.
In both Figure 4-5 and Figure 4-6 on page 109, no read cache hits are considered.
Furthermore, the I/O requests are spread across the entire available capacity of each RAID
array. So, depending on the read cache hit ratio of a workload and the capacity used on the
array (using less capacity on an array means reducing disk arm movements and reducing
average access times), you can expect typically lower overall response times and higher I/O
rates. Also, the read:write ratio, as well as the access pattern of a particular workload, either
random or sequential, determine the achievable performance of a rank. Figure 4-5 and
Figure 4-6 on page 109 (examples of small block random I/O requests) help show the
performance capabilities of a single rank for different RAID levels.
Figure 4-6 Single rank RAID-level comparison for a 4 KB random workload with 70% reads and 30% writes (response time in ms versus IOPS)
In addition to the RAID level and the actual workload pattern (read:write ratio, sequential or
random access), the limits of the maximum I/O rate per rank also depend on the
type of disk drives used. As a mechanical device, each disk drive can process a limited
number of random I/O operations per second, depending on the drive characteristics. So the
number of disk drives used for a specific amount of storage capacity finally determines the
achievable random IOPS performance. The 15k drives offer approximately 30% more random
IOPS performance than 10k drives. Generally, for random IOPS planning calculations, you
can use 160 IOPS per 15k FC drive and 120 IOPS per 10k FC drive. At these levels of disk
utilization, you might already see elevated response times. So, for excellent response time
expectations, consider lower IOPS limits. Slower-spinning, large-capacity nearline disk drives
offer a considerably lower maximum random access I/O rate per drive (approximately half of a
15k drive). Therefore, they are only intended for environments with fixed content, data
archival, reference data, or for near-line applications that require large amounts of data at low
cost.
On the DS8000, four spare drives (the minimum) are required in a fully populated DA pair, so
certain arrays contain spares:
RAID 5: 6+P+S
RAID 6: 5+P+Q+S
RAID 10: 3x2+2S
This requirement leads to arrays with different storage capacities and performance
characteristics although they were created from array sites with identical disk types and RAID
levels. The spares are assigned during array creation. Typically, the first arrays created from
an unconfigured set of array sites on a certain DA pair contain spare drives until the minimum
requirements, as outlined in 3.1.5, “Spare creation” on page 56, are met.
Regarding the distribution of the spare drives, you might need to plan the sequence of array
creation carefully if a mixture of RAID 5, RAID 6, and RAID 10 arrays are required on the
same DA pair. Otherwise, you might not meet your initial capacity requirements and end up
with more spare drives on the system than required, wasting storage capacity. For example, if
you plan for two RAID 10 (3x2+2S) arrays on a certain DA pair with homogeneous array sites,
create these arrays first, because these arrays already reserve two spare drives per array, so
that the final RAID 5 or RAID 6 arrays do not contain spares. Or, if you prefer to obtain
RAID 10 (4x2) arrays without spare drives, you can instead create four RAID 5 or RAID 6
arrays, which then contain the required number of spare drives before creating the RAID 10
arrays.
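A DSCLI sketch of this sequence, assuming homogeneous array sites S1 through S4 on the same DA pair (the array site IDs are examples, and the system decides the actual spare placement):

# create the RAID 10 arrays first so that they take the required spare drives (3x2+2S)
mkarray -raidtype 10 -arsite S1
mkarray -raidtype 10 -arsite S2
# the remaining RAID 5 arrays are then created without spares (7+P)
mkarray -raidtype 5 -arsite S3
mkarray -raidtype 5 -arsite S4
# confirm the resulting spare distribution before creating ranks
lsarray -l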
To spread the available storage capacity and thus the overall workload evenly across both
DS8000 processor complexes, you must assign an equal number of arrays that contain
spares to processor complex 0 (rank group 0) and processor complex 1 (rank group 1).
Performance can differ between RAID arrays that contain spare drives and RAID arrays
without spare drives, because the arrays without spare drives offer more storage capacity and
also provide more active disk spindles for processing I/O operations.
Important: You must confirm spare allocation after array creation and adjust the logical
configuration plan before creating ranks and assigning them to extent pools.
When creating arrays from array sites, it might help to order the array IDs by DA pair, array
size (that is, arrays with or without spares), RAID level, or disk type, depending on the
available hardware resources and workload planning considerations.
You can take the mapping of the array sites to particular DA pairs from the output of the
DSCLI lsarraysite command. Array sites are numbered starting with S1, S2, and S3 by the
DS8000 microcode. Arrays are numbered with system-generated IDs starting with A0, A1,
and A2 in the sequence that they are created.
The following examples provide a basic understanding of how the sequence in which RAID
arrays are created from array sites determines the sequence of array IDs. A certain sequence of
array IDs can be convenient for different logical configuration approaches. Two approaches
are described, one for sorting the arrays by DA pair and one for sorting the arrays by capacity
and cycling through the available DA pairs in a round-robin fashion on a homogeneously
equipped DS8000 system. However, the sequence of array IDs and rank IDs is not important
for the logical configuration and typically depends on your hardware configuration. Especially
with different disk drives and RAID levels on the system and the use of Easy Tier and
multi-tier extent pools, the sequence is less important and can be automatic with the DS GUI
or any way that is convenient for the administrator. The following examples are related to a
partially configured DS8700 system with typically a multiple of eight ranks per DA pair. A
DS8800 system typically shows a multiple of six ranks per DA pair in partially configured
systems.
Example 4-2 Array ID sequence sorted by array size and cycling through all available DA pairs
dscli> lsarray -l
Array State Data RAIDtype arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0 Unassigned Normal 6 (5+P+Q+S) S1 - 2 146.0 ENT unsupported
A1 Unassigned Normal 6 (5+P+Q+S) S9 - 6 146.0 ENT unsupported
A2 Unassigned Normal 6 (5+P+Q+S) S17 - 7 146.0 ENT unsupported
A3 Unassigned Normal 6 (5+P+Q+S) S2 - 2 146.0 ENT unsupported
A4 Unassigned Normal 6 (5+P+Q+S) S10 - 6 146.0 ENT unsupported
A5 Unassigned Normal 6 (5+P+Q+S) S18 - 7 146.0 ENT unsupported
A6 Unassigned Normal 6 (5+P+Q+S) S3 - 2 146.0 ENT unsupported
A7 Unassigned Normal 6 (5+P+Q+S) S11 - 6 146.0 ENT unsupported
A8 Unassigned Normal 6 (5+P+Q+S) S19 - 7 146.0 ENT unsupported
A9 Unassigned Normal 6 (5+P+Q+S) S4 - 2 146.0 ENT unsupported
A10 Unassigned Normal 6 (5+P+Q+S) S12 - 6 146.0 ENT unsupported
A11 Unassigned Normal 6 (5+P+Q+S) S20 - 7 146.0 ENT unsupported
A12 Unassigned Normal 6 (6+P+Q) S5 - 2 146.0 ENT unsupported
A13 Unassigned Normal 6 (6+P+Q) S13 - 6 146.0 ENT unsupported
A14 Unassigned Normal 6 (6+P+Q) S21 - 7 146.0 ENT unsupported
A15 Unassigned Normal 6 (6+P+Q) S6 - 2 146.0 ENT unsupported
A16 Unassigned Normal 6 (6+P+Q) S14 - 6 146.0 ENT unsupported
A17 Unassigned Normal 6 (6+P+Q) S22 - 7 146.0 ENT unsupported
A18 Unassigned Normal 6 (6+P+Q) S7 - 2 146.0 ENT unsupported
A19 Unassigned Normal 6 (6+P+Q) S15 - 6 146.0 ENT unsupported
A20 Unassigned Normal 6 (6+P+Q) S23 - 7 146.0 ENT unsupported
A21 Unassigned Normal 6 (6+P+Q) S8 - 2 146.0 ENT unsupported
A22 Unassigned Normal 6 (6+P+Q) S16 - 6 146.0 ENT unsupported
A23 Unassigned Normal 6 (6+P+Q) S24 - 7 146.0 ENT unsupported
Depending on the installed hardware resources in the DS8000 storage subsystem, you might
have different numbers of DA pairs and different numbers of arrays per DA pair. Also, you
might not be able to strictly follow your initial array ID numbering scheme anymore when
upgrading storage capacity by adding array sites to the storage unit later.
The association between ranks, arrays, array sites, and DA pairs can be taken from the output
of the DSCLI command lsarray -l, as shown in Example 4-3, for a DS8700 system with a
homogeneous disk configuration.
Example 4-3 Rank, array, array site, and DA pair association as shown by the lsarray -l command on DS8700
dscli> lsarray -l
Array State Data RAIDtype arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0 Assigned Normal 6 (5+P+Q+S) S1 R0 2 146.0 ENT unsupported
A1 Assigned Normal 6 (5+P+Q+S) S9 R1 6 146.0 ENT unsupported
A2 Assigned Normal 6 (5+P+Q+S) S17 R2 7 146.0 ENT unsupported
Example 4-4 is another example of lsarray -l output, this time on a DS8800 system
with different disk types, RAID levels, and storage classes (SSD, ENT, and NL). You can see
that the rank ID sequence does not necessarily follow the array ID sequence. There are
different numbers of arrays per DA pair, for example, six SFF HDD arrays on DA3 and DA1,
six SFF and three LFF HDD arrays on DA2, and two SSD arrays on DA0.
Example 4-4 Rank, array, array site, and DA pair association as shown by the lsarray -l command on DS8800
dscli> lsarray -l
Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B) diskclass encrypt
==========================================================================================
A0 Assigned Normal 5 (6+P+S) S1 R0 0 300.0 SSD unsupported
A1 Assigned Normal 5 (6+P+S) S2 R1 0 300.0 SSD unsupported
A2 Assigned Normal 5 (6+P+S) S3 R17 2 600.0 ENT unsupported
A3 Assigned Normal 5 (6+P+S) S4 R18 2 600.0 ENT unsupported
A4 Assigned Normal 5 (6+P+S) S5 R26 2 600.0 ENT unsupported
A5 Assigned Normal 5 (6+P+S) S6 R20 2 600.0 ENT unsupported
A6 Assigned Normal 5 (7+P) S7 R2 1 146.0 ENT unsupported
A7 Assigned Normal 5 (6+P+S) S8 R3 1 146.0 ENT unsupported
A8 Assigned Normal 5 (6+P+S) S9 R6 1 146.0 ENT unsupported
A9 Assigned Normal 5 (6+P+S) S10 R7 1 146.0 ENT unsupported
A10 Assigned Normal 5 (7+P) S11 R8 1 146.0 ENT unsupported
A11 Assigned Normal 5 (6+P+S) S12 R9 1 146.0 ENT unsupported
A12 Assigned Normal 5 (7+P) S13 R5 2 600.0 ENT unsupported
A13 Assigned Normal 5 (7+P) S14 R16 2 600.0 ENT unsupported
A14 Assigned Normal 6 (5+P+Q+S) S15 R21 2 3000.0 NL unsupported
A15 Assigned Normal 6 (5+P+Q+S) S16 R23 2 3000.0 NL unsupported
A16 Assigned Normal 6 (5+P+Q+S) S17 R22 2 3000.0 NL unsupported
A18 Assigned Normal 5 (7+P) S19 R10 3 146.0 ENT unsupported
A19 Assigned Normal 5 (6+P+S) S20 R11 3 146.0 ENT unsupported
A20 Assigned Normal 5 (6+P+S) S21 R12 3 146.0 ENT unsupported
A21 Assigned Normal 5 (7+P) S22 R13 3 146.0 ENT unsupported
A22 Assigned Normal 5 (6+P+S) S23 R14 3 146.0 ENT unsupported
A23 Assigned Normal 5 (6+P+S) S24 R15 3 146.0 ENT unsupported
Unassigned ranks do not have a fixed or predetermined relationship to any DS8000 processor
complex. Each rank can be assigned to any extent pool or any rank group. Only when
assigning a rank to an extent pool and thus rank group 0 or rank group 1 does the rank
become associated with processor complex 0 or processor complex 1. Ranks from rank
group 0 (even-numbered extent pools: P0, P2, and P4, for example) are managed by
processor complex 0, and ranks from rank group 1 (odd-numbered extent pools: P1, P3, and
P5, for example) are managed by processor complex 1.
For a balanced distribution of the overall workload across both processor complexes, half of
the ranks must be assigned to rank group 0 and half of the ranks must be assigned to rank
group 1. Also, the ranks with and without spares must be spread evenly across both rank
groups. Furthermore, it is important that the ranks (from each DA pair) are distributed evenly
across both processor complexes; otherwise, you might seriously limit the available back-end
bandwidth and thus the overall throughput of the system. If, for example, all ranks of a DA pair
are assigned to only one processor complex, only one DA card of the DA pair is used to
access the set of ranks, and thus, only half of the available DA pair bandwidth is available.
This practice also is especially important with solid-state drives and small block random I/O
workloads. To be able to use the full back-end random I/O performance of two (or more) SSD
ranks within a certain DA pair, the SSD I/O workload must be balanced across both DAs of
the DA pair. This balance can be achieved by assigning half of the SSD ranks of each DA pair
to even extent pools (P0, P2, and P4, managed by storage server 0 or rank group 0) and half
of the SSD ranks to odd extent pools (P1, P3, and P5, managed by storage server 1 or rank
group 1). If, for example, all SSD ranks of the same DA pair are assigned to the same rank
group (for example, rank group 0 with extent pool P0 and P2), only one DA card of the DA pair
is used to access this set of SSD ranks, providing only half of the available I/O processing
capability of the DA pair and severely limiting the overall SSD performance.
Example 4-5 Rank, array, and capacity information as shown by the lsrank -l command on DS8700
dscli> lsrank -l
ID Group State datastate array RAIDtype extpoolID extpoolnam stgtype exts usedexts encryptgrp
=======================================================================================================
R0 - Unassigned Normal A0 6 - - fb 634 - -
R1 - Unassigned Normal A1 6 - - fb 634 - -
R2 - Unassigned Normal A2 6 - - fb 634 - -
R3 - Unassigned Normal A3 6 - - fb 634 - -
R4 - Unassigned Normal A4 6 - - fb 634 - -
R5 - Unassigned Normal A5 6 - - fb 634 - -
R6 - Unassigned Normal A6 6 - - fb 634 - -
R7 - Unassigned Normal A7 6 - - fb 634 - -
R8 - Unassigned Normal A8 6 - - fb 634 - -
R9 - Unassigned Normal A9 6 - - fb 634 - -
R10 - Unassigned Normal A10 6 - - fb 634 - -
R11 - Unassigned Normal A11 6 - - fb 634 - -
R12 - Unassigned Normal A12 6 - - fb 763 - -
R13 - Unassigned Normal A13 6 - - fb 763 - -
R14 - Unassigned Normal A14 6 - - fb 763 - -
R15 - Unassigned Normal A15 6 - - fb 763 - -
R16 - Unassigned Normal A16 6 - - fb 763 - -
R17 - Unassigned Normal A17 6 - - fb 763 - -
R18 - Unassigned Normal A18 6 - - fb 763 - -
R19 - Unassigned Normal A19 6 - - fb 763 - -
Extent pools are automatically numbered with system-generated IDs starting with P0, P1, and
P2 in the sequence in which they are created. Extent pools that are created for rank group 0
are managed by processor complex 0 and have even-numbered IDs (P0, P2, and P4, for
example). Extent pools that are created for rank group 1 are managed by processor complex
1 and have odd-numbered IDs (P1, P3, and P5, for example). Only in a failure condition or
during a concurrent code load is the ownership of a certain rank group temporarily moved to
the alternate processor complex.
To achieve a uniform storage system I/O performance and avoid single resources that
become bottlenecks (called “hot spots”), it is preferable to distribute volumes and workloads
evenly across all of the ranks (disk spindles) and DA pairs that are dedicated to a workload by
creating appropriate extent pool configurations.
The assignment of the ranks to extent pools together with an appropriate concept for the
logical configuration and volume layout is the most essential step to optimize overall storage
system performance. A rank can be assigned to any extent pool or rank group. Each rank
provides a particular number of storage extents of a certain storage type (either FB or CKD)
to an extent pool. An extent pool finally aggregates the extents from the assigned ranks and
provides the logical storage capacity for the creation of logical volumes for the attached host
systems. For more information about effective capacity and extents, see 8.5.2, “Disk capacity”
in IBM System Storage DS8000: Architecture and Implementation, SG24-8886.
When an appropriate DS8000 hardware base is selected for the planned workloads (that is,
isolated and resource-sharing workloads), the next goal is to provide a logical configuration
concept. This concept should largely guarantee a balanced workload distribution across all
available hardware resources within the storage subsystem at any time, from the beginning,
when only part of the available storage capacity is used, to the end, when almost all of the
capacity of the storage system is allocated.
On the DS8000, we can configure homogeneous single-tier extent pools, with ranks of the
same storage class, and hybrid multi-tier extent pools with ranks from different storage
classes. The EAMs, such as rotate extents or storage pool striping, provide easy to use
capacity-based methods of spreading the workload data across the ranks in an extent pool.
Furthermore, the use of Easy Tier automatic mode to automatically manage and maintain an
optimal workload distribution across these resources over time provides excellent workload
spreading with the best performance at a minimum administrative effort.
In addition, the various extent pool configurations (homogeneous or hybrid pools, managed or
not managed by Easy Tier) can further be combined with the DS8000 I/O Priority Manager to
prioritize workloads that are sharing resources to meet QoS goals in cases when resource
contention might occur.
In the following sections, we present concepts for the configuration of single-tier and multi-tier
extent pools to spread the workloads evenly across the available hardware resources.
Figure 4-7 compares four such configurations. Starting on the left of the figure, we show a
DS8000 configuration (red color) with multiple homogeneous
extent pools of different storage tiers with the Easy Tier automatic mode not enabled (or
without the Easy Tier feature). With dedicated storage tiers bound to individual extent pools,
the suitable extent pool for each volume must be chosen manually based on workload
requirements. Furthermore, the performance and workload in each extent pool must be
closely monitored and managed down to the rank level to adapt to workload changes over
time. This monitoring increases overall management effort considerably. Depending on
workload changes or application needs, workloads and volumes must be migrated from one
highly utilized extent pool to another less utilized extent pool or from a lower storage tier to a
higher storage tier. Easy Tier manual mode can help to easily and dynamically migrate a
volume from one extent pool to another. However, data placement across tiers must be
managed manually and occurs on the volume level only, which might, for example, waste
costly SSD capacity. Typically, only a part of the capacity of a specific volume is hot and best
suited for SSD placement. The workload can become imbalanced across the ranks within an
extent pool and limit the overall performance even with storage pool striping due to natural
workload skew. Workload spreading within a pool is only based on spreading the volume
capacity evenly across the ranks, not taking any data access patterns or performance
statistics into account. After adding new capacity to an existing extent pool, you must restripe
the volume data within an extent pool manually, for example, by using manual volume
rebalance, to maintain a balanced workload distribution across all ranks in a specific pool.
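As a hedged sketch, manual volume rebalance is typically initiated by migrating a volume into the extent pool that it already occupies, which redistributes its extents across all ranks of that pool. The volume and pool IDs are examples, and the available actions depend on the DSCLI and microcode level:

# redistribute the extents of volume 1000 across all ranks of its current pool P0
managefbvol -action migstart -extpool P0 1000
# display the volume details to check the migration status
showfbvol 1000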
As shown in the second configuration (orange color), we can easily optimize the first
configuration and already reduce management efforts considerably by enabling Easy Tier
automatic mode. Thus, we automate intra-tier performance management (auto-rebalance) in
these homogeneous single-tier extent pools. Easy Tier controls the workload spreading within
each extent pool and automatically relocates data across the ranks based on rank utilization
to minimize skew and avoid rank hot spots. Performance management is shifted from rank to
extent pool level with correct data placement across tiers and extent pools at the volume level.
Furthermore, when adding new capacity to an existing extent pool, Easy Tier automatic mode
automatically takes advantage of the capacity and performance capabilities of the new ranks
without the need for manual interaction.
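A minimal sketch of enabling this behavior with the DSCLI, assuming a placeholder storage image ID; the exact parameter values depend on the installed Easy Tier license and code level:

# enable Easy Tier monitoring and automatic management for all extent pools
chsi -etmonitor all -etautomode all IBM.2107-75XXXXX
# verify the current Easy Tier settings
showsi IBM.2107-75XXXXX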
We can further reduce management effort by merging extent pools and building different
combinations of 2-tier hybrid extent pools, as shown in the third configuration (blue color). We
introduce an automatically managed tiered storage architecture but still isolate, for example,
our high performance production workload from our development/test environment. We
introduce an ENT/SSD pool for our high performance and high priority production workload,
efficiently boosting ENT performance with SSDs and automate storage tiering from
enterprise-class drives to SSDs by using Easy Tier automated cross-tier data relocation and
storage performance optimization at the subvolume level. We also create an ENT/NL pool for
our development/test environment or other enterprise-class applications to maintain
enterprise-class performance while shrinking the footprint and reducing costs, combining
enterprise-class drives with large-capacity nearline drives under Easy Tier automated
cross-tier data relocation and storage economics optimization.
The minimum management effort combined with the highest amount of automated storage
optimization can be achieved by creating 3-tier hybrid extent pools and by using Easy Tier
automatic mode across all three tiers, as shown in the fourth configuration (green color). We
use the most efficient way of automated data relocation to the appropriate storage tier with
automatic storage performance and storage economics optimization on the subvolume level
at a minimum administrative effort. We use automated cross-tier management across three
storage tiers and automated intra-tier management within each storage tier in each extent
pool.
Figure 4-7 Ease of storage management versus automatic storage optimization by Easy Tier (from 1-tier homogeneous pools with Easy Tier automatic mode off to 3-tier hybrid pools under full automatic management)
All of these configurations can be combined with I/O Priority Manager to prioritize workloads
when sharing the same resources and provide QoS levels in case resource contention
occurs.
With the Easy Tier feature enabled, you can also take full advantage of the Easy Tier manual
mode features, such as dynamic volume relocation (volume migration), dynamic extent pool
merge, and rank depopulation to dynamically modify your logical configuration. When
merging extent pools with different storage tiers, you can gradually introduce more automatic
storage management with Easy Tier at any time. Or with rank depopulation, you can reduce
multi-tier pools and automated cross-tier management according to your needs.
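For example, dynamic volume relocation transparently moves a volume and its data to another extent pool. The following sketch assumes volume 1200 and a multi-tier target pool P4; both IDs are examples only:

# migrate volume 1200 into the Easy Tier managed multi-tier pool P4
managefbvol -action migstart -extpool P4 1200
# cancel a migration that has not yet completed, if required
managefbvol -action migcancel 1200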
Our examples are generic. A single DS8000 system with its tremendous scalability can
manage many applications effectively and efficiently, so multiple extent pool configurations
typically exist on a large system to cover the various needs for isolated and resource-sharing
workloads. Easy Tier and I/O Priority Manager considerably simplify management with
single-tier and multi-tier extent pools and help to spread workloads easily across shared
hardware resources with the best performance, automatically adapting to changing
workload conditions. You can choose from various extent pool configurations for your
resource isolation and resource sharing workloads, combined with Easy Tier and I/O Priority
Manager.
Single-tier extent pools consist of one or more ranks that can be referred to as single-rank or
multi-rank extent pools.
One reason for single-rank extent pools in the past was the simple one-to-one mapping
between ranks and extent pools. Because a volume is always created from a single extent
pool, with single-rank extent pools, you can precisely control the volume placement across
selected ranks and thus manually manage the I/O performance of the different workloads at
the rank level. Furthermore, you can quickly obtain the relationship of a volume to its extent
pool using the output of the DSCLI lsfbvol or lsckdvol command. Thus, with single-rank
extent pools, there is a direct relationship between volumes and ranks based on the extent
pool of the volume. This relationship helps you manage and analyze performance down to
rank level more easily, especially with host-based tools, such as Resource Measurement
Facility™ (RMF™) on System z in combination with a hardware-related assignment of
LSS/LCU IDs. However, the administrative effort increases considerably, because you must
create the volumes for a specific workload in multiple steps from each extent pool separately
when distributing the workload across multiple ranks.
With single-rank extent pools, you choose a configuration design that limits the capabilities of
a created volume to the capabilities of a single rank for capacity and performance. A single
volume cannot exceed the capacity or the I/O performance provided by a single rank. So, for
demanding workloads, you need to create multiple volumes from enough ranks from different
extent pools and use host-level-based striping techniques, such as volume manager striping,
to spread the workload evenly across the ranks dedicated to a specific workload. You are also
likely to waste storage capacity if unused extents remain on ranks in different extent pools.
Furthermore, you benefit less from the advanced DS8000 virtualization features, such as
dynamic volume expansion (DVE), FlashCopy SE, storage pool striping, Easy Tier automatic
performance management, and workload spreading, which use the capabilities of multiple
ranks within a single extent pool.
Single-rank extent pools are typically selected for environments where isolation or management
of volumes at the rank level is desirable, such as in z/OS environments. They are also selected
for configurations that use storage appliances, such as the SAN Volume Controller, where the
selected RAID arrays are provided to the appliance as simple back-end storage capacity and
where the advanced virtualization features of the DS8000 are not required or not wanted, to
avoid multiple layers of data striping. However, the popular approach today is to use
homogeneous multi-rank extent pools with storage pool striping, which minimizes the storage
administrative effort by shifting performance management from the rank level to the extent pool
level and letting the DS8000 maintain a balanced data distribution across the ranks within a
specific pool. This approach provides excellent performance in relation to the reduced
management effort.
Also, you do not need to strictly use only single-rank extent pools or only multi-rank extent
pools on a storage system. You can base your decision on individual considerations for each
workload group that is assigned to a set of ranks and thus extent pools. The decision to use
single-rank and multi-rank extent pools depends on the logical configuration concept that is
chosen for the distribution of the identified workloads or workload groups for isolation and
resource-sharing.
In general, single-rank extent pools are not a good choice in today's complex and mixed
environments unless you know that this level of isolation and micro-performance
management is required for your specific environment. If not managed correctly, workload
skew and rank hot spots that limit overall system performance are likely to occur.
With a homogeneous multi-rank extent pool, you take advantage of the advanced DS8000
virtualization features to spread the workload evenly across the ranks in an extent pool to
achieve a well-balanced data distribution with considerably less management effort.
Performance management is shifted from the rank level to the extent pool level. An extent
pool represents a set of merged ranks (a larger set of disk spindles) with a uniform workload
distribution. So, the level of complexity for standard performance and configuration
management is reduced from managing many individual ranks (micro-performance
management) to a few multi-rank extent pools (macro-performance management).
The DS8000 EAMs, such as rotate volumes (-eam rotatevols) and rotate extents (-eam
rotateexts), take care of spreading the volumes and thus the individual workloads evenly
across the ranks within homogeneous multi-rank extent pools. Rotate extents is the default
and preferred EAM to distribute the extents of each volume successively across all ranks in a
pool to achieve a well-balanced capacity-based distribution of the workload. Rotate volumes is
an exception today, but it can help to implement a strict volume to rank relationship. It reduces
the configuration effort compared to single-rank extent pools by easily distributing a set of
volumes to different ranks in a specific extent pool for workloads where the use of host-based
striping methods is still preferred. The size of the volumes must fit the available capacity on
each rank, and the number of volumes created for this workload in a specific extent pool should
be a multiple of the number of ranks in the pool so that the volumes are distributed evenly
across all ranks.
Even multi-rank extent pools provide some level of control of volume placement across the
ranks in cases where it is necessary to manually enforce a special volume allocation scheme.
You can use the DSCLI command chrank -reserve to reserve all of the extents from a rank in
an extent pool from being used for the next creation of volumes. Alternatively, you can use the
DSCLI command chrank -release to release a rank and make the extents available again.
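For example, assume a hypothetical extent pool P0 that contains ranks R0, R1, and R2, and that the next volumes are to be placed on R2 only. A sketch of this sequence (IDs, capacities, and names are hypothetical):

   dscli> chrank -reserve R0
   dscli> chrank -reserve R1
   dscli> mkfbvol -extpool P0 -cap 100 -name iso_#h 1200-1203
   dscli> chrank -release R0
   dscli> chrank -release R1

Because all extents of R0 and R1 are reserved while the volumes are created, the new volumes are allocated from R2 only; releasing the ranks afterward makes their extents available again for subsequent volumes.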
Multi-rank extent pools that use storage pool striping are the general configuration approach
today on modern DS8800 and DS8700 systems to spread the data evenly across the ranks in
a homogeneous multi-rank extent pool and thus reduce skew and the likelihood of single rank
hot spots. Without Easy Tier automatic mode management, such non-managed,
homogeneous multi-rank extent pools consist only of ranks of the same drive type and RAID
level. Although not required (and probably not realizable for smaller or heterogeneous
configurations), you can take the effective rank capacity into account and group ranks with and
without spares into different extent pools when using storage pool striping to ensure a strictly
balanced workload distribution across all ranks up to the last extent. Otherwise, give additional
consideration to the volumes created from the last extents in a mixed homogeneous extent
pool that contains ranks with and without spares, because these volumes are probably
allocated only on the subset of ranks that have the larger capacity and no spares.
In combination with Easy Tier, a more efficient and automated way of spreading the
workloads evenly across all ranks in homogeneous multi-rank extent pool is available. The
automated intra-tier performance management (auto-rebalance) of Easy Tier on DS8800 and
DS8700 storage systems efficiently spreads the workload evenly across all ranks. It
automatically relocates the data across the ranks of the same storage class in an extent pool
based on rank utilization to achieve and maintain a balanced distribution of the workload,
minimizing skew and avoiding rank hot spots. You can enable auto-rebalance for
homogeneous extent pools by setting the Easy Tier management scope to all extent pools
(ETautomode=all).
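A brief sketch of enabling this management scope, assuming the chsi command parameters -etmonitor and -etautomode of your DSCLI level and a hypothetical storage image ID:

   dscli> chsi -etmonitor all -etautomode all IBM.2107-75XXXXX
   dscli> showsi IBM.2107-75XXXXX

The showsi output can then be used to verify the current Easy Tier monitoring and automatic mode settings of the storage image.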
In addition, Easy Tier automatic mode can also handle storage device variations within a tier
by using a micro-tiering capability. An example of storage device variations within a tier is an
intermix of ranks with different disk rpm or RAID levels within the same storage class of an
extent pool. A typical micro-tiering scenario is, for example, when, after a hardware upgrade,
new 15K rpm Enterprise disk drives intermix with existing 10K rpm Enterprise disk drives. In
these configurations, the Easy Tier automatic mode micro-tiering capability takes into account
the different performance profiles of each micro-tier and performs intra-tier (auto-rebalance)
optimizations. Easy Tier does not handle a micro-tier like an additional tier; it is still part of a
specific tier. For this reason, the hotness of an extent does not trigger any promotion or
demotion across micro-tiers of the same tier, because the extent relocation across micro-tiers
can only occur as part of the auto-rebalance feature and is based on rank utilization.
Figure 4-8 provides two configuration examples for using dedicated homogeneous extent
pools with storage classes in combination with and without Easy Tier automatic mode
management.
The example without Easy Tier automatic mode (ET automode=none, 1-tier extent pools P0 - P7
with dedicated SSD, ENT, and NL pools) has these characteristics:
- Manual cross-tier workload management on the volume level using Easy Tier manual mode
  features for volume migrations (dynamic volume relocation) across pools and tiers.
- Manual intra-tier workload management and workload spreading within extent pools using
  DS8000 extent allocation methods, such as storage pool striping, based on a balanced
  volume capacity distribution.
- Strictly homogeneous pools with ranks of the same drive characteristics and RAID level.
- Isolation of workloads across different extent pools and storage tiers on the volume level,
  limiting the most efficient use of the available storage capacity and tiers.
- Highest administration and performance management effort with constant resource
  utilization monitoring, workload balancing, and manual placement of volumes across
  ranks, extent pools, and storage tiers.
- Efficiently taking advantage of new capacity when added to existing pools typically
  requires manual restriping of volumes using manual volume rebalance.
Figure 4-8 Single-tier extent pool configuration examples and Easy Tier benefits
With multi-rank extent pools, you can fully use the features of the DS8000 virtualization
architecture and Easy Tier that provide ease of use when you manage more applications
effectively and efficiently with a single DS8000 system. Consider multi-rank extent pools and
the use of Easy Tier automatic management especially for mixed workloads that are to be
spread across multiple ranks. Multi-rank extent pools help to simplify management and
volume creation. They also allow the creation of single volumes that can span multiple ranks
and thus exceed the capacity and performance limits of a single rank.
However, for performance management and analysis reasons in balanced multi-rank extent
pools, you sometimes might want to relate volumes associated with a specific I/O workload to
ranks that provide the physical disk spindles for servicing the workload I/O requests and
determine the I/O processing capabilities. The DSCLI showfbvol -rank or showckdvol -rank
command can be used to show this relationship by displaying the extent distribution of a
volume across the ranks, as shown in Example 3-2 on page 85.
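A hedged sketch of such a query (the volume ID, rank IDs, and extent counts are hypothetical, and most of the output is omitted); the rank extents section at the end of the output shows how many extents of the volume reside on each rank:

   dscli> showfbvol -rank 1000
   ...
   ==============Rank extents==============
   rank extents
   ============
   R0   25
   R1   25
   R2   25
   R3   25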
For more information about data placement in extent pool configurations, see 3.3.2, “Extent
pool considerations” on page 78.
Important: Multi-rank extent pools offer numerous advantages with respect to ease of use,
space efficiency, and the DS8000 virtualization features. Multi-rank extent pools, in
combination with Easy Tier automatic mode, provide both ease of use and excellent
performance for standard environments with workload groups that share a set of
homogeneous resources.
A multi-tier extent pool can consist of one of the following storage class combinations with up
to three storage tiers:
SSD + Enterprise disk
SSD + nearline disk
Enterprise disk + nearline disk
SSD + Enterprise disk + nearline disk
Multi-tier extent pools are especially suited for mixed, resource sharing workloads. Tiered
storage, as described in 4.1, “Review the tiered storage concepts and Easy Tier” on page 88,
is an approach of utilizing types of storage throughout the storage infrastructure. It is a mix of
higher performing/higher cost storage with lower performing/lower cost storage and placing
data based on its specific I/O access characteristics. While SSDs can help to efficiently boost
enterprise-class performance, you can additionally shrink the footprint and reduce costs by
adding large capacity nearline drives while maintaining enterprise class performance.
Correctly balancing all the tiers eventually leads to the lowest cost and best performance
solution.
Always create hybrid extent pools for Easy Tier automatic mode management. The extent
allocation for volumes in hybrid extent pools differs from the extent allocation in homogeneous
pools. Any specified EAM, such as rotate extents or rotate volumes, is ignored when a new
volume is created in or migrated into a hybrid pool. The EAM is changed to managed as soon
as Easy Tier automatic mode is enabled for the pool, and the volume comes under the control
of Easy Tier. Easy Tier then automatically moves extents to the most appropriate storage tier
and rank in the pool based on performance aspects.
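As a small illustration (the volume ID, pool ID, and capacity are hypothetical, and the output is trimmed), any -eam value specified at creation time is overridden in a managed hybrid pool and is reported as managed afterward:

   dscli> mkfbvol -extpool P4 -cap 200 -eam rotateexts -name hyb_#h 2000
   dscli> showfbvol 2000
   ...
   eam            managed
   ...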
Workload spreading across the resources (ranks and DAs) in a managed hybrid pool is
automatic by Easy Tier by using intra-tier (auto-rebalance) and cross-tier data relocation to
optimize storage performance and storage economics based on data performance
characteristics. Easy Tier automatic mode adapts to changing workload conditions and
automatically promotes hot extents from the lower tier to the upper tier (Enterprise to SSD
and nearline to Enterprise). Or, it demotes colder extents from the higher tier to the lower tier
(swap extents from SSD with hotter extents from Enterprise tier, or demote cold extents from
Enterprise tier to Nearline tier). Easy Tier automatic mode optimizes the Nearline tier by
demoting some of the sequential workload to the Nearline tier to better balance sequential
workloads. Auto-rebalance rebalances extents across the ranks of the same tier based on
rank utilization to minimize skew and avoid hot spots. Auto-rebalance takes different device
characteristics into account when different devices or RAID levels are mixed within the same
storage tier (micro-tiering).
Regarding the requirements of your workloads, you can create a pair or multiple pairs of
extent pools with different 2-tier or 3-tier combinations that depend on your needs and
available hardware resources. You can, for example, create separate 2-tier SSD/ENT and
ENT/NL extent pools to isolate your production environment from your development
environment. You can boost the performance of your production application with SSDs and
optimize storage economics for your development applications with NL drives.
Or, you can create 3-tier extent pools for mixed, large resource-sharing workload groups and
benefit from fully automated storage performance and economics management at a minimum
management effort. You can boost the performance of your high-demand workloads with
SSDs and reduce the footprint and costs with NL drives for the lower demand data.
With Easy Tier, managing the data location of your volumes on the extent level across all
ranks in a managed hybrid extent pool, you can use the DSCLI showfbvol -rank or
showckdvol -rank command to display the current extent distribution of a volume across the
ranks, as shown in Example 3-2 on page 85. Additionally, the volume heat distribution
(volume heat map), as provided by the STAT, can help to identify the amount of hot, warm,
and cold extents for each volume and its current distribution across the storage tiers in the
pool. For more information about the STAT, see 6.7, “Storage Tier Advisor Tool” on page 231.
The ratio of SSD, ENT, and NL disk capacity in a hybrid pool depends on the workload
characteristics and skew and must be planned when ordering the drive hardware for the
identified workloads. See “Multi-tier extent pools” on page 81 and “Drive Selection with Easy
Tier” in IBM System Storage DS8800 and DS8700 Performance with Easy Tier 3rd
Generation, WP102024, for additional information.
With the Easy Tier manual mode features, such as dynamic extent pool merge, dynamic
volume relocation, and rank depopulation, you can modify existing configurations easily,
depending on your needs. You can grow from a manually managed single-tier configuration
into a partially or fully automatically managed tiered storage configuration. You add tiers or
merge appropriate extent pools and enable Easy Tier at any time. For more information about
Easy Tier, see IBM System Storage DS8000 Easy Tier, REDP-4667.
Figure 4-9 provides two configuration examples of using hybrid extent pools with Easy Tier.
One example shares all resources for all workloads with automatic management across three
tiers. The other example isolates workload groups with different optimization goals to 2-tier
extent pool combinations.
Figure 4-9 Multi-tier extent pool configuration examples and Easy Tier benefits
When bringing a new DS8800 system into production to replace an older storage system,
which often does not use Easy Tier, consider the timeline of the implementation stages by
which you migrate all servers from the old to the new storage system.
Consider a staged approach when migrating servers to a new multi-tier DS8800 system:
Assign the resources for the high-performing and response time sensitive workloads first,
then add the less performing workloads. The other way might lead to situations where all
initial resources, such as the Enterprise tier in hybrid extent pools, are allocated already by
the secondary workloads. This situation does not leave enough space on the Enterprise
tier for the primary workloads, which then must be initially on the Nearline tier.
Split your servers into several subgroups, where you migrate each subgroup one by one,
and not all at once. Then, allow Easy Tier several days to learn and optimize. Some
extents are moved to SSDs and some extents are moved to nearline. You regain space on
the Enterprise HDDs. After a server subgroup learns and reaches a steady state, the next
server subgroup can be migrated. You gradually allocate the capacity in the hybrid extent
pool by optimizing the extent distribution of each application one by one while regaining
space in the Enterprise tier (home tier) for the next applications.
The EAM determines how a volume is created within a multi-rank extent pool for the
allocation of the extents on the available ranks. The selection of the appropriate EAM is key
when manually spreading the workload evenly across all ranks in non-managed
homogeneous extent pools without Easy Tier automatic mode. The EAM can be selected at
the volume level and is not an attribute of an extent pool, except if the pool is managed by
Easy Tier and the EAM is automatically changed to managed. You can have volumes that are
created with the rotate extents and rotate volumes methods in the same extent pool. However,
so that you do not lose the benefits of the rotate extents algorithm and its uniform workload
distribution across the ranks, do not carelessly mix both extent allocation algorithms within the
same extent pool.
The rotate extents algorithm spreads the extents of a single volume and the I/O activity of
each volume across all the ranks in an extent pool. Furthermore, it ensures that the allocation
of the first extent of successive volumes starts on different ranks rotating through all available
ranks in the extent pool to optimize workload spreading across all ranks.
With the default rotate extents (rotateexts) algorithm, the extents (1 GiB for FB volumes and
1113 cylinders or approximately 0.94 GiB for CKD volumes) of each single volume are spread
across all ranks within an extent pool (provided the size of the volume in extents is at least
equal to or larger than the number of ranks in the extent pool) and thus across more disks.
This approach reduces the occurrences of I/O hot spots at the rank level within the storage
system. Storage pool striping helps to balance the overall workload evenly across the
back-end resources. It reduces the risk of single ranks that become performance bottlenecks
while providing ease of use with less administrative effort.
When using the optional rotate volumes (rotatevols) EAM, each volume, one volume after
another volume, is placed on a single rank with a successive distribution across all ranks in a
round-robin fashion.
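The EAM is selected per volume at creation time. A minimal sketch with hypothetical pool ID, capacities, names, and volume IDs; the first command stripes each volume across all ranks of the pool, and the second places each volume on a single rank in round-robin order:

   dscli> mkfbvol -extpool P0 -cap 100 -eam rotateexts -name striped_#h 1000-1003
   dscli> mkfbvol -extpool P0 -cap 100 -eam rotatevols -name perrank_#h 1010-1013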
The rotate extents and rotate volumes EAMs determine the initial data distribution of a volume
and thus the spreading of workloads in non-managed, single-tier extent pools. With Easy Tier
automatic mode enabled for single-tier (homogeneous) or multi-tier (hybrid) extent pools, this
selection becomes unimportant. The data placement and thus the workload spreading is
managed by Easy Tier. The use of Easy Tier automatic mode for single-tier extent pools is
highly encouraged for an optimal spreading of the workloads across the resources. In
single-tier extent pools, you can benefit from the Easy Tier automatic mode feature
auto-rebalance. Auto-rebalance constantly and automatically balances the workload across
ranks of the same storage tier based on rank utilization, minimizing skew and avoiding the
occurrence of single rank hot spots. It can also take device variations (different drive types) or
different RAID levels within a storage tier into account (Easy Tier micro-tiering capability).
However, there might also be valid reasons for not using storage pool striping. You might
avoid it to prevent unnecessary layers of striping and reorganizing of I/O requests, which might increase
latency and not help achieve a more evenly balanced workload distribution. Multiple
independent striping layers might be counterproductive. For example, creating a number of
volumes from a single multi-rank extent pool that uses storage pool striping and then,
additionally, using host-level or application-based striping on the same set of volumes
might compromise performance. In this case, two layers of striping are combined with no
overall performance benefit. In contrast, it is reasonable to create four volumes from four
different extent pools from both rank groups that use storage pool striping and then use
host-based or application-based striping on these four volumes to aggregate the performance
of the ranks in all four extent pools and both processor complexes.
The DS8000 storage pool striping is based on spreading extents across different ranks. So,
with extents of 1 GiB (FB) or 0.94 GiB (1113 cylinders/CKD), the size of a data chunk is rather
large. For distributing random I/O requests, which are evenly spread across the capacity of
each volume, this chunk size generally is appropriate. However, depending on the individual
access pattern of a specific application and the distribution of the I/O activity across the
volume capacity, certain applications perform better with more granular stripe sizes. In these
cases, optimize the distribution of the application I/O requests across different RAID arrays by
using host-level striping techniques with smaller stripes, or have the application manage the
workload distribution across independent volumes from different ranks.
Consider the following points for selected applications or environments to use storage pool
striping in homogeneous configurations:
DB2: Excellent opportunity to simplify storage management using storage pool striping.
You might prefer to follow the traditional DB2 striping recommendations for
performance-sensitive environments.
DB2 and similar data warehouse applications, where the database manages storage and
parallel access to data: Generally, consider independent volumes on individual ranks with
a careful volume layout strategy that does not use storage pool striping. Configure containers
or database partitions according to the suggestions of the database vendor.
Oracle: Excellent opportunity to simplify storage management for Oracle. You might prefer
to follow the traditional Oracle suggestions that involve Automatic Storage Management (ASM)
and Oracle striping capabilities for performance-sensitive environments.
Small, highly active logs or files: Small highly active files or storage areas smaller than
1 GiB with a high access density might require spreading across multiple ranks for
performance reasons. However, storage pool striping only offers a striping granularity on
extent levels around 1 GiB, which is too large in this case. Continue to use host-level
striping techniques or application-level striping techniques that support smaller stripe
sizes. For example, assume a 0.8 GiB log file exists with extreme write content, and you
want to spread this log file across several RAID arrays. Assume that you intend to spread
its activity across four ranks. At least four 1 GiB extents must be allocated, one extent on
each rank (which is the smallest possible allocation). Creating four separate volumes,
each with a 1 GiB extent from each rank, and then using Logical Volume Manager (LVM)
striping with a relatively small stripe size (for example, 16 MiB) effectively distributes the
workload across all four ranks. Creating a single LUN of four extents, which is also
distributed across the four ranks using the DS8000 storage pool striping, cannot effectively
spread the file workload evenly across all four ranks due to the large stripe size of one
extent, which is larger than the actual size of the file.
Tivoli® Storage Manager storage pools: Tivoli Storage Manager storage pools work well in
striped pools. But Tivoli Storage Manager suggests that the Tivoli Storage Manager
databases need to be allocated in a separate pool or pools.
AIX volume groups (VGs): LVM and physical partition (PP) striping continue to be powerful
tools for managing performance. In combination with storage pool striping, now
considerably fewer stripes are required for common environments. Instead of striping
across a large set of volumes from many ranks (for example, 32 volumes from 32 ranks),
striping is only required across a few volumes from a small set of different multi-rank
extent pools from both DS8000 rank groups that use storage pool striping. For example,
use four volumes from four extent pools, each with eight ranks. For specific workloads that
require a finer striping granularity than the 1 GiB extent size, host-level LVM striping with a
small strip size remains appropriate, as illustrated in the sketch that follows this list.
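For the small, highly active file case and for PP/LVM striping across a few volumes from different multi-rank extent pools, a hedged AIX sketch follows (the hdisk names, volume group and logical volume names, sizes, and mount point are hypothetical, and it assumes the mklv -S strip-size option of AIX LVM):

   # four DS8000 volumes, one from each of four extent pools (both rank groups)
   mkvg -y appvg hdisk2 hdisk3 hdisk4 hdisk5
   # striped logical volume with a 16 MiB strip size across all four volumes
   mklv -y applv -S 16M appvg 16 hdisk2 hdisk3 hdisk4 hdisk5
   # file system on the striped logical volume
   crfs -v jfs2 -d applv -m /app -A yes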
In general, storage pool striping helps to improve overall performance and reduce the effort of
performance management by evenly distributing data and thus workloads across a larger set
of ranks, reducing skew and hot spots. Certain application workloads can also benefit from
the higher number of disk spindles behind one volume. But, there are cases where host-level
striping or application-level striping might achieve a higher performance, at the cost of higher
overall administrative effort. Storage pool striping might deliver good performance in these
cases with less management effort, but manual striping with careful configuration planning is
required to achieve the best possible performance. So for overall performance and ease of
use, storage pool striping might offer an excellent compromise for many environments,
especially for larger workload groups where host-level striping techniques or application-level
striping techniques are not widely used or available.
Note: Business-critical and performance-critical applications always require careful
configuration planning and individual decisions on a case-by-case basis. Determine whether to use
storage pool striping or LUNs from dedicated ranks together with host-level striping
techniques or application-level striping techniques for the best performance.
Storage pool striping is best suited for new extent pools. Adding ranks to an existing extent
pool does not restripe volumes (LUNs) that are already allocated in the pool, unless you
manually restripe the LUNs with the managefbvol command by using the Easy Tier manual
mode feature, manual volume rebalance. For more information, see IBM System
Storage DS8000 Easy Tier, REDP-4667.
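A hedged sketch of restriping existing volumes after adding ranks to a pool, assuming the managefbvol -action migstart and -extpool parameters that are used for dynamic volume relocation (specifying the pool in which the volumes already reside triggers a manual volume rebalance); the volume range and pool ID are hypothetical:

   dscli> managefbvol -action migstart -extpool P0 1000-100F

You can then track the migration state of the affected volumes, for example, with showfbvol, until the rebalance completes.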
Always consider that Easy Tier offers more advanced options than storage pool striping to
efficiently spread the workload across the resources in an extent pool. Storage pool striping
only achieves a balanced capacity distribution of the volumes across the ranks. Easy Tier
automatically relocates the data across the ranks based on the actual workload pattern to
achieve a balanced resource utilization within a storage tier (or in a homogeneous extent
pool). Or with multi-tier extent pool configurations, Easy Tier optimizes storage performance
and economics across different storage tiers.
You also need to distribute the I/O workloads evenly across the available front-end resources:
I/O ports
HA cards
I/O enclosures
You need to distribute the I/O workloads evenly across both DS8000 processor complexes
(also called storage server 0/CEC#0 and storage server 1/CEC#1), as well.
Configuring the extent pools determines the balance of the workloads across the available
back-end resources, ranks, DA pairs, and both processor complexes.
Each extent pool is associated with an extent pool ID (P0, P1, and P2, for example). Each
rank has a relationship to a specific DA pair and can be assigned to only one extent pool. You
can have as many (non-empty) extent pools as you have ranks. Extent pools can be
expanded by adding more ranks to the pool. However, when assigning a rank to a specific
extent pool, the affinity of this rank to a specific DS8000 processor complex is determined. No
predefined affinity of ranks to a processor complex by hardware exists. All ranks that are
assigned to even-numbered extent pools (P0, P2, and P4, for example) form rank group 0
and are serviced by DS8000 processor complex 0. All ranks that are assigned to
odd-numbered extent pools (P1, P3, and P5, for example) form rank group 1 and are serviced
by DS8000 processor complex 1.
In order to spread the overall workload across both DS8000 processor complexes, a
minimum of two extent pools is required: one assigned to processor complex 0 (for example,
P0) and one assigned to processor complex 1 (for example, P1).
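A minimal sketch of this two-pool split (the pool names and rank IDs are hypothetical, and it assumes the chrank -extpool parameter for assigning ranks to extent pools):

   dscli> mkextpool -rankgrp 0 -stgtype fb fb_pool_0
   dscli> mkextpool -rankgrp 1 -stgtype fb fb_pool_1
   dscli> chrank -extpool P0 R0
   dscli> chrank -extpool P1 R1

The -rankgrp parameter determines the rank group and thus the processor complex affinity of each pool; additional ranks are then assigned alternately to P0 and P1.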
For a balanced distribution of the overall workload across both processor complexes and both
DA cards of each DA pair, apply the following rules. For each type of rank and its RAID level,
storage type (FB or CKD), and disk drive characteristics (disk type, rpm speed, and capacity),
apply these rules:
Assign half of the ranks to even-numbered extent pools (rank group 0) and assign half of
them to odd-numbered extent pools (rank group 1).
Spread ranks with and without spares evenly across both rank groups.
Distribute ranks from each DA pair evenly across both rank groups.
It is important to understand that you might seriously limit the available back-end bandwidth
and thus the system overall throughput if, for example, all ranks of a DA pair are assigned to
only one rank group and thus a single processor complex. In this case, only one DA card of
the DA pair is used to service all the ranks of this DA pair and thus only half of the available
DA pair bandwidth is available.
This practice now also becomes important with SSDs and small block random I/O workloads.
To be able to use the full back-end random I/O performance of two (or more) SSD ranks
within a specific DA pair, the SSD I/O workload also must be balanced across both DAs of the
DA pair. This balance can be achieved by assigning half of the SSD ranks of each DA pair to
even extent pools (P0, P2, and P4, for example, managed by storage server 0/rank group 0)
and half of the SSD ranks to odd extent pools (P1, P3, and P5, for example, managed by
storage server 1/rank group 1). If, for example, all SSD ranks of the same DA pair are
assigned to only one rank group, only one DA card of the DA pair services all of the SSD I/O,
and only half of the available DA pair bandwidth can be used.
A simple two-extent pool example, evenly distributing all ranks from each DA pair across both
processor complexes for a homogeneously configured DS8700 system with one workload
group sharing all resources and using storage pool striping, can be seen in Figure 4-10. The
volumes created from the last extents in the pool are only distributed across the large 7+P
ranks, because the capacity on the 6+P+S ranks is exhausted at this time. Ensure that you
use this remaining capacity only for workloads with lower performance requirements in
manually managed environments. Or, consider using Easy Tier automatic mode management
(auto-rebalance) instead.
Figure 4-10 Example of homogeneously configured DS8700 with two extent pools
Figure 4-11 Example of homogeneously configured DS8700 with four extent pools
On the left in Figure 4-11, we have four strictly homogeneous extent pools, each of which contains
only ranks of the same RAID level and capacity for the spare drives. Storage pool striping
efficiently distributes the extents in each extent pool across all ranks up to the last volume
created in these pools, providing more ease of use. However, the extent pools with 6+P+S
ranks and 7+P ranks differ in overall capacity, which might only be appropriate for two
different workload groups.
Another configuration with four extent pools is shown on the right in Figure 4-11. We evenly
distribute the 6+P+S and 7+P ranks from all DA pairs across all four extent pools to obtain the
same overall capacity in each extent pool. However, the last capacity in these pools is only
allocated on the 7+P ranks. So, ensure that you use this remaining capacity only for
workloads with lower performance requirements in manually managed environments. Or,
consider using Easy Tier automatic mode management (auto-rebalance) instead. Using four
extent pools and storage pool striping instead of two can also reduce the failure boundary
from one extent pool with 16 ranks (that is, if one rank fails, all data in the pool is lost) to two
distinct extent pools with only eight ranks per processor complex (for example, when
physically separating tablespaces from logs).
You can also consider separating workloads by using different extent pools and the principles
of workload isolation, as shown in Figure 4-12 on page 133. The isolated workload can either
use storage pool striping as the EAM or rotate volumes combined with host-level or
application-level striping. The workload isolation in this example is on the DA pair level (DA2).
In addition, we have two pairs of extent pools for resource sharing workload groups.
Furthermore, we can also consider using Easy Tier automatic management for all pools
instead of manually spreading the workload.
Figure 4-12 Example of DS8700 extent pool configuration with workload isolation on the DA pair level
Another consideration for the number of extent pools to create is the usage of Copy Services,
such as FlashCopy SE. If you use FlashCopy SE, you also might consider a minimum of four
extent pools with two extent pools per rank group or processor complex, as shown in
Figure 4-13 on page 134. The FlashCopy SE repository for the space-efficient target volumes
is distributed across all available ranks within the extent pool (comparable to using storage
pool striping). Therefore, we suggest that you distribute the source and target volumes across
different extent pools (that is, different ranks) from the same DS8000 processor complex (that
is, the same rank group) for the best FlashCopy performance. Each extent pool can have
FlashCopy source volumes, as well as repository space for space-efficient FlashCopy target
volumes from source volumes, in the alternate extent pool. However, for certain
environments, consider a dedicated set of extent pools that uses RAID 10 arrays for
FlashCopy SE target volumes while the other extent pools that use RAID 5 arrays are only
used for source volumes.
Figure 4-13 Example of DS8700 extent pool configuration for FlashCopy SE
Figure 4-14 gives an example of a balanced extent pool configuration with two extent pools on
a DS8700 system with three different storage classes using Easy Tier. All arrays from each
DA pair (especially the SSD arrays) are evenly distributed across both processor complexes
to fully use the back-end DA pair bandwidth and I/O processing capabilities. All cross-tier and
intra-tier data relocation in these pools is automatic by Easy Tier, constantly optimizing
storage performance and storage economics.
Figure 4-14 Example DS8700 configuration with two hybrid extent pools using Easy Tier
A simple two-extent pool example, evenly distributing all ranks from each DA pair across both
processor complexes for a homogeneously configured DS8800 system with one workload
group sharing all resources and using storage pool striping, can be seen in Figure 4-15. The
volumes created from the last extents in the pool are only distributed across the large 7+P
ranks, because the capacity on the 6+P+S ranks is exhausted. Be sure to use this remaining
capacity only for workloads with lower performance requirements in manually managed
environments or consider using Easy Tier automatic mode management (auto-rebalance)
instead.
Figure 4-15 Example of homogeneously configured DS8800 with two extent pools
Another example for a homogeneously configured DS8800 system with four extent pools and
one workload group, which shares all resources across four extent pools, or two isolated
workload groups that each share half of the resources can be seen in Figure 4-16 on
page 136.
Figure 4-16 Example of homogeneously configured DS8800 with four extent pools
On the left in Figure 4-16, we have four strictly homogeneous extent pools, and each one
contains only ranks of the same RAID level and capacity for the spare drives. Storage pool
striping efficiently distributes the extents in each extent pool across all ranks up to the last
volume created in these pools, providing more ease of use. However, the extent pools with
6+P+S ranks and 7+P ranks differ considerably in overall capacity and performance, which
might only be appropriate for two workload groups with different overall capacity and
performance requirements.
Another configuration with four extent pools is shown on the right in Figure 4-16. We evenly
distribute the 6+P+S and 7+P ranks from all DA pairs across all four extent pools to obtain the
same overall capacity in each extent pool. However, the last capacity in these pools is only
allocated on the 7+P ranks. So, be sure to use this remaining capacity only for workloads with
lower performance requirements in manually managed environments. Or, consider using
Easy Tier automatic mode management (auto-rebalance) instead. Using four extent pools
and storage pool striping instead of two can also reduce the failure boundary from one extent
pool with 12 ranks (that is, if one rank fails, all data in the pool is lost) to two distinct extent
pools with only six ranks per processor complex (for example, when physically separating
tablespaces from logs).
Also consider separating workloads by using different extent pools with the principles of
workload isolation, as shown in Figure 4-17 on page 137. The isolated workload can either
use storage pool striping as the EAM or rotate volumes combined with host-level or
application-level striping if desired. The workload isolation in this example is on the DA pair
level (DA2). In addition, we have one pair of extent pools for resource sharing workload
groups. Instead of manually spreading the workloads across the ranks in each pool, consider
using Easy Tier automatic mode management (auto-rebalance) for all pools.
Figure 4-17 Example of DS8800 extent pool configuration with workload isolation on DA pair level
Another consideration for the number of extent pools to create is the usage of Copy Services,
such as FlashCopy SE. If you use FlashCopy SE, you also might consider a minimum of four
extent pools with two extent pools per rank group or processor complex, as shown in
Figure 4-13 on page 134 for a DS8700 system. Because the FlashCopy SE repository for the
space-efficient target volumes is distributed across all available ranks within the extent pool
(comparable to using storage pool striping), we suggest that you distribute the source and
target volumes across different extent pools (that is, different ranks) from the same DS8000
processor complex (that is, the same rank group) for the best FlashCopy performance. Each
extent pool can have FlashCopy source volumes, as well as repository space for
space-efficient FlashCopy target volumes from source volumes, in the alternate extent pool.
However, for certain environments, consider dedicated extent pools that use RAID 10 arrays
for FlashCopy SE target volumes while the other extent pools that use RAID 5 arrays are only
used for source volumes.
Figure 4-18 on page 138 gives an example of a balanced extent pool configuration with two
extent pools on a DS8800 system with three storage classes that use Easy Tier. All arrays
from each DA pair (especially the SSD arrays) are evenly distributed across both processor
complexes to fully use the back-end DA pair bandwidth and I/O processing capabilities. There
is a difference in DA pair population with ranks from LFF disk enclosures for the 3.5 inch NL
drives. A pair of LFF disk enclosures only contains 24 disk drives (three ranks) compared to a
pair of SFF disk enclosures with 48 disk drives (six ranks). All cross-tier and intra-tier data
relocation in these pools is automatic by Easy Tier. Easy Tier constantly optimizes storage
performance and storage economics.
Figure 4-18 Example DS8800 configuration with two hybrid extent pools that use Easy Tier
Multiple homogeneous extent pools, each with different storage classes, easily allow tiered
storage concepts with dedicated extent pools and manual cross-tier management. For
example, you can have extent pools with slow, large-capacity drives for backup purposes and
other extent pools with high-speed, small capacity drives or solid-state drives (SSDs) for
performance-critical transaction applications. Or you can use hybrid pools with Easy Tier and
introduce fully automated cross-tier storage performance and economics management.
Using dedicated extent pools with an appropriate number of ranks and DA pairs for selected
workloads is a suitable approach for isolating workloads.
The minimum number of required extent pools depends on the following considerations:
The number of isolated and resource-sharing workload groups
The number of different storage types, either FB for Open Systems or System i or CKD for
System z
Although you are not restricted from assigning all ranks to only one extent pool, the minimum
number of extent pools, even with only one workload on a homogeneously configured
DS8000, needs to be two (for example, P0 and P1). You need one extent pool for each rank
group (or storage server), so that the overall workload is balanced across both processor
complexes.
To optimize performance, the ranks for each workload group (either isolated or
resource-sharing workload groups) need to be split across at least two extent pools with an
equal number of ranks from each rank group. So, also at the workload level, each workload is
balanced across both processor complexes. Typically, you need to assign an equal number of
ranks from each DA pair to extent pools assigned to processor complex 0 (rank group 0: P0,
P2, and P4, for example) and to extent pools assigned to processor complex 1 (rank group 1:
P1, P3, and P5, for example). In environments with FB and CKD storage (Open Systems and
System z), you additionally need separate extent pools for CKD and FB volumes. We advise a
minimum of four extent pools to balance the capacity and I/O workload between the two
DS8000 processor complexes. Additional extent pools might be desirable to meet individual
needs, such as ease of use, implementing tiered storage concepts, or separating ranks for
different DDM types, RAID types, clients, applications, performance, or Copy Services
requirements.
The maximum number of extent pools, however, is given by the number of available ranks
(that is, creating one extent pool for each rank).
Creating dedicated extent pools on the DS8000 with dedicated back-end resources for
separate workloads allows individual performance management for business-critical and
performance-critical applications. Compared to easier-to-manage storage systems that simply
share and spread everything and offer no way to implement workload isolation concepts, this
capability is an outstanding feature of the DS8000 as an enterprise-class storage system.
With it, you can consolidate and manage various application demands with different
performance profiles, which are typical in enterprise environments, on a single storage
system. The cost, however, is a slightly higher administrative effort.
Before configuring the extent pools, we advise that you collect all the hardware-related
information of each rank for the associated DA pair, disk type, available storage capacity,
RAID level, and storage type (CKD or FB) in a spreadsheet. Then, plan the distribution of the
workloads across the ranks and their assignments to extent pools.
As the first step, you can visualize the rank and DA pair association in a simple spreadsheet
that is based on the graphical scheme that is shown in Figure 4-19 on page 140.
Figure 4-19 Basic scheme for rank and DA pair association in relation to extent pool planning
This example represents a homogeneously configured DS8700 with four DA pairs and 32
ranks that are all configured to RAID 5 and a DS8800 with four DA pairs and 24 ranks. Based
on the specific DS8000 hardware and rank configuration, the scheme typically becomes more
complex for the number of DA pairs, ranks, different RAID levels, drive classes, spare
distribution, and storage types. The DS8700 with the LFF disk enclosures differs from the
DS8800 with the SFF disk enclosures. Also, be aware of the difference in DA pair population
on DS8800 with ranks from LFF disk enclosures for the 3.5 inch NL drives. A pair of LFF disk
enclosures only contains 24 disk drives (three ranks) compared to a pair of SFF disk
enclosures with 48 disk drives (six ranks).
Based on this scheme, you can plan an initial assignment of ranks to your planned workload
groups, either isolated or resource-sharing, and extent pools for your capacity requirements.
After this initial assignment of ranks to extent pools and appropriate workload groups, you can
create additional spreadsheets to hold more details about the logical configuration and finally
the volume layout of the array site IDs, array IDs, rank IDs, DA pair association, extent pools
IDs, and volume IDs, as well as their assignments to volume groups and host connections.
4.9 Planning address groups, LSSs, volume IDs, and CKD PAVs
After creating the extent pools and evenly distributing the back-end resources (DA pairs and
ranks) across both DS8000 processor complexes, you can create host volumes from these
extent pools. When creating the host volumes, it is important to follow a volume layout
scheme that evenly spreads the volumes of each application workload across all ranks and
extent pools that are dedicated to this workload.
So, the next step is to plan the volume layout and thus the mapping of address groups and
LSSs to volumes created from the various extent pools for the identified workloads and
workload groups. For performance management and analysis reasons, it is crucial to be able
to easily relate volumes, which are related to a specific I/O workload, to ranks, which finally
provide the physical disk spindles for servicing the workload I/O requests and determining the
I/O processing capabilities. Therefore, an overall logical configuration concept that easily
relates volumes to workloads, extent pools, and ranks is desirable.
Each volume is associated with a hexadecimal 4-digit volume ID that must be specified when
creating the volume, as shown, for example, in Table 4-2 for volume ID 1101.
Table 4-2 Understanding the volume ID relationship to address groups and LSSs/LCUs (example: volume ID 1101)
Digits              Value   Description
First               1       Address group (0 - F)
First and second    11      LSS/LCU ID
Third and fourth    01      Volume number within the LSS/LCU (00 - FF)
The first digit of the hexadecimal volume ID specifies the address group, 0 - F, of that volume.
Each address group can only be used by a single storage type, either FB or CKD. The first
and second digit together specify the logical subsystem ID (LSS ID) for Open Systems
volumes (FB) or the logical control unit ID (LCU ID) for System z volumes (CKD). There are
16 LSS/LCU IDs per address group. The third and fourth digits specify the volume number
within the LSS/LCU, 00 - FF. There are 256 volumes per LSS/LCU. The volume with volume
ID 1101 is the volume with volume number 01 of LSS 11, and it belongs to address group 1
(first digit).
The LSS/LCU ID is furthermore related to a rank group. Even LSS/LCU IDs are restricted to
volumes that are created from rank group 0 and serviced by processor complex 0. Odd
LSS/LCU IDs are restricted to volumes that are created from rank group 1 and serviced by
processor complex 1. So, the volume ID also reflects the affinity of that volume to a DS8000
processor complex. All volumes, which are created from even-numbered extent pools (P0,
P2, and P4, for example) have even LSS IDs and are managed by DS8000 processor
complex 0. All volumes that are created from odd-numbered extent pools (P1, P3, and P5, for
example) have odd LSS IDs and are managed by DS8000 processor complex 1.
In the past, for performance analysis reasons, it was useful to easily identify the association of
specific volumes to ranks or extent pools when investigating resource contention. But with the
introduction of storage pool striping (rotate extents allocation method), the use of multi-rank
extent pools is the preferred configuration approach today for most environments. And with
the availability of Easy Tier automatic mode management for single-tier extent pools
(auto-rebalance), which is based on the actual workload and rank utilization, we advise that
you use Easy Tier automatic mode management also for single-tier extent pools. This method
makes a strict hardware-related association of volume IDs with individual ranks or extent
pools less important.
The common approach that is still valid today with Easy Tier and storage pool striping is to
relate an LSS/LCU to a specific application workload with a meaningful numbering scheme
for the volume IDs for the distribution across the extent pools. Each LSS can have 256
volumes, with volume numbers ranging from 00 - FF. So, relating the LSS/LCU to a certain
application workload and additionally reserving a specific range of volume numbers for
different extent pools is a reasonable choice especially in Open Systems environments.
Because volume IDs are transparent to the attached host systems, this approach helps the
administrator of the host system to easily determine the relationship of volumes to extent
pools by the volume ID. Therefore, this approach helps you to easily identify physically
independent volumes from different extent pools when setting up host-level striping across
pools. This approach helps you when separating, for example, DB tablespaces from DB logs
onto volumes from physically different drives in different pools.
This approach not only provides a logical configuration concept that provides ease of use for
storage management operations, but it also reduces management efforts when using the
DS8000 related Copy Services, because basic Copy Services management steps (such as
establishing Peer-to-Peer Remote Copy (PPRC) paths and consistency groups) are related to
LSSs. If Copy Services are not currently planned, plan the volume layout, because overall
management is easier if you need to introduce Copy Services in the future (for example, when
migrating to a new DS8000 storage system that uses Copy Services).
However, the actual strategy for the assignment of LSS/LCU IDs to resources and workloads
can still vary depending on the particular requirements in an environment.
The following section introduces suggestions for LSS/LCU and volume ID numbering
schemes to help to relate volume IDs to application workloads and extent pools. For other
approaches to the LSS/LCU ID planning, including hardware-bound LSS/LCU configuration
schemes down to the rank level, see DS8000 Performance Monitoring and Tuning,
SG24-7146.
Typically, when using LSS/LCU IDs that relate to application workloads, the simplest
approach is to reserve a suitable number of LSS/LCU IDs according to the total number of
volumes that are required for each application workload. Figure 4-20 shows an example of
such an application-related volume layout for two shared extent pools.
(The figure shows volumes with even LSS IDs 10, 12, 28, and 2a created in extent pool P0
and volumes with odd LSS IDs 11, 13, 29, and 2b created in extent pool P1, assigned to
hosts A1 and A2 of application A, host B of application B, and host C of application C.)
Figure 4-20 Application-related volume layout example for two shared extent pools
In Figure 4-21 on page 144, we spread our workloads across four extent pools. Again, we
assign two LSS/LCU IDs (one even, one odd) to each workload to spread the I/O activity
evenly across both processor complexes (both rank groups). Additionally, we reserve a
certain volume ID range for each extent pool based on the third digit of the volume ID. With
this approach, you can quickly create volumes with successive volume IDs for a specific
workload per extent pool with a single DSCLI mkfbvol/mkckdvol command.
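For example (the capacity and volume names are hypothetical), the volume ID ranges that are reserved per workload and extent pool in this scheme can each be created with a single command:

   dscli> mkfbvol -extpool P0 -cap 50 -name appA_#h 1000-100F
   dscli> mkfbvol -extpool P2 -cap 50 -name appA_#h 1010-101F
   dscli> mkfbvol -extpool P1 -cap 50 -name appA_#h 1100-110F
   dscli> mkfbvol -extpool P3 -cap 50 -name appA_#h 1110-111F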
Hosts A1 and A2 belong to the same application A and are assigned to LSS 10 and LSS 11.
For this workload, we use volume IDs 1000 - 100f in extent pool P0 and 1010 - 101f in extent
pool P2 on processor complex 0. And, we use volume IDs 1100 - 110f in extent pool P1 and
1110 - 111f in extent pool P3 on processor complex 1. In this case, the administrator on the
host system can identify the extent pool of a volume directly from the third digit of the
volume ID.
Figure 4-21 Application and extent pool-related volume layout example for four shared extent pools
In the example that is depicted in Figure 4-22 on page 145, we provide a numbering scheme
that can be used in a FlashCopy or FlashCopy SE scenario. Two different pairs of LSS are
used for source and target volumes. The address group identifies the role in the FlashCopy
relationship: address group 1 is assigned to source volumes, and address group 2 is used for
target volumes. This numbering scheme allows a symmetrical distribution of the FlashCopy
relationships across source and target LSSs. For instance, source volume 1007 in P0 uses
the volume 2007 in P2 as the FlashCopy target. We use the third digit of the volume ID within
an LSS as a marker to indicate that source volumes 1007 and 1017 are from different extent
pools. The same approach applies to the target volumes, for example, volumes 2007 and
2017 are from different pools. However for the simplicity of the Copy Services management,
we chose a different extent pool numbering scheme for source and target volumes (so 1007
and 2007 are not from the same pool) to implement the recommended extent pool selection
of source and target volumes in accordance with the FlashCopy guidelines. Source and target
volumes must stay on the same rank group but different ranks or extent pools; see
“Distribution of the workload: Location of source and target volumes” on page 589.
Figure 4-22 Application and extent pool-related volume layout example in a FlashCopy scenario
For high availability, each host system must use a multipathing device driver, such as
Subsystem Device Driver (SDD). Each host system must have a minimum of two host
connections to HA cards in different I/O enclosures on the DS8800. Preferably, they are
evenly distributed between left side (even-numbered) I/O enclosures and right side
(odd-numbered) I/O enclosures. The number of host connections per host system is primarily
determined by the required bandwidth. Use an appropriate number of HA cards to satisfy high
throughput demands.
For DS8800, 4-port and 8-port HA card options are available. Each port can auto-negotiate a
2, 4, or 8 Gbps link speed. However, because the maximum available bandwidth is the same
for 4-port and 8-port HA cards, the 8-port HA card provides additional connectivity and not
more performance. Furthermore, the HA card maximum available bandwidth is less than the
nominal aggregate bandwidth and depends on the workload profile. These specifications
must be considered when planning the HA card port allocation and especially with workloads
with high sequential throughputs. Be sure to contact your IBM representative or IBM Business
Partner for an appropriate sizing, depending on your actual workload requirements. However,
with typical transaction-driven workloads that show high numbers of random, small blocksize
I/O operations, all ports in an HA card can be used alike. For the best performance of
workloads with high sequential throughput, spread the host connections across several HA
cards and I/O enclosures rather than concentrating them on the ports of a single HA card.
The preferred practice is to use dedicated I/O ports for Copy Services paths and host
connections. For more information about performance aspects related to Copy Services, see
18.3.1, “Metro Mirror configuration considerations” on page 597.
To assign FB volumes to the attached Open Systems hosts by using LUN masking, these
volumes need to be grouped in the DS8000 volume groups. A volume group can be assigned
to multiple host connections, and each host connection is specified by the worldwide port
name (WWPN) of the host FC port. A set of host connections from the same host system is
called a host attachment. The same volume group can be assigned to multiple host
connections; however, a host connection can only be associated with one volume group. To
share volumes between multiple host systems, the most convenient way is to create a
separate volume group for each host system and assign the shared volumes to each of the
individual volume groups as required. A single volume can be assigned to multiple volume
groups. Only if a group of host systems shares the same set of volumes, and there is no need
to assign additional non-shared volumes independently to particular hosts of this group, can
you consider using a single shared volume group for all host systems to simplify
management. Typically, there are no significant DS8000 performance implications due to the
number of DS8000 volume groups or the assignment of host attachments and volumes to the
DS8000 volume groups.
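As a brief sketch of how such a grouping might be defined with the DSCLI (the volume IDs, volume group name and ID, WWPN, and host connection name below are made-up examples, and the exact parameters can vary by code level), the sequence could look as follows:

dscli> mkvolgrp -type scsimask -volume 1000-100F AIX1_vg       # create a volume group containing FB volumes 1000-100F
dscli> chvolgrp -action add -volume 1100-1103 V1               # add shared volumes to the volume group (ID V1)
dscli> mkhostconnect -wwname 10000000C9AABB01 -hosttype pSeries -volgrp V1 AIX1_fcs0   # host attachment for one host FC port, assigned to volume group V1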
Do not omit additional host attachment and host system considerations, such as SAN zoning,
multipathing software, and host-level striping. For additional information, see Chapter 8, “Host
attachment” on page 311, Chapter 9, “Performance considerations for UNIX servers” on
page 327, Chapter 12, “Performance considerations for Linux” on page 445, and Chapter 14,
“Performance considerations for System z servers” on page 499.
After the DS8000 is installed, you can use the DSCLI lsioport command to display and
document I/O port information, including the I/O ports, HA type, I/O enclosure location, and
WWPN. Use this information to add specific I/O port IDs, the required protocol (FICON or
FCP), and the DS8000 I/O port WWPNs to the plan of host and remote mirroring connections
that is identified in 4.4, “Planning allocation of disk and host connection capacity” on page 98.
Additionally, the I/O port IDs might be required as input to the DS8000 host definitions if host
connections need to be restricted to specific DS8000 I/O ports by using the -ioport option of
the mkhostconnect DSCLI command. If host connections are configured to allow access to all
DS8000 I/O ports, which is the default, typically, the paths must be restricted by SAN zoning.
The I/O port WWPNs are required as input for SAN zoning. The lshostconnect -login
DSCLI command might help to verify the final allocation of host attachments to the DS8000
I/O ports, because it lists host port WWPNs that are logged in, sorted by the DS8000 I/O port
IDs for known connections. The lshostconnect -unknown DSCLI command might further help
to identify host port WWPNs, which are not yet configured to host connections, when creating
host attachments by using the mkhostconnect DSCLI command.
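The following short sketch illustrates this use of the commands (the WWPN, I/O port IDs, and host connection name are examples only):

dscli> lsioport -l                       # document I/O port IDs, topology, location, and WWPNs for SAN zoning
dscli> lshostconnect -login              # host port WWPNs that are logged in, sorted by DS8000 I/O port ID
dscli> lshostconnect -unknown            # logged-in WWPNs that are not yet configured to host connections
dscli> mkhostconnect -wwname 10000000C9AABB02 -hosttype pSeries -volgrp V1 -ioport I0000,I0130 AIX1_fcs1   # restrict this host connection to specific I/O ports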
The DS8000 I/O ports use predetermined, fixed DS8000 logical port IDs in the form I0xyz, where:
x = I/O enclosure
y = slot number within the I/O enclosure
z = port number within the HA card
Slot numbers: The slot numbers for logical I/O port IDs are one less than the physical
location numbers for HA cards, as shown on the physical labels and in Tivoli Storage
Productivity Center for Disk, for example, I0101 is R1-XI2-C1-T2.
A simplified example of spreading the DS8000 I/O ports evenly to two redundant SAN fabrics
is shown in Figure 4-23. The SAN implementations can vary, depending on individual
requirements, workload considerations for isolation and resource-sharing, and available
hardware resources.
Figure 4-23 Example of spreading DS8000 I/O ports evenly across two redundant SAN fabrics
Consider a DS8800 single frame with four fully populated I/O enclosures. A schematic of the
I/O enclosures is shown in Figure 4-24. It shows that each of the four base frame I/O
enclosures (0 - 3) contains two 8-port FCP/FICON-capable HAs.
(Schematic: I/O enclosures 0 and 1 each contain two shortwave (SW) HAs; I/O enclosures 2 and 3 each contain one SW and one longwave (LW) HA.)
Figure 4-24 DS8800 example: I/O enclosures, HAs, and I/O ports
I/O topology configurations: As seen in Example 4-6, the default I/O topology
configurations are FICON for longwave (LW) HAs and Fibre Channel Arbitrated Loop
(FC-AL) for shortwave (SW) HAs.
By assuming that the FICON attachments use LW ports (typical for FICON environments), we
can reserve the LW cards for FICON attachments only and spread all the FCP attachments
across all the SW cards. So, following the general guidelines described before, the following
list is a possible IO port assignment:
FICON connections:
– I0230
– I0232
– I0234
– I0236
– I0300
– I0302
– I0304
– I0306
FCP host connections:
– I0000
– I0002
– I0004
– I0030
– I0032
– I0100
– I0102
In this case, the FCP PPRC ports share the adapter with FCP host ports. In general, sharing
the adapter bandwidth between PPRC and the host workload might lead to performance
issues especially during the PPRC full-copy operations. To avoid any possible interference
between PPRC and host workloads, you must use dedicated HAs for PPRC (or zGM data
mover) connections. For availability reasons, we advise that you spread the PPRC connection
across at least two HAs. The drawback of isolating the PPRC connections is that, with a few
PPRC connections, this approach leads to a waste of ports. In our example, we compromised
by using only one port per adapter for PPRC connectivity. In this way, even in PPRC full copy,
the PPRC port does not affect the performance of the other host ports significantly.
When planning the paths for the host systems, ensure that each host system uses a
multipathing device driver, such as Subsystem Device Driver (SDD). Ensure that each host
system has a minimum of two host connections to two different HA cards in different I/O
enclosures on the DS8800. Preferably, they are evenly distributed between left side
(even-numbered) I/O enclosures and the right side (odd-numbered) I/O enclosures for
highest availability. Multipathing additionally optimizes workload spreading across the
available I/O ports, HA cards, and I/O enclosures.
You must tune the SAN zoning scheme to balance both the oversubscription and the
estimated total throughput for each I/O port to avoid congestion and performance bottlenecks.
After the logical configuration is planned, you can use either the DS Storage Manager or the
DSCLI to implement it on the DS8000 in the following steps:
1. Change the password for the default user (admin) for DS Storage Manager and DSCLI.
After the logical configuration is created on the DS8000, it is important to document it.
You can use the DS Storage Manager to export information in a spreadsheet format (that is,
save it as a comma-separated values (CSV) file). You can use this information together with a
planning spreadsheet to document the logical configuration. For more information and
examples, see Appendix C, “Planning and documenting your logical configuration” on
page 651.
The DSCLI provides a set of list (ls) and show commands, which can be redirected and
appended into a plain text or CSV file. A list of selected DSCLI commands, as shown in
Example 4-7, can be started as a DSCLI script (using the DSCLI command dscli -script) to
collect the logical configuration of a DS8000 Storage Image. This output can be used as a
text file or imported into a spreadsheet to document the logical configuration.
Example 4-7 only collects a minimum set of the DS8000 logical configuration information, but
it illustrates a simple DSCLI script implementation and runs quickly within a single DSCLI
command session. Depending on the environment, you can modify this script to include more
commands to provide more information, for example, about Copy Services configurations and
source/target relationships. The DSCLI script terminates with the first command that returns
an error, which, for example, can be a simple lslcu command if no LCUs are defined. You
can adjust the output of the ls commands in a DSCLI script to meet special formatting and
delimiter requirements by using appropriate options for format, delim, or header in the
specified DS8000 profile file or selected ls commands.
Example 4-7 Example of a minimum DSCLI script get_config.dscli to gather the logical configuration
> dscli -cfg profile/DEVICE.profile -script get_config.dscli > DEVICE_SN_config.out
lsarraysite -l
lsarray -l
lsrank -l
lsextpool -l
lsaddressgrp
lslss # Use only if FB volumes have been configured
#lslcu # Use only if CKD volumes and LCUs have been configured
The script in Example 4-7 on page 152 is focused on providing the relationship between
volumes, ranks, and hosts, and can easily be used on different DS8000 systems without
modification or the need to consider the particular storage image ID of the DS8000 system.
However, you can further enhance the script by adding the commands that are shown in
Example 4-8 to include hardware-specific information about the DS8000 system, which is
helpful when performing a deeper performance analysis. In this case, you need to specify the
storage unit or storage image ID correctly in the script for each DS8000 storage system.
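Example 4-8 itself is not reproduced here. As a rough sketch only (the command set in Example 4-8 can differ, and IBM.2107-75ABCD1 is a placeholder storage image ID; check the DSCLI reference for the exact syntax), hardware-oriented additions typically look like the following lines:

lssu -l                      # storage unit information
lssi -l                      # storage image information
lsda -l IBM.2107-75ABCD1     # device adapters of the specified storage image
lsddm -l IBM.2107-75ABCD1    # disk drive modules of the specified storage image
lsstgencl IBM.2107-75ABCD1   # storage enclosures of the specified storage image
lsioport -l                  # I/O ports and WWPNs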
Information in this chapter is not dedicated to the IBM System Storage DS8000. You can
apply this information generally to other disk storage systems.
Cache hit ratio
The number of requests that are serviced from cache. This number is important for read requests, because write requests always go to the cache first. If 100 out of 1000 requests are serviced from cache, the cache hit ratio is 10%. The higher this value, the lower the overall response time.
Expected seek ratio
The percentage of I/O requests for which the disk arm must move from its current location. Moving the disk arm takes much longer than waiting for the disk rotation, and if the arm does not move, a whole track or even a cylinder can be read, which generally means a large amount of data. This parameter mostly reflects how disk subsystems worked long ago; today it serves as an indicator of how random the workload is. A purely random workload shows a value close to 100%. This parameter is not applicable to solid-state drives (SSDs), which have no disk arms by design.
Expected write efficiency
The write efficiency is a number that represents how many times a block is written to cache before it is destaged to disk. Real applications, especially databases, update data repeatedly (write, read again, change, and write the changes), so the data for a single disk block can be served several times from cache before it is written to disk. A value of 0% means that a destage is assumed for every write operation, which characterizes a pure random small-block write workload and is unlikely. A value of 50% means that a destage occurs after the track is written to twice. A value of 100% is also unlikely, because it means that all writes go to one track and are never destaged to disk.
In general, you describe the workload in these terms. The following sections cover the details
and describe the different workload types.
Important: Because of its large-block access and high response times, physically separate the sequential workload from the random small-block workload. Do not mix random and sequential workloads on the same physical disks; if you do, large amounts of cache are required on the disk system. Typically, high response times for small-block random access indicate sequential write activity (foreground or background) on the same disks.
It is not easy to divide known workload types into cache friendly and cache unfriendly. An application can change its behavior several times during the day. When users work with the data, it is cache friendly; when batch processing or reporting starts, it is not. A high percentage of random access generally indicates a cache-unfriendly workload. However, if the amount of data that is accessed randomly is not large, for example 10%, it can fit entirely into the disk system cache and the workload becomes cache friendly.
Sequential workloads are always cache friendly because of the prefetch algorithms in the disk system. A sequential workload is easy to prefetch: you know that the next 10 or 100 blocks are definitely going to be accessed, so you can read them in advance. Random workloads are different. However, there are no purely random workloads in real applications, and some access patterns can still be predicted. The DS8800 and DS8700 use the following powerful read-caching algorithms to deal with cache-unfriendly workloads:
Sequential Prefetching in Adaptive Replacement Cache
Adaptive Multi-stream Prefetching
The write workload is always cache friendly, because every write request goes to the cache first and the application receives the reply as soon as the request is placed into cache. On the back end, write requests take at least twice as long to service as read requests, and the write acknowledgment must always be waited for, which is why cache is used for every write request. Further improvement is still possible: the DS8800 and DS8700 systems use the Intelligent Write Caching algorithm, which handles write requests more efficiently.
See the following links to learn more about the DS8000 caching algorithms:
http://www.redbooks.ibm.com/abstracts/sg248886.html
http://www.almaden.ibm.com/storagesystems/resources/amp.pdf
http://www.almaden.ibm.com/cs/people/dmodha/sarc.pdf
Table 5-1 on page 161 provides a summary of the characteristics of the various types of
workloads.
The database environment is often difficult to typify, because I/O characteristics differ greatly. A database query has a high read content and is often sequential in nature, but it can also be random, depending on the query type and data structure. Transaction environments are more random in behavior and are sometimes cache unfriendly; at other times, they have good hit ratios. Several enhancements implemented in databases, such as sequential prefetch and the exploitation of I/O priority queuing, affect the I/O characteristics. Users need to understand the unique characteristics of their database before generalizing about its performance.
The workload pattern for database logging is mostly sequential writes with a blocksize of about 64 KB. Reads are rare and usually do not need to be considered. The write performance and the location of the online transaction logs are most important, because the overall performance of the database depends on the writes to the online transaction logs. If you expect high write rates to the database, plan to place the online transaction logs on RAID 10. Also, we strongly advise that you keep the log files physically separate from the disks on which the data and index files reside. For more information, see Chapter 17, “Database performance” on page 559.
A database can benefit from using a large amount of server memory for a large buffer pool. When managed properly, a large database buffer pool can satisfy a large percentage of the read accesses and avoid disk I/O. Depending on the application and the size of the buffer pool, the remaining disk accesses are then mostly read misses (synchronous reads in DB2), which translates into poor disk cache hit ratios. You can spread data across several RAID arrays to increase the throughput even if all accesses are read misses. DB2 administrators often require that tablespaces and their indexes are placed on separate volumes. This configuration improves both availability and performance.
Table 5-2 on page 164 summarizes these workload categories and common applications.
(Table 5-2 excerpt) Digital video editing: read/write ratio 100/0, 0/100, or 50/50; transfer sizes of 128 KB and 256 - 1024 KB; sequential access with good caching.
An example of a data warehouse is one designed around a financial institution and its functions, such as loans, savings, bank cards, and trusts. In this application,
there are three kinds of operations: initial loading of the data, access to the data, and
updating of the data. However, due to the fundamental characteristics of a warehouse, these
operations can occur simultaneously. At times, this application can perform 100% reads when
accessing the warehouse; 70% reads and 30% writes when accessing data while record
updating occurs simultaneously; or even 50% reads and 50% writes when the user load is
heavy. Remember that the data within the warehouse is a series of snapshots and after the
snapshot of data is made, the data in the warehouse does not change. Therefore, there is
typically a higher read ratio when using the data warehouse.
Object-Relational DBMS (ORDBMS) are now being developed, and they not only offer
traditional relational DBMS features, but also support complex data types. Objects can be
stored and manipulated, and complex queries at the database level can be run. Object data is
data about real objects, including information about their location, geometry, and topology.
Location describes their position, geometry relates to their shape, and topology includes their
relationship to other objects. These applications essentially have an identical profile to that of
the data warehouse application.
Depending on the host and operating system that are used to perform this application,
transfers are typically medium to large in size and access is always sequential. Image
processing consists of moving huge image files for editing. In these applications, the user
regularly moves huge high-resolution images between the storage device and the host
system. These applications service many desktop publishing and workstation applications.
Editing sessions can include loading large files of up to 16 MB into host memory, where users
edit, render, modify, and eventually store data back onto the storage system. High interface
transfer rates are needed for these applications, or the users waste huge amounts of time by
waiting to see results. If the interface can move data to and from the storage device at over 32
MBps, an entire 16 MB image can be stored and retrieved in less than one second. The need
for throughput is all important to these applications, and, along with the additional load of
many users, I/O operations per second are also a major requirement.
For general rules for application types, see Table 5-1 on page 161.
Transaction distribution
Table 5-4 breaks down the number of times that key application transactions are executed
by the average user and how much I/O is generated per transaction. Detailed application
and database knowledge is required to identify the number of I/Os and the type of I/Os per
transaction. The following information is a sample.
Table 5-5 Logical I/O profile from user population and transaction profiles
Transaction    Iterations    I/Os per user    I/O type    Average user I/Os    Peak user I/Os
Transfer money to checking 1000, 1000 100 RR, 1000 RW 300 RR, 3000 RW
Configure new bill payee 1000, 1000 100 RR, 1000 RW 300 RR, 3000 RW
As you can see in Table 5-6, to meet the peak workloads, you need to design an I/O
subsystem to support 6000 random reads/sec and 6000 random writes/sec:
Physical I/Os The number of physical I/Os per second from the host perspective
RR Random Read I/Os
RW Random Write I/Os
To determine the appropriate configuration to support your unique workload, see Appendix A,
“Performance management process” on page 631.
It is also possible to get the performance data for the DA pair or the rank. See
Example 5-2.
Example 5-2 Output of the performance data for 20 hours for rank 17 (output truncated)
dscli> lsperfrescrpt -start 20h r17
time resrc avIO avMB avresp %Hutl %hlpT %dlyT %impt
===================================================================
2011-11-11/09:15:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:20:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:25:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:30:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:35:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:40:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:45:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:50:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/09:55:00 R17 0 0.000 0.000 0 0 0 0
2011-11-11/10:00:00 R17 0 0.000 0.000 0 0 0 0
By default, statistics are shown for one hour. You can use settings that are specified in
days, hours, and minutes.
IBM Tivoli Storage Productivity Center for Disk
IBM Tivoli Storage Productivity Center for Disk is the tool to monitor the workload on your
DS8000 for a long period of time and collect historical data. This tool can also create
reports and provide alerts. See 7.2.1, “Tivoli Storage Productivity Center overview” on
page 237.
These commands are standard tools that are available with most UNIX and UNIX-like (Linux)
systems. We suggest using iostat for the data that you need to evaluate your host I/O levels.
Specific monitoring tools are also available for AIX, Linux, Hewlett-Packard UNIX (HP-UX),
and Oracle Solaris.
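As a minimal illustration (the intervals and counts are arbitrary examples, and the available flags differ between platforms), iostat can be run at a fixed interval to capture device-level statistics:

iostat -xk 5 3      # Linux (sysstat): extended per-device statistics in KB/s, three 5-second samples
iostat -D 5 3       # AIX: detailed per-disk statistics, including service times, three 5-second samples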
For more information, see Chapter 9, “Performance considerations for UNIX servers” on
page 327 and Chapter 12, “Performance considerations for Linux” on page 445.
For more information, see Chapter 10, “Performance considerations for Microsoft Windows
servers” on page 397.
IBM i environment
IBM i provides a vast selection of performance tools that can be used in performance-related
cases with external storage. Several of the tools, such as Collection services, are integrated
in the IBM i system. Other tools are a part of an IBM i licensed product. The management of
many IBM i performance tools is integrated into the IBM i web graphical user interface, IBM Systems Director Navigator for i, or into the iDoctor product.
The IBM i tools, such as Performance Explorer and iDoctor, are used to analyze the hot data
in IBM i and to size solid-state drives (SSDs) for this environment. Other tools, such as Job
Watcher, are used mostly in solving performance problems, together with the tools for
monitoring the DS8000.
For more information about the IBM i tools and their usage, see 13.4.1, “IBM i performance
tools” on page 484.
System z environment
The z/OS systems have proven performance monitoring and management tools that are
available to use for performance analysis. Resource Measurement Facility (RMF), a z/OS
performance tool, collects performance data and reports it for the desired interval. It also
provides cache reports. The cache reports are similar to the disk-to-cache and cache-to-disk
reports that are available in the Tivoli Storage Productivity Center for Disk, except that the
RMF cache reports are in text format. RMF collects the performance statistics of the DS8000
that are related to the link or port and also to the rank and extent pool. The REPORTS(ESS)
parameter in the RMF report generator produces the reports that are related to those
resources.
The RMF Spreadsheet Reporter is an easy way to create Microsoft Excel Charts based on
RMF postprocessor reports. It is used to convert your RMF data to spreadsheet format and generate representative charts for all performance-relevant areas.
For more information, see Chapter 14, “Performance considerations for System z servers” on
page 499.
IBM Specialists and IBM Business Partner specialists use the IBM Disk Magic tool for
modeling the workload on the systems. Disk Magic can be used to help to plan the DS8000
hardware configuration. With Disk Magic, you model the DS8000 performance when
migrating from another disk subsystem or when making changes to an existing DS8000
configuration and the I/O workload. Disk Magic is for use with both System z and Open
Systems server workloads.
When running the DS8000 modeling, you start from one of these scenarios:
An existing, non-DS8000 model, which you want to migrate to a DS8000
An existing DS8000 workload
Modeling a planned new workload, even if you do not have the workload currently running
on any disk subsystem
You can model the following major DS8000 components by using Disk Magic:
DS8000 model: DS8100, DS8300, DS8700, and DS8800
Cache size for the DS8000
Number, capacity, and speed of disk drive modules (DDMs)
Number of arrays and RAID type
Type and number of DS8000 host adapters (HAs)
Type and number of channels
Remote Copy option
When working with Disk Magic, always ensure that you input accurate and representative
workload information, because Disk Magic results depend on the input data that you provide.
Also, carefully estimate the future demand growth that you input to Disk Magic for modeling
projections. The hardware configuration decisions are based on these estimates.
More information about using Disk Magic is in 6.1, “Disk Magic” on page 176.
With these configuration settings, you can simulate and test most types of workloads. Specify
the workload characteristics to reflect the workload in your environment.
To test the sequential read speed of a rank, you can run the command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
The rvpath0 is the character or raw device file for the LUN presented to the operating system
by SDD. This command reads 100 MB off of rvpath0 and reports how long it takes in seconds.
Take 100 MB and divide by the number of seconds that is reported to determine the MB/s
read speed.
Linux: For Linux systems, use the appropriate /dev/sdX device or /dev/mpath/mpathn
device if you use Device-Mapper multipath.
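For example, the equivalent timed 100 MB read on a Linux system with Device-Mapper multipath might look like the following line (the device name mpatha is only an example):

time dd if=/dev/mpath/mpatha of=/dev/null bs=128k count=781   # 781 x 128 KB is approximately 100 MB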
Issue the following command and start the nmon monitor or iostat -k 1 command in Linux:
dd if=/dev/rvpath0 of=/dev/null bs=128k
Your nmon monitor (the e option) reports that this previous command imposed a sustained 100
MB/s bandwidth with a blocksize=128k on vpath0. Notice the xfers/sec column; xfers/sec is
IOPS. Now, if your dd command did not already error out because it reached the end of the
disk, press Ctrl+C to stop the process. Now, nmon reports idle. Next, issue the following dd
command with a 4 KB blocksize and put it in the background:
dd if=/dev/rvpath0 of=/dev/null bs=4k &
For this command, nmon reports a lower MB/s but a higher IOPS, which is the nature of I/O as a function of blocksize. Try your dd sequential read command with bs=1024k and you see a high MB/s but a reduced IOPS.
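As a rough illustration of this relationship (the numbers are examples, not measurements), throughput is approximately the IOPS multiplied by the blocksize:

# MB/s is roughly IOPS x blocksize(KB) / 1024
echo "800 * 128 / 1024" | bc     # about 100 MB/s at a 128 KB blocksize and 800 IOPS
echo "25600 * 4 / 1024" | bc     # the same 100 MB/s at a 4 KB blocksize needs about 25600 IOPS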
Try different blocksizes, different raw vpath devices, and combinations of reads and writes.
Run the commands against the block device (/dev/vpath0) and notice that blocksize does not
affect performance.
Because the dd command generates a sequential workload, you still need to generate the
random workload. You can use a free open-source tool, Vdbench.
Vdbench is a disk and tape I/O workload generator for verifying data integrity and measuring the performance of direct-attached and network-connected storage on Windows, AIX, Linux, and other operating systems.
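A minimal sketch of a Vdbench parameter file for such a random workload follows (the device name, thread count, read percentage, and run length are assumptions; see the Vdbench documentation for the full syntax):

# random_4k.parm: 70% read / 30% write, 4 KB transfers, fully random access (seekpct=100)
sd=sd1,lun=/dev/rvpath0,threads=8
wd=wd1,sd=sd1,xfersize=4k,rdpct=70,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=120,interval=5

The run is started with ./vdbench -f random_4k.parm, and the reported IOPS and response times can then be compared with the sequential dd results.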
The examples in this chapter demonstrate the steps that are required to model a storage
system for certain workload requirements. The examples and screen captures might refer to
previous versions of Disk Magic and previous IBM storage systems, such as ESS800 and
DS8300. However, they still provide guidance about the general steps that are involved in this
process.
Disk Magic: Disk Magic is available for use by IBM Business Partners, IBM
representatives, and users. Clients must contact their IBM representative to run Disk Magic
tool studies when planning for their DS8000 hardware configurations.
Disk Magic for Windows is a product of IntelliMagic B.V. Although we continue to refer to
this product as Disk Magic in this book, the product that is available from IntelliMagic for
clients is renamed to IntelliMagic Direction and might contain more features.
We provide basic examples only for the use of Disk Magic in this book. The version of the
product used in these examples might be outdated and contain fewer features than the
currently offered version for clients. For more information, see the product documentation
and guides. Or, for more information about the latest client version of this product, go to the
IntelliMagic website:
http://www.intellimagic.net
One of the problems with general rules is that they are based on assumptions about the
workload, which are lost or never documented in the first place. For example, a general rule
for I/O rates that applies to 4 KB transfers does not necessarily apply to 256 KB transfers. A
particular rule applies if the workloads are the same and all the hardware components are the
same, which they seldom are. Disk Magic overcomes this inherent lack of flexibility in rules by
allowing the person who runs the model to specify the details of the workload and the details
of the hardware. Disk Magic then computes the result in the response time and resource
utilization of running that workload on that hardware.
Disk Magic models are often based on estimates of the workload. For example, what is the
maximum I/O rate that the storage server sees? This I/O rate depends on identifying the
historical maximum I/O rate (which can require a bit of searching) and then possibly applying
adjustment factors to account for anticipated changes. The more you can substitute hard data
or even reasonable estimates to replace assumptions and guesses, the more accurate the
results are. In any event, a Disk Magic model is likely to be far more accurate than results
obtained by adjusting benchmark results or by applying general rules.
Disk Magic is calibrated to match the results of lab runs documented in sales materials and
white papers. You can view it as an encoding of the data obtained in benchmarks and
reported in white papers.
Different components can peak at different times. For example, a processor-intensive online
application might drive processor utilization to a peak while users are actively using the
system. However, disk utilization might be at a peak when the files are backed up during
off-hours. So, you might need to model multiple intervals to get a complete picture of your
processing environment.
z/OS environment
For each control unit to be modeled (current and proposed), we need the following
information:
Control unit type and model
Cache size
Nonvolatile storage (NVS) size
Disk drive module (DDM) size and speed
Number, type, and speed of channels
Parallel access volume (PAV), and whether it is used
Data collection
The preferred data collection method for a Disk Magic study is by using Tivoli Storage
Productivity Center. For each control unit to be modeled, collect performance data, create a
report for each control unit, and export each report as a comma-separated values (CSV) file.
You can obtain the detailed instructions for this data collection from the IBM representative.
The Help function in Disk Magic documents how to gather various types of Open Systems performance data by using commands, such as iostat in Linux/UNIX and perfmon in Windows. Disk Magic can also process PT reports from IBM i systems.
When working with Disk Magic, always ensure that you feed in accurate and representative
workload information, because Disk Magic results depend on the input data that is provided.
Also, carefully estimate future demand growth, which is fed into Disk Magic for modeling
projections on which the hardware configuration decisions are made.
After the valid base model is created, you proceed with your modeling. You change the
hardware configuration options of the base model to determine the best DS8000
configuration for a certain workload. Or, you can modify the workload values that you initially
entered, so that, for example, you can see what happens when your workload grows or its
characteristics change.
When we create a model for a DS8000 storage system, we select New SAN Project and the
panel, as shown in Figure 6-2 on page 181, opens. This panel shows the following options:
Create New Project Using Automated Input:
– zSeries or WLE Automated input (*.DMC) by using DMC input files:
Disk Magic Control (DMC) files contain a description of both the configuration and
workload of a z/OS environment for a specific time interval. This description is used as
a starting point for a detailed and accurate disk subsystem modeling study with Disk
Magic that is based on data collected with the z/OS Resource Management Facility.
Use the Resource Management Facility (RMF) Loader option to process raw RMF data
created with RMFPack on z/OS to create DMC files for Disk Magic.
– Open and iSeries® Automated Input (*.IOSTAT, *.TXT, *.CSV):
With this option, you can make Disk Magic process multiple UNIX/Linux iostat,
Windows perfmon, or iSeries PT reports, that are generated by multiple servers. Disk
Magic then consolidates these statistics across the servers, so that you can identify the
interval with the highest I/O rate, MB transferred rate, and so on.
– Tivoli Productivity Center Reports from disk subsystem (DSS) and SAN Volume
Controller (SVC) configurations (*.CSV):
This option creates a .CSV output file from Tivoli Storage Productivity Center.
Create New Project Using Manual Input:
– General Project: This option can be selected to create a project that initially consists of
a single Project, Model, System, and Disk Subsystem:
• Number of zSeries Servers
• Number of Open Servers
• Number of iSeries Servers
In a z/OS environment, disk subsystem measurement data is collected by the z/OS Resource
Management Facility. For more information, see 14.11, “RMF” on page 522. The disk
subsystem measurement data is stored in an SMF dataset. These zSeries I/O load statistics
can be entered manually or automatically into Disk Magic.
Manual data entry requires you to format Device and Cache Activity reports with the RMF post processor, which creates reports with device-level statistics and a summary by logical control unit (LCU).
The preferred option is to use an automated input process to load the RMF data into Disk
Magic by using a Disk Magic Control (DMC) file. DMC files contain a description of both the
configuration and workload of a z/OS environment for a specific time interval. This DMC file is
used as a starting point for a detailed and accurate disk subsystem modeling study with Disk
Magic.
You can use the RMF Loader option in Disk Magic, as shown in Figure 6-2 on page 181, to
process raw RMF data and create a Disk Magic Control (DMC) file.
Figure 6-3 Using RMF Loader in Disk Magic to create z/OS DMC input files
To be able to process the z/OS SMF dataset on a Windows system with Disk Magic, it must
be packed first with the RMFPack utility. RMFPack is part of the Disk Magic installation. The
Disk Magic installation provides two XMIT files for installation of RMFPack on the z/OS
system. The Disk Magic installation provides a PDF file that contains a detailed description of
how to install and use RMFPack on z/OS. Use RMFPack to create the input data for the RMF
Loader option. RMFPack creates an SMF file in ZRF format on z/OS to be downloaded in
binary to your Windows system. You can then create your DMC files by processing the data
with the RMF Loader option in Disk Magic, as shown in Figure 6-3, and determine the
intervals to use for modeling.
To read the DMC file as automated input into Disk Magic, select zSeries or WLE Automated
input (*.DMC) in the New SAN Project dialog panel, as shown in Figure 6-2 on page 181. By
using automated input, you have the options to make the Disk Magic process the model at the
disk subsystem (DSS), logical control unit (LCU), or device level (Figure 6-3). Considerations
for using one or the other are provided in the Disk Magic help text under “How to Perform
Device Level Modeling”.
For example, JL292059 means that the DMC file was created for the RMF period of July 29 at
20:59.
2. In this particular example, we select the JL292059.dmc file, which opens the following
window, as shown in Figure 6-5 on page 184.
3. We see that there are four LPARs (SYSA, SYSB, SYSC, and SYSD) and two disk subsystems
(IBM-12345 and IBM-67890).
Clicking the IBM-12345 icon opens the general information that relates to this disk
subsystem (Figure 6-6). It shows that this disk subsystem is an ESS-800 with 32 GB of
cache that was created by using the subsystem identifier (SSID) or logical control unit
(LCU) level. The number of subsystem identifiers (SSIDs) or LCUs is 12, as shown in the
Number of zSeries LCUs field.
4. Select Hardware Details on Figure 6-6 to open the window in Figure 6-7 on page 185.
You can change the following features, based on the actual hardware configuration of the
ESS-800:
– SMP type
– Number of host adapters
– Number of device adapters
– Cache size
5. Next, click the Interfaces tab, as shown in Figure 6-6 on page 184. We see that each
LPAR connects to the disk subsystem through eight Fibre Channel connection (FICON)
Express2 2 Gb channels. If this information is incorrect, you can change it by clicking Edit.
6. Selecting From Disk Subsystem in Figure 6-8 shows the interface that is used by the disk
subsystem. Figure 6-9 on page 186 indicates that ESS IBM-12345 uses eight FICON
ports.
In this panel, you also indicate whether there is a Remote Copy relationship between this
ESS-800 and a remote disk subsystem. You also can define the connections that are used
between the primary site and the secondary site.
7. Next, look at the DDM by clicking the zSeries Disk tab. The DDM type is 36 GB/15K rpm.
Because the DDM that we use is 73 GB/10K rpm, we update this information by clicking
Edit. The 3390 types or models are 3390-3 and 3390-9 in Figure 6-10. Because any 3390
model that has a greater capacity than a 3390-9 model shows as a 3390-9 in the DMC file,
we need to know the actual models of the 3390s. Generally, there is a mixture of 3390-9,
3390-27, and 3390-54 models.
8. To see the last option, select the zSeries Workload tab. Because this DMC file is created
by using the SSID or LCU option, we see the I/O statistics for each LPAR by SSID
(Figure 6-11 on page 187). Click the Average tab to the right of SYSA (the Average tab at
the top in Figure 6-11 on page 187), scroll to the right of SSID 4010, and click the Average
tab (the Average tab at the bottom in Figure 6-12 on page 187). We see the total I/O rate
from all four LPARs to this ESS-800, which is 9431.8 IOPS (Figure 6-12 on page 187).
Figure 6-12 zSeries I/O statistics from all LPARs to this ESS-800
9. Click Base to create the base model for this ESS-800. It is possible that a base model
cannot be created from the input workload statistics, for example, if there is excessive
CONN time that Disk Magic cannot calibrate against the input workload statistics. In this
case, we must identify another DMC from a different time period, and try to create the
base model from that DMC file. After creating this base model for IBM-12345, we must
also create the base model for IBM-67890 by following this same procedure.
Figure 6-13 zSeries merge and create new target disk subsystem
2. Because we want to merge the ESS-800s to a DS8300, we need to modify this Merge
Target1. Click IBM DS8100 on the Hardware Type option to open a window that presents
choices, where we can select the IBM DS8300 Turbo. We also select Parallel Access
Volumes so that Disk Magic can model the DS8300 to take advantage of this feature.
We can select the cache size. In this case, we select 64 GB, because each of the two
ESS-800s has 32 GB cache. In a DS8300, this selection automatically also determines
the nonvolatile storage (NVS) size.
Disk Magic computes the number of HAs on the DS8000 based on the specification on the
Interfaces page, but you can, to a certain extent, override these numbers. We suggest that
you use one host adapter (HA) for every two ports, for both the Fibre Channel connection
(FICON) ports and the Fibre ports. The Fibre ports are used for Peer-to-Peer Remote
Copy (PPRC) links. We enter 4 FICON Host Adapters, because we are using eight FICON
ports on the DS8300 (see the Count column in Figure 6-17 on page 190).
4. Click the Interfaces tab to open the From Servers dialog (Figure 6-16 on page 190).
Because the DS8300 FICON ports are running at 4 Gbps, we need to update this option
on all four LPARs and also on the From Disk Subsystem (Figure 6-17 on page 190) dialog.
If the Host CEC uses different FICON channels than the FICON channels that are
specified, it also needs to be updated.
Select and determine the Remote Copy Interfaces. Select the Remote Copy type and the
connections that are used for the Remote Copy links.
5. To select the DDM capacity and rpm used, click the zSeries Disk tab in Figure 6-17. Now,
you can select the DDM type that is used by clicking Edit in Figure 6-18 on page 191. In
our example, we select the DS8000 146GB/15k DDM.
Usually, you do not specify the number of volumes used. Disk Magic determines the
number by adding all the 3390s coming from the merge source disk subsystems. If you
know the configuration that is used as the target subsystem and want the workload to be
spread over all the DDMs in that configuration, you can select the number of volumes on
the target subsystem so that it reflects the number of configured ranks. You can also
specify the RAID type that is used for this DDM set.
6. Merge the second ESS onto the target subsystem. In Figure 6-19, right-click IBM-67890,
select Merge, and then, select Add to Merge Source Collection.
7. Perform the merge procedure. From the Merge Target window (Figure 6-20 on page 192),
click Start Merge.
8. This selection initiates Disk Magic to merge the two ESS-800s onto the new DS8300 and
creates Merge Result1 (Figure 6-21).
9. To see the DDM configured for the DS8300, select zSeries Disk on MergeResult1. You
can see the total capacity that is configured based on the total number of volumes on the
two ESS-800s (Figure 6-22 on page 193). There are 11 ranks of 146 GB/15K rpm DDM
required.
10.Select zSeries Workload to show the Disk Magic predicted performance of the DS8300.
You can see that the modeled DS8300 has an estimated response time of 1.1 msec. Disk
Magic assumes that the workload is spread evenly among the ranks within the extent pool
that is configured for the workload (Figure 6-23).
11.Click Utilization to display the utilization statistics of the various DS8300 components. In
Figure 6-24 on page 194, you can see that the Average FICON HA Utilization is 39.8%
and has a darker (amber) background color. This amber background is an indication that
the utilization of that resource is approaching its limit. This percentage is still acceptable,
but it is a warning that workload growth might push this resource to its limit.
Any resource that is a bottleneck is shown with a red background. If a resource has a red
background, you need to increase the size of that resource to resolve the bottleneck.
12.Figure 6-25 on page 195 can be used as a guideline for the various resources in a
DS8000. The middle column has an amber background color, and the rightmost column
has a red background color. The amber number indicates a warning that if the resource
utilization reaches this number, an increase in the workload might soon cause the
resource to reach its limit. The red numbers are the utilization numbers, which indicate
that the resource is already saturated and can cause an increase in one of the
components of the response time.
13.It is better if the merge result shows that none of the resource utilization falls into the
amber color category.
3. Click Plot to produce the response time components graph of the three disk subsystems
that you selected in a Microsoft Excel spreadsheet. Figure 6-28 is the graph that is created
based on the numbers from the Excel spreadsheet.
(Figure 6-28: response time components in msec for ESS-12345 @ 9432 IO/s, ESS-67890 @ 6901 IO/s, and the merged DS8300 @ 16332 IO/s.)
Now, select Plot. An error message indicates a host adapter Utilization > 100%. This
message means that you cannot increase the I/O rate up to 50000 I/O per second because of
a FICON host adapter bottleneck. Click OK to complete the graph creation. The graph is
based on the I/O rate increase as shown in Figure 6-30 on page 198.
Next, select Utilization Overview in the Graph Data choices, then click Clear and Plot to
produce the chart shown in Figure 6-31 on page 198.
(Figure 6-30: modeled response time in msec plotted against I/O rates from 16000 to 40000 IO/sec; the response time rises from about 1.1 msec toward 2.1 msec as the I/O rate grows.)
In Figure 6-31, observe that the FICON host adapter started to reach the red area at 22000
I/O per second. The workload growth projection stops at 40000 I/O per second, because the
FICON host adapter reaches 100% utilization when the I/O rate is greater than 40000 I/O per
second.
Resource (amber threshold / red threshold): utilization at I/O rates from 16000 to 40000 IO/sec in steps of 2000 IO/sec
Average SMP (60% / 80%): 10.7% 12.0% 13.4% 14.7% 16.0% 17.4% 18.7% 20.1% 21.4% 22.7% 24.1% 25.4% 26.7%
Average Bus (70% / 90%): 6.7% 7.5% 8.4% 9.2% 10.0% 10.9% 11.7% 12.5% 13.4% 14.2% 15.0% 15.9% 16.7%
Average Logical Device (n/a): 0.5% 0.6% 0.7% 0.7% 0.8% 0.9% 1.0% 1.1% 1.2% 1.3% 1.4% 1.5% 1.6%
Highest DA (60% / 80%): 14.9% 16.8% 18.7% 20.5% 22.4% 24.3% 26.1% 28.0% 29.9% 31.8% 33.6% 35.5% 37.4%
Highest HDD (60% / 80%): 27.2% 30.6% 34.0% 37.3% 40.7% 44.1% 47.5% 50.9% 54.3% 57.7% 61.1% 64.5% 67.9%
Average FICON HA (35% / 50%): 39.0% 43.9% 48.8% 53.7% 58.5% 63.4% 68.3% 73.2% 78.0% 82.9% 87.8% 92.7% 97.6%
Highest FICON Port (35% / 50%): 31.9% 35.9% 39.9% 43.9% 47.9% 51.9% 55.9% 59.9% 63.9% 67.9% 71.9% 75.8% 79.8%
In this example, we use the Tivoli Storage Productivity Center comma-separated values
(CSV) output file for the period that we want to model with Disk Magic. Typically, clients model
these periods:
Peak I/O period
Peak Read + Write throughput in MBps
Peak Write throughput in MBps
Tivoli Storage Productivity Center for Disk creates the Tivoli Storage Productivity Center csv
output files.
6.3.1 Process the Tivoli Storage Productivity Center CSV output file
We use these steps to process the CSV files:
1. From Figure 6-1 on page 180, we select Open and iSeries Automated Input and click
OK. This selection opens a window where you can select the csv files to use. In this case,
we select ESS11_TPC.csv, and then, while holding Ctrl down, we select
ESS14_TPC.csv. Then, click Open as shown in Figure 6-32.
2. The result is shown in Figure 6-33 on page 200. To include both csv files, click Select All
and then click Process to display the I/O Load Summary by Interval table (see Figure 6-37
on page 202). This table shows the combined load of both ESS11 and ESS14 for all the
intervals recorded in the Tivoli Storage Productivity Center CSV file.
3. Select Excel in Figure 6-37 on page 202 to create a spreadsheet with graphs for the I/O
rate (Figure 6-34), Total MBps (Figure 6-35 on page 201), and Write MBps (Figure 6-36
on page 201) for the combined workload on both of the ESS-800s by time interval.
Figure 6-34 shows the I/O Rate graph and that the peak is approximately 18000+ IOPS.
(Figure 6-34: I/O rate in IOPS by interval time, 1-Aug through 23-Aug; y-axis 0 - 20000.)
(Figure 6-35: total throughput in MBps by interval time, 1-Aug through 23-Aug; y-axis 0 - 6000.)
Also, investigate the situation for the peak write MBps on Figure 6-36. The peak period, as
expected, coincides with the peak period of the total MBps.
(Figure 6-36: write throughput in MBps by interval time, 1-Aug through 23-Aug; y-axis 0 - 1400.)
5. From this panel, select Add Model and then select Finish. A pop-up window prompts you
with “Did you add a Model for all the intervals you need?” because you can include
multiple workload intervals in the model. However, we model one workload interval, so we
respond Yes. The window in Figure 6-38 opens.
6. In Figure 6-38, double-click ESS11 to see the general information related to this disk
subsystem as shown in Figure 6-39 on page 203. Figure 6-39 on page 203 shows that this
disk subsystem is an ESS-800 with 16 GB of cache and 2 GB of NVS.
7. Select Hardware Details in Figure 6-39 to change the following features, based on the
actual hardware configuration of the ESS-800:
– SMP type
– Number of host adapters
– Number of device adapters
– Cache size
In this example (Figure 6-40), we change the cache size to the actual cache size of the
ESS11, which is 32 GB.
8. Next, click the Interface tab in Figure 6-39. The From Servers panel (Figure 6-41 on
page 204) shows that each server connects to the disk subsystem through four 2 Gb Fibre
Channels. If this information is not correct, you can change it by clicking Edit.
9. Select the From Disk Subsystem option in Figure 6-41 to display the interface used by
the disk subsystem. Figure 6-42 shows that ESS11 uses 8 Fibre 2 Gb Ports. We need to
know how many Fibre ports are used, because there are two servers that access this
ESS-800. Each server uses four Fibre Channels, so there can be up to eight Fibre ports
on the ESS-800. In this particular case, there are eight Fibre ports on the ESS-800. If
there are more (or fewer) Fibre ports, you can update this information by clicking Edit.
10.On this panel, you also indicate whether there is a Remote Copy relationship between this
ESS-800 and a remote disk subsystem. On this panel, you can choose to define the
connections used between the Primary site and the Secondary site.
11.To see the DDM, click the Open Disk tab. Figure 6-43 on page 205 shows DDM options
by server. We enter or select the actual configuration specifics of ESS11. ESS11 is
accessed by server Sys_ESS11.
13.Select the Total tab in Figure 6-43 to display the total capacity of the ESS-800, which is 12
TB on 28 RAID ranks of 73GB/10K rpm DDMs, as shown in Figure 6-44.
14.To see the last option, select the Open Workload tab in Figure 6-44. Figure 6-45 on page 206 shows that the I/O rate from Sys_ESS11 is 6376.8 IOPS, along with the corresponding service time.
Figure 6-46 Open Systems merge and the creation of a target disk subsystem
3. Selecting Hardware Details opens the window that is shown in Figure 6-48. With the
Failover Mode option, you can model the performance of the DS8000 when one processor
server with its associated processor storage is lost.
We can select the cache size, in this case 64 GB, which is the sum of the cache sizes of
ESS11 and ESS14. In a DS8300, this selection automatically also determines the NVS
size.
Disk Magic computes the number of HAs on the DS8000 based on the numbers that are
specified on the Interfaces page, but you can, to a certain extent, override these numbers.
We suggest that you use one HA for every two Fibre ports. In this case, we select four
Fibre HAs, because we use eight Fibre ports.
4. Click the Interfaces tab in Figure 6-47 to open the dialog that is shown in Figure 6-48.
5. Because the DS8300 Fibre ports run at 4 Gbps, we need to update this option on both
servers (Figure 6-49 on page 208) and also on the From Disk Subsystem dialog.
6. To select the DDM capacity and rpm used, click the Open Disk tab in Figure 6-51 on
page 209. Then, select the DDM used by clicking Add. Select the HDD type used
(146GB/15K rpm) and the RAID type (RAID 5), and enter capacity in GB (24000). Now,
click OK.
7. Now, merge the second ESS onto the target subsystem. In Figure 6-52, right-click ESS14,
select Merge, and then, select Add to Merge Source Collection.
8. To start the merge, in the Merge Target window that is shown in Figure 6-53 on page 210,
click Start Merge. This selection initiates Disk Magic to merge the two ESS-800s onto the
new DS8300. Use the pop-up window to select whether to merge all workloads or only a subset of them.
9. Click the Open Disk tab in Figure 6-54 to show the disk configuration. In this case, it is
24 TB on 28 ranks of 146 GB/15K rpm DDMs (Figure 6-55 on page 211).
10.Select the Open Workload tab in Figure 6-55 to show the Disk Magic predicted
performance of the DS8300. We see that the modeled DS8300 has an estimated service
time of 5.9 msec (Figure 6-56).
11.Click Utilizations in Figure 6-56 to show the utilization statistics of the various
components of the DS8300. In Figure 6-57 on page 212, we see that the Highest HDD
Utilization is 60.1% and has a darker (amber) background color. This amber background
is an indication that the utilization of that resource is approaching its limit. It is still
acceptable, but the color is a warning that a workload increase might push this resource to
its limit.
Any resource that is a bottleneck is shown with a red background. If a resource shows a
red background, you need to increase that resource to resolve the bottleneck.
Use Figure 6-25 on page 195 as a guideline for the various resources in a DS8000. The
middle column has an amber background color and the rightmost column has a red
background color. The amber number indicates a warning that if the resource utilization
reaches this number, a workload increase might soon cause the resource to reach its limit.
The red numbers are utilization numbers that cause an increase in one of the components of
the response time.
Hold Ctrl down and select ESS11, ESS14, and MergeResult1. Right-click any of them, and a
small window appears. Select Graph from this window. On the panel that appears
(Figure 6-59 on page 213), select Clear to clear any prior graph option settings. Click Plot to
produce the service time graph of the three selected disk subsystems (Figure 6-60 on
page 213).
(Figure 6-60: service time in msec by configuration: ESS11 @ 6377 IO/s, ESS14 @ 12490 IO/s, and the merged DS8300 @ 18867 IO/s; the modeled DS8300 shows 5.9 msec.)
Click Range Type, and choose I/O Rate. This selection fills the from field with the I/O rate of the current workload, which is 18,867.2 IOPS, the to field with 22,867.2, and the by field with 1,000. You can change these numbers. In our case, we change them to 18000, 40000, and 2000.
Now, select Plot. An error message appears that shows that the HDD utilization > 100%. This message indicates that we cannot increase the I/O rate up to 40000 IOPS because of the DDM bottleneck. Click OK to complete the graph creation.
The graph shows the service time plotted against the I/O rate increase as shown in
Figure 6-62.
(The plotted service time increases from 5.7 msec at 18000 IO/sec, through 6.3, 7.2, 8.5, 10.7, and 15.4 msec, to 33.4 msec at 30000 IO/sec.)
Figure 6-62 Open Systems service time projection with workload growth
After selecting Utilization Overview in the Graph Data choices, click Clear and click Plot,
which produces the resource utilization table in Figure 6-63.
Resource (amber threshold / red threshold): utilization at I/O rates from 18000 to 30000 IO/sec in steps of 2000 IO/sec
Average Bus (70% / 90%): 12.1% 13.5% 14.8% 16.2% 17.5% 18.9% 20.2%
Average Logical Device (n/a): 3.6% 4.6% 5.8% 7.7% 10.8% 17.4% 42.2%
Highest DA (60% / 80%): 8.9% 9.9% 10.9% 11.9% 12.9% 13.9% 14.9%
Highest HDD (60% / 80%): 57.3% 63.7% 70.1% 76.5% 82.8% 89.2% 95.6%
Average FICON HA (35% / 50%): 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Highest FICON Port (35% / 50%): 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Average Fibre HA (60% / 80%): 28.8% 32.0% 35.2% 38.4% 41.6% 44.8% 48.1%
Highest Fibre Port (60% / 80%): 26.7% 29.7% 32.7% 35.6% 38.6% 41.6% 44.5%
2. Click Add to add the new extent pool oPool2 with 4 SSD ranks, as depicted in Figure 6-69.
3. Click OK in the Add RAID Ranks dialog to complete the new extent pool creation. When you then click OK in the RAID Ranks for Open Systems dialog, Disk Magic asks whether you want to start the SSD advisor, as shown in Figure 6-70.
4. Answer Yes to the question and Disk Magic tries to identify a suitable server workload to
move onto SSDs. If it succeeds, the window that is depicted in Figure 6-71 on page 219
opens.
5. In this example, Disk Magic identified the AIX2 workload to move to SSDs. On the main
Disk Subsystem dialog, select Solve to create the model. Figure 6-72 and Figure 6-73 on
page 220 show the predicted workload statistics for AIX2 and AIX1.
6. The extent pool field now shows oPool2 for AIX2, because Disk Magic decided to move
this workload to SSDs.
7. The AIX1 workload also benefits from moving the AIX2 workload to SSDs, because the
HDDs are now less utilized, as shown in Figure 6-74.
(Figure 6-75, skew level curves: with 20% of the capacity on SSD, a heavy skew places about 80% of the IOPS on SSD, a medium skew about 55%, and a light skew about 37%.)
Release 9.1.x of Disk Magic introduced an extended set of five predefined skew levels: very low skew, low skew,
intermediate skew, high skew, and very high skew. The previous skew levels light, medium, and heavy proved to be
too conservative in general. Many workloads showed even higher skew levels. The new skew levels are adopted to
address this increase by introducing new high and very high skew levels. The previous skew levels now map to
light=very low skew, medium=low skew, and heavy=intermediate skew.
For Open Systems and System z workloads, Disk Magic uses the IBM TotalStorage
Productivity Center for Disk data that is provided as input. By examining the workload
distribution over the logical volumes, it estimates a skew level value. For a System z multi-tier
configuration, Disk Magic can also determine a skew level value so that the computed
average disconnect times for the tiers match the measured values.
To enable the Easy Tier modeling, Disk Magic provides a special settings dialog. On the
General tab of the Disk Subsystem dialog, you can see Easy Tier Settings, as depicted in
Figure 6-78 on page 224.
2. Select the zSeries Disk tab in the Disk Subsystem dialog. The new tab format appears.
Now, you can mix different drive types in the same extent pool, as shown in Figure 6-80 on
page 225. Click Add in the RAID Rank Definitions section to add new ranks into an extent
pool.
3. In our example, we add two SSD 300 GB ranks to extent pool zPool1, as depicted in
Figure 6-81.
4. Now, we are ready to start the modeling with a multi-tier configuration. In the main Disk
Subsystem dialog, click Solve to start the modeling. After the modeling completes, the
zSeries Workload tab reports the new workload statistics for the model, as depicted in
Figure 6-82 on page 226. The predicted Response Time improves from 2.8 msec to
1.8 msec.
6. As a result of introducing the SSDs, we can see a higher DA utilization and a lower HDD
utilization. Click Report on the main Disk Subsystem dialog (Figure 6-82). The report for
the new model is created. We can see the predicted I/O distribution across the tiers, as
reported in Figure 6-84 on page 227.
7. In this example, Disk Magic estimates that about 60% of I/Os are to be performed on SSD
ranks, which are about 13% of the total capacity. These figures agree with the high skew
level curve that is reported in Figure 6-75 on page 221.
In the previous example, we used the predefined skew level settings. However, in certain
cases, we can ask Disk Magic to estimate the skew level value directly from the workload
data. This estimate can be created for both System z and Open Systems.
8. For System z workloads that run in multi-tier configurations, Disk Magic calibrates the
skew level value for each LCU on the tiered extent pools that best matches the LCU
measured disconnect time. For non-LCU models, a single skew level is determined
similarly for the whole DSS. This feature is called Skew Value Calibration. To activate the
automatic skew level value calibration, in Figure 6-85, enable Easy Tier modeling and
clear the Use Predefined Skew Level check box. To restart the modeling, click Solve in
the main Disk Subsystem panel.
Figure 6-85 Easy Tier Settings panel with cleared Use Predefined Skew Level option
9. In the report that is produced by Disk Magic, you can see the skew values that are
estimated at the LCU level, as shown in Figure 6-86.
Figure 6-86 Disk Magic report tiering information with calibrated skews
Hint: When the workload measurements are inadequate for a skew value calibration, Disk
Magic returns the default skew value (2.00) for all the LCUs or DSS. Use the skew value
calibration feature only with real workloads that are provided through RMF data (RMF
Loader).
Disk Magic provides another skew value estimation capability with the Tivoli Productivity
Center Loader facility. This skew value calculation method applies both to System z and Open
Systems. The following inputs are needed by the Tivoli Productivity Center Loader to estimate
the skew level:
CSV files that contain the Tivoli Productivity Center data at the volume level
CSV file that contains the output of the DS8000 command-line interface (DSCLI)
command:
– lsfbvol -l -fmt delim -delim ; -fullid for Open Systems
– lsckdvol -l -fmt delim -delim ; -fullid for System z
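For instance, the DSCLI output can be captured with single-shot invocations similar to the following sketch (the HMC address, user ID, password, and output file names are placeholders; the semicolon delimiter must be escaped from the shell):
# capture the fixed-block and CKD volume lists for the loader
dscli -hmc1 <hmc_ip> -user <user> -passwd <password> lsfbvol -l -fmt delim -delim \; -fullid > fbvol.csv
dscli -hmc1 <hmc_ip> -user <user> -passwd <password> lsckdvol -l -fmt delim -delim \; -fullid > ckdvol.csv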
The Tivoli Productivity Center Loader tries to calculate the skew value by analyzing the logical
volume I/O density. If a skew value is successfully calculated, it is reported on the Easy Tier
Settings dialog.
Disk Magic performs skew value estimations by making assumptions that are based on
measured performance statistics, such as disconnect times and I/O density. Although these
assumptions are generally valid, in certain cases they do not hold. For example,
large LUNs with a high I/O density can lead Disk Magic to overestimate the amount of extents
to be placed into the higher performance tier, which results in an overly aggressive sizing.
These guidelines do not represent any type of guarantee of performance and do not replace
the more accurate estimation techniques that can be obtained from a Disk Magic study.
Workload
The example is based on an online workload with the assumption that the transfer size is 4K
and all the read and write operations are random I/Os. The workload is a 70/30/50 online
transaction processing (OLTP) workload, which is an online workload with 70% reads, 30%
writes, and a 50% read-hit-ratio. We estimate these workload characteristics:
Maximum host I/O rate is 10000 IOPS.
Write efficiency is 33%, which means that 67% of the writes are destaged.
Ranks use a RAID 5 configuration.
DDM speed
For I/O intensive workloads, consider 15K rpm DDMs, or a mix with SSDs.
DDM capacity
You must estimate the capacity. Choose from among 146 GB, 300 GB, 450 GB, 600 GB, or
900 GB DDMs based on these factors:
Total capacity that is needed in GBs
Estimated Read and Write I/O rates
RAID type used
For a discussion about RAID levels, space efficiency, and write penalty, see 4.7, “Planning
RAID arrays and ranks” on page 103.
RAID 5: The RAID 5 write penalty of four I/O operations per write is shown in Table 4-1 on
page 106 in the Performance Write Penalty column.
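As a rough illustration of how these workload assumptions and the RAID 5 write penalty combine (a back-of-the-envelope sketch only, not a substitute for a Disk Magic model):
# estimated back-end I/O rate for the 70/30/50 workload described above
host_iops=10000
read_misses=$(( host_iops * 70 / 100 * 50 / 100 ))   # 70% reads with a 50% read-hit ratio -> 3500 back-end reads
destages=$(( host_iops * 30 / 100 * 67 / 100 ))      # 30% writes, 67% destaged -> 2010 destage operations
backend_iops=$(( read_misses + destages * 4 ))       # RAID 5 write penalty of 4 -> about 11540 back-end IOPS
echo "Estimated back-end I/O rate: ${backend_iops} IOPS"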
Depending on the DDM size, the table in Figure 6-87 shows how much capacity you can get
with 18 ranks for three selected drive types. Knowing the total GB capacity that is needed for
this workload, use this chart to select the DDM size that can meet the capacity requirement.
Tip: Use larger DDM sizes only for applications that are less I/O intensive, or as part of a
tiered configuration.
For Open Systems loads, make sure that the DS8000 adapters and ports do not reach their
throughput limits (MB/s values). If you must fulfill specific throughput requirements, ensure
that each port under peak conditions is only loaded with a maximum of approximately 70% of
its nominal throughput. Also, consider the guidelines in 4.10.1, “I/O port planning
considerations” on page 147.
If you do not have throughput requirements, use the following rule to obtain an initial estimate
of the number of host adapters:
For each TB of capacity that is used, configure a nominal throughput of 100 MB/s.
For instance, 64 TB disk capacity then leads to 6400 MB/s required nominal throughput.
With 8-Gbps ports assumed, you need eight DS8000 I/O ports.
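A quick sketch of this rule in shell arithmetic (the 800 MB/s figure is an assumed nominal value for one 8 Gbps Fibre Channel port):
capacity_tb=64
required_mbps=$(( capacity_tb * 100 ))      # 100 MB/s of nominal throughput per TB -> 6400 MB/s
port_mbps=800                               # assumed nominal throughput of one 8 Gbps FC port
ports=$(( (required_mbps + port_mbps - 1) / port_mbps ))
echo "I/O ports needed (nominal): $ports"   # eight ports in this example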
In 4.10.1, “I/O port planning considerations” on page 147, we learned that the performance of
an 8 Gbps host adapter (HA) does not scale with more than four ports. Therefore, you need to
plan for a minimum of two 8 Gbps host adapters in this example, which is based on an
estimate for the throughput.
With an actual throughput requirement of this size, consider that the actual throughput that
can be sustained by a host adapter also depends on the workload profile and, therefore, is
typically less than the sum of the nominal 8 Gbps Fibre Channel port throughputs. In general,
plan for a utilization that is no more than approximately 70% of the nominal throughput. In this
case, we suggest more than two host adapters. Contact your IBM representative for an
appropriate sizing that is based on your actual workload requirements.
DS8000 cache
Use Figure 6-88 as a guide for the DS8000 cache size if you do not have workload history
that you can use.
STAT overview
The advisor tool processes data that is collected by the Easy Tier monitor. The DS8000
monitoring capabilities are available regardless of whether you install and activate the Easy
Tier license feature on your DS8000. The monitoring capability of the DS8000 enables it to
monitor the usage of storage at the volume extent level. Monitoring statistics are gathered
and analyzed every 24 hours. The results are condensed into summary monitor
data that can be downloaded from a DS8000 for reporting with the advisor tool.
The advisor tool provides a graphical representation of performance data that is collected by
the Easy Tier monitor over a 24-hour operational cycle. You can view the information that is
displayed by the advisor tool to analyze workload statistics and evaluate which logical
volumes might be candidates for Easy Tier management. If the Easy Tier feature is not
installed and enabled, you can use the performance statistics that are gathered by the
monitoring process to help you determine whether to use Easy Tier to enable potential
performance improvements in your storage environment and to determine optimal SSD or
HDD configurations and benefits.
After you know your microcode version (DSCLI command ver -l), you can download the
suitable STAT version from the IBM FTP site:
ftp://ftp.software.ibm.com/storage/ds8000/updates/DS8K_Customer_Download_Files/Storage_Tier_Advisor_Tool/
To extract the summary performance data that is generated by the Storage Tier Advisor Tool,
you can use either the DSCLI or DS Storage Manager. When you extract summary data, two
files are provided, one for each processor complex in the Storage Facility Image (SFI server).
The download operation initiates a long running task to collect performance data from both
selected storage facility images. This information can be provided to IBM if performance
analysis or problem determination is required.
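For example, a minimal sketch of the DSCLI path (this assumes a DSCLI level that provides the offloadfile command with the -etdata parameter; the HMC address, credentials, and target directory are placeholders):
# offload the Easy Tier summary data; two files are written, one per processor complex
dscli -hmc1 <hmc_ip> -user <user> -passwd <password> offloadfile -etdata /tmp/etdata
The offloaded files are then used as input to the STAT.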
Availability: Easy Tier monitor is available on any DS8000 system, starting with LMC level
6.5.1.xx (DS8700) or LMC 7.6.10.xx (DS8800), with or without the Easy Tier LIC feature
or SSDs.
STAT improvement
In its initial version, the STAT provided its estimated performance improvement and SSD configuration
guidelines for the storage system as a whole.
The newer version of the STAT that is available with DS8000 microcode R6.2 is more granular
and provides a broader range of recommendations and benefit estimations. The
recommendations are available for all the supported multi-tier configurations, and they are
provided at the individual extent pool level.
Another improvement in the STAT is a better way to calculate the performance estimation.
Previously, the STAT took the heat values of each bucket as linear, which resulted in an
inaccurate estimation when the numbers of extents in each bucket are disproportional. The
STAT now uses the average heat value that is provided at the sub-LUN level to provide a
more accurate estimate of the performance improvement.
The STAT (Figure 6-89) describes the Easy Tier statistical data that is collected by the
DS8000 in detail and it produces reports in HTML format that can be viewed by using a
standard browser. These reports provide information at the levels of a DS8000 storage
system, the extent pools, and the volumes. The reports also show whether certain pools are
bandwidth (BW)-overloaded, IOPS-overloaded, or skewed. Sizing recommendations and
estimated benefits are also in the STAT reports.
The STAT provides information about the amount of hot data, the length of time that it takes to
apply automatic tiering, and suggestions about the amount of SSD capacity (higher tier) that
can be added beneficially. Also, the STAT provides information about the nearline capacity
that can be added to save costs when a large amount of cold data is detected.
Figure 6-89 shows the first output view of a STAT analysis, which is the System Summary.
This report provides information about all the extent pools, how many tiers make up each
extent pool, and the total amounts of monitored data and hot data.
When the STAT detects a significant amount of cold data, the STAT provides suggestions
(Figure 6-90 on page 233) to add SSDs and nearline drives to a DS8000 to cold-demote data
that is infrequently accessed onto slower and less costly storage space.
When your system consists of a limited number of available SSD ranks, or nearline ranks, the
STAT recommendation shows you which pool might benefit most by adding those ranks to
existing Enterprise HDD ranks.
The next series of views is by extent pool. When you click a certain storage pool, you can see
additional and detailed recommendations for improvements at the level of each extent pool.
See Figure 6-91.
Figure 6-91 Report that shows that storage pool 0000 needs SSDs and skew
Figure 6-91 shows an extent pool view, for a pool that currently consists of two Enterprise
(15K/10K) HDD ranks, with both hot and cold extents. This pool can benefit from adding one
solid-state drive (SSD) rank and one nearline rank. You can select the types of drives that you
want to add for a certain tier through the small pull-down menus on the left side. These menus
contain all the drive and RAID types for a certain type of tier. For instance, when adding more
Enterprise drives is suggested, the STAT can calculate the benefit of adding drives in any of those drive and RAID types.
If adding multiple ranks of a certain tier is beneficial for a certain pool, the STAT modeling
offers improvement predictions for the expected performance gains when adding two, three,
or more ranks up to the recommended number.
Another view within each pool is the Volume Heat Distribution report. Figure 6-92 shows all
the volumes of a certain pool. For each volume, the view shows the amount of capacity that is
allocated to each tier and the distribution within each tier among hot, warm, and cold data.
Figure 6-92 Storage pool statistics and recommendations: Volume Heat Distribution view
In this view, three heat classes are visible externally. Internally, however, DS8000 Easy Tier
monitoring uses a more granular extent temperature in heat buckets. This detailed Easy Tier
data can be retrieved by IBM support for extended studies of a client workload situation.
Table 7-1 Tivoli Storage Productivity Center supported activities for performance processes
Process      Activities                                      Feature
Operational  Performance data collection for port, array,    Performance monitor jobs that use the Native API (NAPI)
             volume, and switch metrics
Tactical     Performance analysis and tuning                 Tool facilitates through data collection and reporting
Tactical     Short term trending                             GUI charting facilitates trend lines, and reporting options facilitate export to analytical tools
Strategic    Long term trending                              GUI charting facilitates trend lines, and reporting options facilitate export to analytical tools
Certain features that are required to support the performance management processes are
not provided in Tivoli Storage Productivity Center. These features are shown in Table 7-2 on
page 237.
Process      Activities                                      Feature
Operational  Host performance data collection and alerting   Native host tools. See the OS chapters for more detail.
Tactical     Host performance analysis and tuning            Native host tools. See the OS chapters for more detail.
For a full list of the features that are provided in each of the Tivoli Storage Productivity Center
SE components, visit the IBM website:
http://www.ibm.com/systems/storage/software/center/standard/index.html
Tivoli Storage Productivity Center V4.2 provides a new access method to gather information
from devices. This method is called the Native API (NAPI) and is at this time available for only
a limited number of disk storage systems.
With the introduction of Native API, another architectural change is introduced: the External
Process Manager (EPM). This process manager is the link between the devices used by
NAPI and Tivoli Storage Productivity Center. It is called External Process Manager, because
now the jobs for the NAPI devices are started as external processes in the operating system,
and are no longer running as threads within the Device server process. The advantage is that
the scalability and reliability are increased. Figure 7-2 on page 239 shows the high-level
architecture of the EPM. You can see that the EPM starts external processes for each type of
device and each type of job.
For the devices we listed, Tivoli Storage Productivity Center V4.2 uses only the Native API.
When you upgrade to Tivoli Storage Productivity Center V4.2, an update/migration is required
to switch to the NAPI, which can be done before or during the installation (or even later, but
you cannot use the device until you complete the migration). For that reason, the
Supported Storage Products Matrix does not list any provider versions or Interop Namespace
for the IBM supported devices that are listed. In addition to this new interface, the device
server is modified, so that together with the NAPI, the scalability and reliability are enhanced.
Tivoli Storage Productivity Center is still not trying to replace the management tools for those
devices, but at the same time, clients asked for better integration of IBM devices. As an
example, for the DS8000, specifying the logical subsystem (LSS) when provisioning volumes
was not possible. It is now possible with Tivoli Storage Productivity Center V4.2.
The Storage Management Initiative (SMI) standard does not include this level of detail,
because the intention of SMI-S is to abstract from the actual hardware devices.
There are many components in a Tivoli Storage Productivity Center environment. An example
of the complexity of a Tivoli Storage Productivity Center environment is provided in
Figure 7-3.
(Figure 7-3 shows an example environment: the Tivoli Storage Productivity Center server with proxy CIMOMs, several GUI instances, and the internal and external HMCs of the managed storage systems.)
This architecture was designed by the Storage Networking Industry Association (SNIA), an
industry workgroup. The architecture is not simple, but it is “open,” which means that any
company can use SMI-S standard CIMOMs to manage and monitor storage and switches.
Tivoli Storage Productivity Center can collect the DS8000 data, but it also can collect SAN
fabric performance data when using Tivoli Storage Productivity Center Standard Edition.
Tivoli Storage Productivity Center can gather information about the component levels, as
shown in Figure 7-4 on page 241 for the DS8000 systems. Displaying a metric within Tivoli
Storage Productivity Center depends on the ability of the storage subsystem and the NAPI to
provide the performance data and related information, such as the values that are assigned to
processor complexes. We guide you through the diagram in Figure 7-4 on page 241 by
drilling down from the overall subsystem level.
Metrics: A metric is a numerical value that is derived from the information that is provided
by a device. It is not only the raw data, but a calculated value. For example, the raw data is
the transferred bytes, but the metric uses this value and the interval to show the
bytes/second.
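As a small illustration of that derivation (the counter values and the 300-second interval are made up):
# turn two raw byte counters and the sample interval into a MB/s metric
awk -v prev=123456789 -v curr=987654321 -v interval=300 \
    'BEGIN { printf "%.2f MB/s\n", (curr - prev) / (interval * 1048576) }'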
(Figure 7-4 breaks down the subsystem into its two controllers, the write cache and write cache mirror of each controller, the host connections, and the arrays, which are sometimes called enclosures, 8-Packs, or Mega Packs.)
The amount of available information or metrics depends on the type of subsystem involved.
The SMI-S standard does not require vendors to provide detailed performance data. For the
DS8000, IBM provides extensions to the standard that include much more information than
the information that is required by the SMI-S standard. Therefore, the NAPI is used.
Important: If you migrate a NAPI device either before or as part of the upgrade to Tivoli
Storage Productivity Center V4.2, any embedded DS8000 CIMOMs, SAN Volume
Controller CIMOMs, and XIV CIMOMs are automatically deleted from Tivoli Storage
Productivity Center. Proxy DS CIMOMs are not automatically deleted, even if Tivoli
Storage Productivity Center knows of no other devices configured on that CIMOM.
Cache
The cache in Figure 7-4 on page 241 is a subcomponent of the subsystem, because the
cache plays a crucial role in the performance of any storage subsystem. You do not find the
cache as a selection in the navigation tree in Tivoli Storage Productivity Center, but there are
available metrics that provide information about cache.
Cache metrics for the DS8000 are available in the following report types:
Subsystem
Controller
Array
Volume
Cache metrics
Metrics, such as disk-to-cache operations, show the number of data transfer operations from
disks to cache. The number of data transfer operations from disks to cache is called staging
for a specific volume. Disk-to-cache operations are directly linked to read activity from hosts.
When data is not found in the DS8000 cache, the data is first staged from back-end disks into
the cache of the DS8000 server and then transferred to the host.
Read hits occur when all the data requested for a read data access is in cache. The DS8000
improves the performance of read caching by using Sequential Prefetching in Adaptive
Replacement Cache (SARC) staging algorithms. For more information about the SARC
algorithm, see 1.3.1, “Advanced caching techniques” on page 8. The SARC algorithm seeks
to store those data tracks that have the greatest probability of being accessed by a read
operation in cache.
The cache-to-disk operation shows the number of data transfer operations from cache to
disks, which is called destaging for a specific volume. Cache-to-disk operations are directly
linked to write activity from hosts to this volume. Data written is first stored in the persistent
memory (also known as nonvolatile storage (NVS)) at the DS8000 server and then destaged
to the back-end disk. The DS8000 destaging is enhanced automatically by striping the
volume across all the disk drive modules (DDMs) in one or several ranks (depending on your
configuration). This striping, or volume management done by Easy Tier, provides automatic
load balancing across DDMs in ranks and an elimination of the hot spots.
The Write-cache Delay I/O Rate or Write-cache Delay Percentage due to persistent memory
allocation gives us information about the cache usage for write activities. The DS8000 stores
data in the persistent memory before sending an acknowledgement to the host. If the
persistent memory is full of data (no space available), the host receives a retry for its write
request. In parallel, the subsystem must destage the data that is stored in its persistent
memory to the back-end disk before accepting new write operations from any host.
If a volume experiences write operations that are delayed due to persistent memory constraints,
consider moving the volume to a lesser used rank or spreading this volume across multiple ranks
(increase the number of DDMs used). If this solution does not fix the persistent memory
constraint problem, consider adding cache capacity to your DS8000.
You can use the controller reports to identify whether the DS8000 processor complexes are busy
and whether the persistent memory is sufficient. Write delays can occur due to write performance
limitations on the back-end disk (at the rank level) or limitation of the persistent memory size.
Ports
The port information reflects the performance metrics for the front-end DS8000 ports that
connect the DS8000 to the SAN switches or hosts. Additionally, port error rate metrics, such
as Error Frame Rate, are also available. The DS8000 host adapter (HA) card has four or eight
ports. The SMI-S standards do not reflect this aggregation so Tivoli Storage Productivity
Center does not show any group of ports that belong to the same HA. Monitoring and
analyzing the ports that belong to the same card are beneficial, because the aggregate
throughput is less than the sum of the stated bandwidths of the individual ports. For more
information about the DS8000 port cards, see 2.5.1, “Fibre Channel and FICON host
adapters” on page 46.
Port metrics: Tivoli Storage Productivity Center reports on many port metrics, because
the ports on the DS8000 are the front-end part of the storage device.
Volumes on the DS8000 storage systems are primarily associated with an extent pool and an
extent pool relates to a set of ranks. To quickly associate all arrays with their related ranks and
extent pools, use the output of the DSCLI lsarray -l and lsrank -l commands, as shown in
Example 7-1 on page 245.
Certain Tivoli Storage Productivity Center examples and figures in this book might refer to
Tivoli Storage Productivity Center identifiers used in versions before Version 4.2 and refer
to the DS8000 array sites instead of arrays.
Figure 7-5 TPC 4.2.x storage subsystem performance report by array that shows statistics for DS8000 arrays (Axy)
A DS8000 array site consists of a group of eight DDMs. A DS8000 array is defined on an
array site with a specific RAID type. A rank is a logical construct to which an array is
assigned. A rank provides a number of extents that are used to create one or several
volumes. A volume can use the DS8000 extents from one or several ranks. For more
information, see 3.2.1, “Array sites” on page 58, 3.2.2, “Arrays” on page 58, and 3.2.3,
“Ranks” on page 59.
Example 7-1 shows the relationships among a DS8000 rank, an array, and an array site with
a typical divergent numbering scheme by using DSCLI commands. Use the showrank
command to show which volumes have extents on the specified rank.
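For instance, the relevant output can be captured for offline cross-referencing with a sketch such as the following (the HMC address and credentials are placeholders, and R0 stands in for any rank ID):
# save the array, rank, and rank-to-volume relationships for later analysis
dscli -hmc1 <hmc_ip> -user <user> -passwd <password> lsarray -l  > arrays.txt
dscli -hmc1 <hmc_ip> -user <user> -passwd <password> lsrank -l   > ranks.txt
dscli -hmc1 <hmc_ip> -user <user> -passwd <password> showrank R0 > rank_R0.txt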
The Tivoli Storage Productivity Center array performance reports include both front-end and
back-end metrics. The back-end metrics are specified by the keyword Backend. They provide
metrics from the perspective of the controller to the back-end array sites. The front-end
metrics relate to the activity between the server and the controller.
There is a relationship between array operations, cache hit ratio, and percentage of read
requests. When the cache hit ratio is low, the DS8000 has frequent transfers from DDMs to
cache (staging).
When the percentage of read requests is high and the cache hit ratio is also high, most of the
I/O requests can be satisfied without accessing the DDMs due to the cache management
prefetching algorithm.
When the percentage of read requests is low, the DS8000 write activity to the DDMs can be
high. The DS8000 has frequent transfers from cache to DDMs (destaging).
Comparing the performance of different arrays shows whether the global workload is equally
spread on the DDMs of your DS8000. Spreading data across multiple arrays increases the
number of DDMs used and optimizes the overall performance.
Important: Back-end write metrics do not include the RAID overhead. In reality, the
RAID 5 write penalty adds additional unreported I/O operations.
Tivoli Storage Productivity Center might not be able to calculate some of the volume-based
metrics in the array report for multi-rank extent pools and for multi-rank extent pools that contain
space-efficient volumes. In this case, the columns for volume-based metrics display the value N/A.
The following message might be reported by Tivoli Storage Productivity Center in conjunction
with multi-rank extent pools:
HWNPM2060W The device does not support performance management for segment pool pool ID.
Only incomplete performance data can be collected for array array ID.
Explanation
The specified segment pool contains multiple ranks, which makes it impossible to
accurately manage the performance for those ranks, the arrays associated with those
ranks, and the device adapters associated with those arrays.
For DS6000 and DS8000 devices whenever a segment pool contains multiple ranks, any
volumes allocated in that segment pool might be spread across those ranks in an
unpredictable manner. This makes it impossible to determine the performance impact of
the volumes on the individual ranks. To avoid presenting the user with inaccurate or
misleading performance data, the Performance Manager does not attempt to compute the
performance metrics for the affected arrays and device adapters.
Volumes
The volumes, which are also called logical unit numbers (LUNs), are shown in Figure 7-7 on
page 247. The host server sees the volumes as physical disk drives and treats them as
physical disk drives.
Analysis of volume data facilitates the understanding of the I/O workload distribution among
volumes, as well as workload characteristics (random or sequential and cache hit ratios). A
DS8000 volume can belong to one or several ranks, as shown in Figure 7-7 (for more
information, see 3.2.7, “Extent allocation methods (EAMs)” on page 68). Especially in
managed multi-rank extent pools with Easy Tier automatic data relocation enabled, the
distribution of a certain volume across the ranks in the extent pool can change over time.
The analysis of volume metrics shows the activity of the volumes on your DS8000 and can
help you perform these tasks:
Determine where the most accessed data is located and what performance you get from
the volume.
Understand the type of workload that your application generates (sequential or random
and the read or write operation ratio).
Determine the cache benefits for the read operation (cache management prefetching
algorithm SARC).
Determine cache bottlenecks for write operations.
Compare the I/O response observed on the DS8000 with the I/O response time observed
on the host.
The relationship of certain array sites and ranks to the DS8000 volumes can be derived from
the Tivoli Storage Productivity Center configuration report, as shown in Figure 7-8 on
page 248. In addition, to quickly associate the DS8000 arrays to array sites and ranks, you
might use the output of the DSCLI commands lsrank -l and lsarray -l, as shown in
Example 7-1 on page 245.
Figure 7-8 Array site to volume breakdown using Tivoli Storage Productivity Center
Random read       Attempt to find data in cache. If not present in cache, read from back end.
Sequential write  Write data to NVS of the processor complex owning the volume and send a copy of the data to cache in the other processor complex. Upon back-end destaging, perform prefetching of read data and parity into cache to reduce the number of disk operations on the back end.
Random write      Write data to NVS of the processor complex owning the volume and send a copy of the data to cache in the other processor complex. Destage modified data from NVS to disk as determined by microcode.
The DS8000 lower interfaces use switched Fibre Channel (FC) connections, which provide a
high data transfer bandwidth. In addition, the destage operation is designed to avoid the write
penalty of RAID 5, if possible. For example, there is no write penalty when modified data to be
destaged is contiguous enough to fill the unit of a RAID 5 stride. A stride is a full RAID 5
stripe. However, when all of the write operations are random across a RAID 5 array, the
DS8000 cannot avoid the write penalty.
The read hit ratio depends on the characteristics of data on your DS8000 and applications
that use the data. If you have a database and it has a high locality of reference, it shows a
high cache hit ratio, because most of the data referenced can remain in the cache. If your
database has a low locality of reference, but it has the appropriate sets of indexes, it might
also have a high cache hit ratio, because the entire index can remain in the cache.
We suggest that you monitor the read hit ratio over an extended period of time:
If the cache hit ratio is historically low, it is most likely due to the nature of the data access
patterns. Defragmenting the filesystem and making indexes if none exist might help more
than adding cache.
If you have a high cache hit ratio initially and it decreases as the workload increases,
adding cache or moving part of the data to volumes associated with the other processor
complex might help.
For a logical volume that has sequential files, it is key to understand the application types that
access those sequential files. Normally, these sequential files are used for either read only or
write only at the time of their use. The DS8000 cache management prefetching algorithm
(SARC) determines whether the data access pattern is sequential. If the access is sequential,
contiguous data is prefetched into cache in anticipation of the next read request.
Tivoli Storage Productivity Center for Disk reports the reads and writes through various
metrics. See 7.3, “Tivoli Storage Productivity Center data collection” on page 250 for a
description of these metrics in greater detail.
Although the time information of the device is written to the database, reports are always
based on the time of the Tivoli Storage Productivity Center server. Tivoli Storage Productivity
Center receives the time zone information from the devices (or the NAPIs) and uses this
information to adjust the time in the reports to the local time. Certain devices might convert
the time into Greenwich mean time (GMT) time stamps and not provide any time zone
information.
This complexity is necessary to be able to compare the information from two subsystems in
different time zones from a single administration point. This administration point is the GUI,
not the Tivoli Storage Productivity Center server. If you open the GUI in different time zones, a
performance diagram might show a distinct peak at different times, depending on its local
time zone.
To ensure that the time stamps on the DS8000 are synchronized with the other infrastructure
components, the DS8000 provides features for configuring a Network Time Protocol (NTP)
server. To modify the time and configure the hardware management console (HMC) to use an
NTP server, the following steps are required:
1. Log on to HMC.
2. Select HMC Management.
3. Select Change Date and Time.
4. A dialog box similar to Figure 7-9 on page 251 opens. Change the Time to match the
current time for the time zone.
5. To configure an NTP server, select the NTP Configuration tab. A dialog box similar to
Figure 7-10 opens.
6. Select Add NTP Server and provide the IP address and the NTP version.
7. Check Enable NTP service on this HMC and click OK.
Reboot: These configuration changes require a reboot of the HMC. These steps were
tested on DS8000 code Version 4.0 and later.
7.3.2 Duration
Tivoli Storage Productivity Center can collect data continuously. From a performance
management perspective, collecting data continuously means that performance data exists to
facilitate reactive, proactive, and even predictive processes, as described in Chapter 7,
“Practical performance management” on page 235. For ongoing performance management of
the DS8000, we suggest one of the following approaches to data collection:
Run continuously. The benefit to this approach is that at least in theory data always exists.
The downside is that if a component of Tivoli Storage Productivity Center goes into a bad
state, it does not always generate an alert. In these cases, data collection might stop with
only a warning, and a Simple Network Management Protocol (SNMP) alert is not
generated. In certain cases, the only obvious indication of a problem is a lack of
performance data.
Restart collection every n number of hours. In this approach, configure the collection to
run for somewhere between 24 - 168 hours. For larger environments, a significant delay
period might need to be configured between the last interval and the first interval in the
next data collection. The benefit to this approach is that data collection failures result in an
alert every time that the job fails. You can configure this alert to go to an operational
monitoring tool, such as Tivoli Enterprise Console® or IBM Tivoli Netcool/OMNIbus
operations management software. In this case, performance data loss is limited to the
configured duration. The downside to this approach is that there is always data that is
missing for a period of time as Tivoli Storage Productivity Center starts the data collection
on all devices. For large environments, this technique might not be tenable for an interval
less than 72 hours, because the start-up costs related to starting the collection on many
devices can be significant.
Job startup behavior:
– Restart collection every n hours: The job tries to connect to any NAPI to which the device is connected. If the connection fails, the job fails, and an alert, if defined, is generated. In any case, the job is scheduled to run again after n hours, so n hours of data is the maximum that you lose.
– Run continuously: The job tries to connect to any NAPI to which the device is connected. If the connection fails, the job fails, and an alert, if defined, is generated. If you do not fix the problem and restart the job manually, the job never automatically restarts, even though the problem might be temporary.
NAPI fails after a successful start of the job:
– Restart collection every n hours: The performance data collection job fails, and an alert, if defined, is generated. You lose up to n hours of information, in addition to the one hour pause, depending on when this failure happens.
– Run continuously: The job tries to reconnect to the NAPI within the defined intervals and recovers if the communication can be reestablished. But there might be situations where this recovery does not work. For example, if the NAPI is restarted, it might not be able to resume the performance collection from the device until Tivoli Storage Productivity Center restarts the data collection job.
Alerts:
– Restart collection every n hours: You get alerts for every job that fails.
– Run continuously: You get an alert only one time.
Log files:
– Restart collection every n hours: A log file is created for each scheduled run. The navigation tree shows the status of the current and past jobs, and whether the job was successful in the past.
– Run continuously: Usually, you see only a single log file. You see multiple log files only if you stop and restart the job manually.
7.3.3 Intervals
In Tivoli Storage Productivity Center, the data collection interval is referred to as the sample
interval. The sample interval for the DS8000 performance data collection tasks is from 5 - 60
minutes. A shorter sample interval results in a more granular view of performance data at the
expense of requiring additional database space. The appropriate sample interval depends on
the objective of the data collection. Table 7-5 displays example data collection objectives and
reasonable values for a sample interval.
To reduce the growth of the Tivoli Storage Productivity Center database while watching for
potential performance issues, Tivoli Storage Productivity Center can store only samples in
which an alerting threshold is reached. This skipping function is useful for SLA reporting and
longer term capacity planning.
Table 7-6 Tivoli Storage Productivity Center subsystem, controller, port, volume, and array metrics
Key DS8000 metrics Definition
Subsystem, Controller, Volume, Array, Port
Read I/O Rate (overall) Average number of read operations per second
for the sample interval.
Write I/O Rate (overall) Average number of write operations per second
for the sample interval.
Total I/O Rate (overall) Average number of read and write operations
per second for the sample interval.
Read Cache Hits Percentage Percentage of reads during the sample interval
(overall) that are found in the cache. A storage
subsystem-wide target is 50%, although this
percentage varies depending on the workload.
Write-cache Delay I/O Rate The rate of I/Os (writes) that are delayed during
the sample interval because of write cache. This
rate must be 0.
Read Data Rate Average read data rate in megabytes per second
during the sample interval.
Overall Response Time Average response time in milliseconds for all I/O
in the sample interval, including both cache hits,
as well as misses to back-end storage if
required.
Overall Transfer Size Average transfer size in kilobytes for all I/O
during the sample interval.
Backend Read I/O Rate The average read rate in reads per second
caused by read misses. This rate is the read rate
to the back-end storage for the sample interval.
Backend Write I/O Rate The average write rate in writes per second
caused by front-end write activity. This rate is the
write rate to the back-end storage for the sample
interval. These writes are logical writes, and the
actual number of physical I/O operations
depends on the type of RAID architecture.
Total Backend I/O Rate The average total back-end I/O rate in
operations per second, that is, the sum of the
Backend Read and Backend Write I/O Rates for
the sample interval.
Backend Read Data Rate Average number of megabytes per second read
from back-end storage during the sample
interval.
Backend Write Data Rate Average number of megabytes per second
written to back-end storage during the sample
interval.
Total Backend Data Rate Sum of the Backend Read and Write Data Rates
for the sample interval.
Port Send I/O Rate The average rate per second for operations that
send data from an I/O port, typically to a server.
This operation is typically a read from the server
perspective.
Port Receive I/O Rate The average rate per second for operations
where the storage port receives data, typically
from a server. This operation is typically a write
from the server perspective.
Total Port I/O Rate Average read plus write I/O rate per second at
the storage port during the sample interval.
Port Send Data Rate The average data rate in megabytes per second
for operations that send data from an I/O port,
typically to a server.
Port Receive Data Rate The average data rate in megabytes per second
for operations where the storage port receives
data, typically from a server.
Total Port Data Rate Average (read+write) data rate in megabytes per
second at the storage port during the sample
interval.
Total Port Response Time Weighted average port send and port receive
time over the sample interval.
Port Send Transfer Size Average size in kilobytes per Port Send
operation during the sample interval.
Port Receive Transfer Size Average size in kilobytes per Port Receive
operation during the sample interval.
Total Port Transfer Size Average size in kilobytes per port transfer during
the sample interval.
Component   Metric                                  Guideline   Indication
Controller  Cache Holding Time                      < 60        Indicates high cache track turnover and possibly a cache constraint.
Controller  Write Cache Delay I/O Rate              > 1         Indicates writes delayed due to insufficient memory resources.
Array       Write I/O Rate (overall)                > 250       With the RAID 5 penalty of 4 operations per write, indicates a saturated array.
Array       Total I/O Rate (overall)                > 1000      Even if all I/Os are reads, this metric indicates busy disks.
Volume      Write Cache Hits Percentage (overall)   < 100       Cache misses can indicate a busy back end or a need for additional cache.
Volume      Read I/O Rate (overall)                 N/A         Look for high rates.
Volume      Write I/O Rate (overall)                N/A         Look for high rates.
Volume      Write Response Time                     > 5         Indicates cache misses, a busy back end, and possible front-end contention.
Volume      Write Cache Delay I/O Rate              > 1         Cache misses can indicate a busy back end or a need for additional cache.
Port        Total Port I/O Rate                     > 2500      Indicates a transaction-intensive load.
Port        Total Port Data Rate                    ~= 2/4 Gb   If the port data rate is close to the bandwidth, it indicates saturation.
Port        Port Send Response Time                 > 20        Indicates contention on the I/O path from the DS8000 to the host.
Port        Port Receive Response Time              > 20        Indicates a potential issue on the I/O path or the DS8000 back end.
Port        Total Port Response Time                > 20        Indicates a potential issue on the I/O path or the DS8000 back end.
All of the reports use the metrics available for the DS8000 as described in Table 7-6 on
page 254. In the remainder of this section, we describe each of the report types in detail.
The STAT can also provide additional performance information for your DS8800/DS8700
system based on the DS8000 Easy Tier performance statistics collection by revealing data
heat information at the system and volume level in addition to configuration
recommendations. For more information about the STAT, see 6.7, “Storage Tier Advisor Tool”
on page 231.
The next topics include brief descriptions of new Tivoli Storage Productivity Center features,
such as SAN Planner, Storage Optimizer, and Tivoli Storage Productivity Center Reporter for
Disk.
Tivoli Storage Productivity Center is not an online performance monitoring tool. However, it
uses the term performance monitor for the name of the job that is set up to gather data from
a subsystem. The performance monitor is a performance data collection task. Tivoli Storage
Productivity Center collects information at certain intervals and stores the data in its
database. After inserting the data, the data is available for analysis by using several methods
that we describe in this section. Because the intervals are usually 5 - 15 minutes, Tivoli
Storage Productivity Center is not an online or real-time monitor.
You can use Tivoli Storage Productivity Center to define performance-related alerts that can
trigger an event when the defined thresholds are reached. Even though Tivoli Storage
Productivity Center works in a similar manner to a monitor without user intervention, the
actions are still performed at the intervals specified during the definition of the performance
monitor job.
Alerts
Generally, alerts are the notifications defined for different jobs. Tivoli Storage Productivity
Center creates an alert on certain conditions, for example, when a probe or scan fails. There
are various ways to be notified. Simple Network Management Protocol (SNMP) traps, IBM
Tivoli Enterprise Console or IBM Tivoli Netcool® OMNIbus events, and email are the most
common methods.
All the alerts are stored in the Alert Log, even if notification is not set up yet. This log is in the
navigation tree at IBM Tivoli Storage Productivity Center → Alerting → Alert Log.
In addition to the alerts that you set up when you define a certain job, you can also define
alerts that are not directly related to a job, but instead to specific conditions, such as a new
subsystem that is discovered. This type of alert is defined in Disk Manager → Alerting →
Storage Subsystem Alerts.
Constraints
In contrast to the alerts that are defined with a probe or a scan job, the alerts defined in the
Alerting navigation subtree are kept in a special constraint report available in the Disk
Manager → Reporting → Storage Subsystem Performance → Constraint Violation
navigation subtree. This report lists all the threshold-based alerts, which can be used to
identify hot spots within the storage environment. You can think of a constraint report as a
statistic about how often each alert is triggered in the specified time frame. To effectively use
thresholds, the analyst must be familiar with the workloads. Figure 7-11 shows all the
available constraint violations. Unfortunately, most of them are not applicable to the DS8000.
Table 7-9 shows the constraint violations applicable to the DS8000. For those constraints
without predefined values, we provide suggestions. You need to configure the exact values
appropriately for the environment. Most of the metrics that are used for constraint violations
are I/O rates and I/O throughput. It is difficult to configure thresholds based on these metrics,
because absolute threshold values depend on the hardware capabilities and the workload. It
might be perfectly acceptable for a tape backup to use the full bandwidth of the storage
subsystem ports during backup periods. If the thresholds are configured to identify a high
data rate, a threshold violation is generated. In these cases, the thresholds are exceeded, but the
information does not necessarily indicate a problem. These types of exceptions are called
“false positives.” Other metrics, such as Disk Utilization Percentage, Overall Port Response
Time, Write Cache Delay Percentage, and perhaps Cache Hold Time, tend to be more
predictive of actual resource constraints and need to be configured in every environment.
These constraints are highlighted in green in Table 7-9.
Constraint                       Warning    Critical   Comment
Total Port I/O Rate Threshold    Depends    Depends    Indicates highly active ports.
Total Port Data Rate Threshold   Depends    Depends    Indicates a highly active port.
Total I/O Rate Threshold         Depends    Depends    Difficult to use, because I/O rates vary depending on workload and configuration.
Total Data Rate Threshold (MB)   Depends    Depends    Difficult to use, because data rates vary depending on workload and configuration.
For information about the exact meaning of these metrics and thresholds, see 7.3, “Tivoli
Storage Productivity Center data collection” on page 250.
Figure 7-12 on page 263 is a diagram to illustrate the four thresholds that create five
“regions.” Stress alerts define levels that, when exceeded, trigger an alert. An idle threshold
level triggers an alert when the data value drops beneath the defined idle boundary. There
are two types of alerts for both the stress category and the idle category:
Critical Stress: No warning stress alert is created, because both (warning and critical)
levels are exceeded within the interval.
Warning Stress: It does not matter that the metric shows a lower value than in the last
interval. An alert is triggered, because the value is still above the warning stress level.
Normal workload and performance: No alerts are generated.
Warning Idle: The workload drops, and this drop might indicate a problem (does not have
to be performance-related).
Critical Idle: The same applies as for critical stress.
It is unnecessary to specify a threshold value for all levels. However, to track the growth
of your system and to monitor the usage of resources effectively, configure the
most important thresholds and also some nearby thresholds for good planning.
The predefined Tivoli Storage Productivity Center performance reports are customized
reports. The Top Volume reports show only a single metric over a specific time period. These
reports provide a way to identify the busiest volumes in the entire environment or by storage
subsystem. You can use Selection and Filter options for these reports. We describe the
Selection and Filter options in detail in 7.5.4, “Batch reports” on page 269.
By Storage Subsystem
By Controller
By Array
By Volume
By Port
If you click the drill-down icon in Figure 7-19, you get a report that contains all the volumes
that are stored on that specific array. If you click the drill-up icon, you get a performance report
at the controller level. In Figure 7-20, we show you the DS8000 components and levels to
which you can drill down. Tivoli Storage Productivity Center refers to the DS8000 processor
complexes as controllers.
4. On the Selection tab, select the date and time range, the interval, and the Subsystem
components as shown in Figure 7-23 on page 270.
Important: Do not click Selection when extracting volume data. In this case, we
suggest that you click Filter. Use the following syntax to gather volumes for only the
subsystem of interest: DS8000-2107-#######-IBM. See Figure 7-24 for an example.
Replace ####### with the seven character DS8000 storage image ID. Remember that
the number of results might be limited. By filtering, you can also obtain a greater
number of results in an interval for the specified filter.
5. To reduce the amount of data, we suggest creating a filter that requires the selected
component to contain at least 1 for the Total I/O Rate (overall) (ops/s) as shown in
Figure 7-25.
9. Another consideration is When to Run. Click When to Run to see the available options.
The default is Run Now. This option is fine for ad hoc reporting, but you might also
schedule the report to Run Once at a certain time or Run Repeatedly. This tab also
contains an option for setting the time zone for the report. The default is to use the local
time in each time zone. For more information, see 7.3.1, “Time stamps” on page 250.
10.Before running the job, configure any desired alerts in the Alert tab, which provides a
means for sending alerts if the job fails. This feature can be useful if the job is a regularly
scheduled job.
11.In order to run the batch report, immediately click the Save icon (diskette) in the toolbar as
shown in Figure 7-27.
12.When clicking the Save icon, a prompt displays Specify a Batch Report name. Enter a
name that is descriptive enough for later reference.
13.After submitting the job, it is either successful or unsuccessful. Examine the log under the
Batch Reports to perform problem determination on the unsuccessful jobs.
Reports: The location of the batch file reports is not intuitive. It is in the Tivoli Storage
Productivity Center installation directory as shown in Figure 7-28.
7.5.5 TPCTOOL
You can use the TPCTOOL command-line interface (CLI) to extract data from the Tivoli
Storage Productivity Center database. It requires no knowledge of the Tivoli Storage
Productivity Center schema or SQL query skills, but you need to understand how to use the
tool. It is not obvious. Nevertheless, it has advantages over the Tivoli Storage Productivity
Center GUI:
Multiple components
Extract information about multiple components, such as volumes and arrays by specifying
a list of component IDs. If the list is omitted, every component for which data is gathered is
returned.
Multiple metrics
The multiple metrics feature is probably the most important feature of the TPCTOOL
reporting function. Exporting data from a history chart can include data from multiple
samples for multiple components, but it is limited to a single metric type. In TPCTOOL, the
metrics are specified by the columns parameter.
The data extraction can be automated. TPCTOOL, when used in conjunction with shell
scripting, can provide an excellent way to automate the Tivoli Storage Productivity Center
data extracts. This capability can be useful for loading data into a consolidated
performance history repository for custom reporting and data correlation with other data
sources (a minimal automation sketch follows this list).
TPCTOOL can be useful if you need to create your own metrics by using supplied metrics
or counters. For example, you can create a metric that shows the access density: the I/O rate of a component divided by its usable capacity.
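As an example of the automation that is mentioned above, the following minimal sketch appends yesterday's hourly array metrics to a history file. It reuses the getrpt parameters from the example later in this section; the user ID, password, subsystem ID, and file path are placeholders, GNU date is assumed, and the sketch assumes that tpctool accepts a command in single-shot mode (otherwise, feed the command to the interactive tpctool> prompt):
# append yesterday's hourly array metrics (columns 801 and 802) to a history file
START=$(date -d "yesterday" '+%Y.%m.%d:00:00:00')
tpctool getrpt -user <USERID> -pwd <PASSWORD> -url localhost:9550 \
  -ctype array -subsys 2107.1303241+0 -level hourly \
  -start "$START" -duration 86400 -columns 801,802 >> /tmp/ds8000_array_history.txt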
TPCTOOL has one command for creating reports and several list commands (starting with
ls) for querying information needed to generate a report.
3. Determine the component type to report by using the lstype command as shown in
Figure 7-30.
5. Determine the start date and time and put it in the following format:
YYYY.MM.DD:HH:MM:SS.
6. Determine the data collection interval in seconds: 86400 (1 day).
7. Determine the summarization level: sample, hourly, or daily.
8. Run the report by using the getrpt command as shown in Figure 7-32. The command
output can be redirected to a file for analysis in a spreadsheet. The <USERID> and
<PASSWORD> variables need to be replaced with the correct values for your
environment.
tpctool> getrpt -user <USERID> -pwd <PASSWORD> -ctype array -url localhost:9550 -subsys 2107.1303241+0 -level hourly -start 2008.11.04:10:00:00 -duration 86400 -columns 801,802
Timestamp Interval Device Component 801 802
================================================================================
2008.11.04:00:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 78.97 26.31
2008.11.04:01:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 54.73 14.85
2008.11.04:02:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 43.72 11.13
2008.11.04:03:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 40.92 8.36
2008.11.04:04:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 50.92 10.03
The book titled Tivoli Storage Productivity Center Advanced Topics, SG24-7438, contains
instructions for importing TPCTOOL data into Excel. The book also provides a Visual Basic
macro that can be used to modify the time stamp to the international standard.
The lstime command is helpful, because it provides information that can be used to
determine whether performance data collection is running. It provides three fields:
Start The date and time of the start of performance data collections
Duration The number of seconds that the job ran
Option The location
In order to identify whether the performance job is still running, use the following logic:
1. Identify the start time of the last collection (2008.10.27 at 20:00:00).
2. Identify the duration (928800).
3. Add the start time to the duration (Use Excel =Sum(2008.10.27
20:00:00+(928800/86400)).
4. Compare the result to the current time. The result is 2008.11.07 at 14:00, which happens
to be the current time. This result indicates that data collection is running.
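The same check can be scripted instead of using a spreadsheet; a minimal sketch with the values from the example above (GNU date assumed):
start="2008-10-27 20:00:00"               # lstime Start field
duration=928800                           # lstime Duration field, in seconds
end=$(date -d "$start $duration seconds" '+%Y-%m-%d %H:%M:%S')
echo "Collection covers data up to: $end; current time: $(date '+%Y-%m-%d %H:%M:%S')"
If the computed end time is at or beyond the current time, the data collection is still running.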
After a plan is made, the user can select to implement the plan with the SAN Planner.
SAN Planner supports TotalStorage Enterprise Storage Server (ESS), IBM System Storage
DS6000, IBM System Storage DS8000, and IBM System Storage SAN Volume Controller.
SAN Planner supports the Space Only workload profile option for any other storage system
supported by Tivoli Storage Productivity Center.
After you create the plan recommendation, which is also called the planner output, you can
review it and choose to execute the plan recommendation. The planner can create a job to
change the environment based on the plan output. Alternatively, you can vary the input
provided to the planner to get multiple recommendations.
For additional information about SAN Planner and more details about how to use it, see
Chapter 12, "SAN Planner," in Tivoli Storage Productivity Center 4.2 Release Update,
SG24-7894.
Important: The Storage Optimizer does not modify subsystem configurations. Its primary
purpose is to provide you with a performance analysis and optimization recommendations
that you can choose to implement at your discretion. It might be useful especially if you use
Easy Tier in manual mode to move volumes manually across different extent pools and
storage tiers. However, before applying any data migrations, double-check the results and
carefully plan your data migrations according to your application performance
requirements and growth considerations.
To use the Storage Optimizer, enable analysis on your specified storage subsystems for collecting data (Disk Manager → Storage Optimizer, see Figure 7-33 on page 277). Later, you need to create an analytics report to obtain results (Analytics → Storage Optimizer, see Figure 7-34 on page 277). Export it to a PDF file, for example (click the job execution under Analytics → Storage Optimizer and configure your report).
An example Storage Optimizer report for a DS8000 system is shown in Figure 7-35 on
page 278.
An example of the detailed report is shown in Figure 7-36 on page 279. The detailed report
includes some recommendations for moving data manually across storage pools to help
improve performance and avoid hot spots.
The automatically generated report contains an overview of the subsystem information, basic
attributes of each subsystem component, a performance summary of each component,
aggregate statistics of each component, and charts that describe information about each
component instance in detail. The DS8000 component types reported are subsystem, ports,
arrays, and volumes.
4. On the next page, the software prompts you for Customer Contact Information and
IBM/Business Partner Contact Information. Because these values become part of the PDF
and are permanently stored, enter them fully, and click Next.
5. The TPC Reporter asks for the TPC for Disk database server IP address, user ID, and
password to use to connect. Enter this information as shown in Figure 7-38 on page 281,
and click Next.
6. On the next page, TPC Reporter asks which counters must be present in the PDF. You can use the volume threshold to exclude volumes with little or no IOPS activity from the report. When you use TPC Reporter with SAN Volume Controller, more counters are available. For the DS8000, unless you want to reduce the PDF size, select all counters, specify the export directory as in Figure 7-39, and click Next.
Figure 7-39 TPC Reporter: Specify counters and output path for the PDF report
Figure 7-40 TPC Reporter: Specify your disk system and the time range that you want
8. On the next page, as depicted in Figure 7-41 on page 283, click Generate Report to start
processing. After it completes, click Finish.
A report is now available in the location that was selected in the previous steps. The report contains the following information:
Subsystem Summary
Subsystem Statistics
Port Information
Port Performance Summary
Port Comparison Statistics
Port Statistics
Array Information
Array Performance Summary
Array Comparison Statistics
Array Statistics
Volume Information
Volume Performance Summary
Volume Comparison Statistics
TPC Reporter for Disk is an excellent way to generate regular performance health check reports, especially for ports and arrays. You might want to exclude, for example, the volume information to make the report shorter if this information is not necessary for deeper analysis.
For long-term observation, the TPC Reporter can provide a quick view of how the performance situation on the system develops. Keep the PDFs for historical reference. Also, configuration information is included, for instance, the number and type of disk arrays currently installed and their RAID formats.
Figure 7-42 on page 284 and Figure 7-43 on page 284 are examples.
For additional information about monitoring performance through a SAN switch or director
point product, see these websites:
http://www.ibm.com/systems/networking/
http://www.brocade.com
http://www.cisco.com
Most SAN management software includes options to create SNMP alerts based on
performance criteria, and to create historical reports for trend analysis. Certain SAN vendors
offer advanced performance monitoring capabilities, such as measuring I/O traffic between
specific pairs of source and destination ports, and measuring I/O traffic for specific LUNs.
In addition to the vendor point products, the Tivoli Storage Productivity Center Fabric
Manager can be used as a central data repository and reporting tool for switch environments.
It lacks real-time capabilities, but Tivoli Storage Productivity Center Fabric Manager collects
and reports on data at a 5 - 60 minute interval for later analysis.
The Fabric Manager of Tivoli Storage Productivity Center provides facilities to report on fabric
topology, configuration and configuration changes, and switch and port performance and
errors. In addition, you can use it to configure alerts or constraints for Total Port Data Rate
and Total Port Packet Rate. Configuration options allow the creation of events to be triggered
if the constraints are exceeded. Although Tivoli Storage Productivity Center does not provide
real-time monitoring, it offers several advantages over traditional vendor point products:
Ability to store performance data from multiple switch vendors in a common database
Advanced reporting and correlation between host data and switch data through custom
reports
Centralized management and reporting
Aggregation of port performance data for the entire switch
The first example configuration, which is shown in Figure 7-44 on page 286, has host server Host_1 that connects to DS8000_1 through two SAN switches or directors (SAN Switch/Director_1 and SAN Switch/Director_2). There is a single inter-switch link (ISL) between the two switches.
Figure 7-44 Host_1 connected to DS8000_1 through SAN Switch/Director_1 and SAN Switch/Director_2, with a single ISL between the switches
A second type of configuration in which SAN statistics can be useful is shown in Figure 7-45
on page 287. In this configuration, host bus adapters or channels from multiple servers
access the same set of I/O ports on the DS8000 (server adapters 1 - 4 share access to the
DS8000 I/O ports 5 and 6). In this environment, the performance data available from only the
host server or only the DS8000 might not be enough to confirm load balancing or to identify
the contributions of each server to I/O port activity on the DS8000, because more than one
host is accessing the same DS8000 I/O ports.
If DS8000 I/O port 5 is highly utilized, it might not be clear whether Host_A, Host_B, or both
hosts are responsible for the high utilization. Taken together, the performance data available
from Host_A, Host_B, and the DS8000 might be enough to isolate the contribution of each
server connection to I/O port utilization on the DS8000. However, the performance data
available from the SAN switch or director might make it easier to see load balancing and
relationships between I/O traffic on specific host server ports and the DS8000 I/O ports at a
glance. The performance data can provide real-time utilization and traffic statistics for both
host server SAN ports and DS8000 SAN ports in a single view, with a common reporting
interval and metrics. The Tivoli Storage Productivity Center Fabric Manager can be used for
analysis of historical data, but it does not collect data in real time.
Figure 7-45 Host_A and Host_B (adapters 1 - 4) sharing DS8000 I/O ports 5 and 6 through a SAN switch, director, or fabric
SAN statistics can also be helpful in isolating the individual contributions of multiple DS8000s
to I/O performance on a single server. In Figure 7-46 on page 288, host bus adapters or
channels 1 and 2 from a single host (Host_A) access I/O ports on multiple DS8000s (I/O
ports 3 and 4 on DS8000_1 and I/O ports 5 and 6 on DS8000_2).
In this configuration, the performance data available from either the host server or from the
DS8000 might not be enough to identify the contribution of each DS8000 to adapter activity
on the host server, because the host server is accessing I/O ports on multiple DS8000s. For
example, if adapters on Host_A are highly utilized or if I/O delays are experienced, it might not
be clear whether this situation is due to traffic that is flowing between Host_A and DS8000_1,
between Host_A and DS8000_2, or between Host_A and both DS8000_1 and DS8000_2.
The performance data available from the host server and from both DS8000s can be used
together to identify the source of high utilization or I/O delays. Additionally, you can use the
Tivoli Storage Productivity Center Fabric Manager or vendor point products to gather
performance data for both host server SAN ports and DS8000 SAN ports.
Figure 7-46 Host_A (adapters 1 and 2) accessing I/O ports 3 and 4 on DS8000_1 and I/O ports 5 and 6 on DS8000_2 at the primary and secondary sites
You must check SAN statistics to determine whether there are SAN bottlenecks that limit DS8000 I/O traffic. You can also use SAN link utilization or throughput statistics to break down the I/O activity contributed by adapters on different host servers to shared storage subsystem I/O ports.
It is not only the formal bandwidth that can be exceeded (for example, the bandwidth percentages are preselected to a 75% Warning level and an 85% Critical level). In bad-quality SAN connections, many retries and resent packets can also cause performance degradations that we can track.
Predefined (Total Switch Port Data Rate): Graph individual switch port metrics.
Predefined (Total Switch Port Packet Rate): Graph individual switch port metrics.
Ad hoc: Create a line chart with up to 10 ports and any supported metric. Useful for identifying port hot spots over time.
Batch: Export port performance data. Useful for exporting data for analysis in spreadsheet software.
TPCTOOL: Command-line tool for extracting data from TPC. Extract data for analysis in spreadsheet software; can be automated.
Custom: Create custom queries by using BIRT. Useful for creating reports not available in Tivoli Storage Productivity Center.
The process of using Tivoli Storage Productivity Center Fabric Manager to create reports is similar to the process that is used to create reports in Tivoli Storage Productivity Center Disk Manager.
Table 7-11 Tivoli Storage Productivity Center for Fabric Manager metrics
Metric Definition
Port Send Packet Rate Average number of packets per second for send operations, for a
particular port during the sample interval
Port Receive Packet Rate Average number of packets per second for receive operations, for
a particular port during the sample interval
Total Port Packet Rate Average number of packets per second for send and receive
operations, for a particular port during the sample interval
Port Send Data Rate Average number of mebibytes (2^20 bytes) per second that were transferred for send (write) operations, for a particular port during the sample interval
Port Receive Data Rate Average number of mebibytes (2^20 bytes) per second that were transferred for receive (read) operations, for a particular port during the sample interval
Total Port Data Rate Average number of mebibytes (2^20 bytes) per second that were transferred for send and receive operations, for a particular port during the sample interval
Port Peak Send Data Rate Peak number of mebibytes (2^20 bytes) per second that were sent by a particular port during the sample interval
Port Peak Receive Data Rate Peak number of mebibytes (2^20 bytes) per second that were received by a particular port during the sample interval
Port Send Packet Size Average number of KB sent per packet by a particular port during
the sample interval
Port Receive Packet Size Average number of KB received per packet by a particular port
during the sample interval
Overall Port Packet Size Average number of KB transferred per packet by a particular port
during the sample interval
Error Frame Rate The average number of frames per second that were received in
error during the sample interval
Dumped Frame Rate The average number of frames per second that were lost due to a
lack of available host buffers during the sample interval
Link Failure Rate The average number of link errors per second during the sample
interval
Loss of Sync Rate The average number of times per second that synchronization was
lost during the sample interval
Loss of Signal Rate The average number of times per second that the signal was lost
during the sample interval
CRC Error Rate The average number of frames received per second in which the
cyclic redundancy check (CRC) in the frame did not match the
CRC computed by the receiver during the sample interval
Port Send Bandwidth Percentage The approximate bandwidth utilization percentage for send operations by a port, based on its current negotiated speed
Port Receive Bandwidth Percentage The approximate bandwidth utilization percentage for receive operations by this port, based on its current negotiated speed
Overall Port Bandwidth Percentage The approximate bandwidth utilization percentage for send and receive operations by this port
The most important metrics for determining whether a SAN bottleneck exists are the Total
Port Data Rate, and the Port Bandwidth Percentages (Send, Receive, Overall).
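Because the data rates are reported in mebibytes per second, a bandwidth percentage can also be approximated by hand. The following awk one-liner is only a sketch with assumed sample values (300 MiB/s send, 150 MiB/s receive) and assumes roughly 800 MBps of usable payload on an 8 Gbps Fibre Channel link:
awk 'BEGIN { send=300; recv=150; link_MBps=800;   # MiB/s from the TPC report, link payload in MB/s
printf "Overall port bandwidth: %.1f%%\n", (send + recv) * 1.048576 / link_MBps * 100 }'
A result at or above the preselected 75% warning level indicates that the port is a likely contention point.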
The flowchart stages are problem definition, classification (hardware or configuration issue, host issue), identification of the hot or slow resource (host disks, storage bottleneck, or switch bottleneck) and fixing it, and validation, expanding the scope if needed.
Figure 7-50 I/O performance analysis process
I/O bottlenecks as referenced in this section relate to one or more components on the I/O
path that reached a saturation point and can no longer achieve the I/O performance
requirements. I/O performance requirements are typically throughput-oriented or
transaction-oriented. Heavy sequential workloads, such as tape backups or data warehouse
environments, might require maximum bandwidth and use large sequential transfers.
However, they might not have stringent response time requirements. Transaction-oriented workloads, such as online banking systems, might have stringent response time requirements but comparatively modest throughput requirements.
Figure 7-51 Data sources: Tivoli Storage Productivity Center Fabric Manager collects switch and port data, and Tivoli Storage Productivity Center Disk Manager collects DS8000 controller, array, and volume data
As shown in Figure 7-51, Tivoli Storage Productivity Center does not provide host
performance, configuration, or error data. Tivoli Storage Productivity Center Fabric Manager
provides performance and error log information about SAN switches. Tivoli Storage
Productivity Center Disk Manager provides the DS8000 storage performance and
configuration information.
Tip: Performance analysis and troubleshooting must always proceed top-down: start with the application (for example, database design and layout), then the operating system, server hardware, and SAN, and then the storage. The tuning potential is greater at the “higher” levels. The best I/O tuning is the tuning that is never carried out, because server caching or a better database design eliminated the need for it.
Process assumptions
This process assumes that the following conditions exist:
The server is connected to the DS8000 natively.
Tools exist to collect the necessary performance and configuration data for each
component along the I/O path (server disk, SAN fabric, and the DS8000 arrays, ports, and
volumes).
Skills exist to use the tools, extract data, and analyze data.
Data is collected in a continuous fashion to facilitate performance management.
Process flow
The order in which you conduct the analysis is important. We suggest the following process:
1. Define the problem. A sample questionnaire is provided in “Sample questions for an AIX
host” on page 638. The goal is to assist in determining the problem background and
understand how the performance requirements are not being met.
Changes: Before proceeding any further, ensure that adequate discovery is pursued to
identify any changes that occurred in the environment. In our experience, there is a
significant correlation between changes in the environment and sudden “unexpected”
performance issues.
Physical component: If you notice significant errors in the datapath query device or pcmpath query device output, and the errors increase, there is likely a problem with a physical component on the I/O path (see the command sketch at the end of this step list).
b. Gather the host error report and look for Small Computer System Interface (SCSI) or
FIBRE errors.
Hardware: Often a hardware error that relates to a component on the I/O path
shows as a TEMP error. A TEMP error does not exclude a hardware failure. You must
perform diagnostics on all hardware components in the I/O path, including the host
bus adapter (HBA), SAN switch ports, and the DS8000 HBA ports.
c. Gather the SAN switch configuration and errors. Every switch vendor provides different
management software. All of the SAN switch software provides error monitoring and a
way to identify whether there is a hardware failure with a port or application-specific
integrated circuit (ASIC). For more information about identifying hardware failures, see
your vendor-specific manuals or contact vendor support.
Patterns: As you move from the host to external resources, remember any patterns.
A common error pattern that you see involves errors that affect only those paths on
the same HBA. If both paths on the same HBA experience errors, the errors are a
result of a common component. The common component is likely to be the host
HBA, the cable from the host HBA to the SAN switch, or the SAN switch port.
Ensure that all of these components are thoroughly reviewed before proceeding.
d. If errors exist on one or more of the host paths, determine whether there are any
DS8000 hardware errors. Log on to the HMC as customer/cust0mer and look to ensure
that there are no hardware alerts. Figure 7-52 provides a sample of a healthy DS8000.
If there are any errors, you might need to open a problem ticket (PMH) with DS8000
hardware support (2107 engineering).
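The following commands are a sketch of how the path error and host error information can be gathered on an AIX host; the grep filter is illustrative, and only the multipathing command that matches the installed driver (SDD or SDDPCM) applies:
datapath query device                  # SDD: nonzero values in the Errors column point to path problems
pcmpath query device                   # SDDPCM equivalent of the command above
errpt | grep -iE "scsi|fscsi|fibre"    # host error report entries that relate to SCSI or Fibre Channel components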
In addition to the disk response time and disk queuing data, gather the disk activity rates,
including read I/Os, write I/Os, and total I/Os, because they show which disks are active:
a. Gather performance data as shown in Table 7-12.
I/O-intensive disks: The number of total I/Os per second indicates the relative
activity of the device. This relative activity provides a metric to prioritize the analysis.
Those devices with high response times and high activity are more important to
understand than devices with high response time and infrequent access. If
analyzing the data in a spreadsheet, consider creating a combined metric of Average I/Os × Average Response Time to provide a method for identifying the most I/O-intensive disks. You can obtain additional detail about OS-specific server
analysis in the OS-specific chapters.
Multipathing: Ensure that multipathing works as designed. For example, if there are
two paths zoned per HBA to the DS8000, there must be four active paths per LUN.
Both SDD and SDDPCM use an active/active configuration of multipathing, which means that traffic flows across all the paths fairly evenly. For native DS8000
connections, the absence of activity on one or more paths indicates a problem with
the SDD behavior.
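A quick path count check can be made with the multipathing commands. This is only a sketch; the device numbers are illustrative, and only the command for the installed driver applies:
datapath query device 0     # SDD: with two paths zoned per HBA and two HBAs, expect four OPEN paths
pcmpath query device 0      # SDDPCM: all paths should be OPEN with comparable Select counts
lspath -l hdisk4            # AIX MPIO view: every path should show Enabled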
c. Format the data and correlate the host LUNs with their associated DS8000 resources.
Formatting the data is not required for analysis, but it is easier to analyze formatted
data in a spreadsheet.
The following steps represent the logical steps that are required to format the data and
do not represent literal steps. You can codify these steps in scripts. You can obtain
examples of these scripts in Appendix E, “Post-processing scripts” on page 677:
i. Read the configuration file.
Analyze the DS8000 performance data first: Analysis of the SAN fabric and the
DS8000 performance data can be completed in either order. However, SAN
bottlenecks occur less frequently than disk bottlenecks, so it can be more efficient to
analyze the DS8000 performance data first.
b. Use Tivoli Storage Productivity Center to gather the DS8000 performance data for
subsystem ports, arrays, and volumes. Compare the key performance indicators from
Table 7-7 on page 257 with the performance data. Follow these steps to analyze the
performance:
Compare the same time period: Meaningful correlation with the host
performance measurement and the previously identified hot LUNs requires
analysis of the DS8000 performance data for the same time period that the host
data was collected. For more information about time stamps, see 7.3.1, “Time
stamps” on page 250.
ii. Correlate the hot LUNs with their associated disk arrays. When using the Tivoli
Storage Productivity Center GUI, the relationships are provided automatically in the
drill-down feature. If you use batch exports and want to correlate the volume data to the rank data, you can do so manually or by using a script. If multiple ranks per extent pool with storage pool striping, or Easy Tier managed pools, are used, one volume can exist on multiple ranks.
iii. Analyze storage subsystem ports for the ports associated with the server in
question.
6. Continue the identification of the root cause by collecting and analyzing SAN fabric
configuration and performance data:
a. Gather the connectivity information and establish a visual diagram of the environment.
If you use the Tivoli Storage Productivity Center Fabric Manager, you can use the
Topology Viewer to quickly create a visual representation of your SAN environment as
shown in Figure 7-49 on page 292.
Visualize the environment: Sophisticated tools are not necessary for creating this
type of view; however, the configuration, zoning, and connectivity information must
be available to create a logical visual representation of the environment.
b. Gather the SAN performance data. Each vendor provides SAN management
applications that provide the alerting capability and some level of performance
management. Often, the performance management software is limited to real-time
monitoring and historical data collection features require additional licenses. In addition
to the vendor-provided solutions, Tivoli Storage Productivity Center Fabric Manager
can collect further metrics that are shown in Table 7-11 on page 291.
c. Consider graphing the Overall Port Response Time, Port Bandwidth Percentage, and
Total Port Data Rate metrics to determine whether any of the ports along the I/O path
are saturated during the time when the response time is degraded. If the Total Port Data Rate is close to the maximum expected throughput for the link, or the bandwidth percentages exceed their thresholds, this situation is likely a contention point. You
can add additional bandwidth to mitigate this type of issue either by adding additional
links or by adding faster links. Adding links might require upgrades of the server HBAs
and the DS8000 host adapters to take advantage of the additional switch link capacity.
Problem definition
The application owner complains of poor response time for transactions during certain times
of the day.
Problem classification
There are no hardware errors, configuration issues, or host performance constraints.
Identification
Figure 7-54 shows the average read response time for a Windows Server 2008 server that
performs a random workload in which the response time increases steadily over time.
The chart plots the average read response time in milliseconds for Disk1 through Disk6 at 1-minute intervals.
Figure 7-54 Windows Server perfmon average physical disk read response time
A second chart plots Disk Reads/sec for Disk1 through Disk6 at 1-minute intervals over the same period.
As described in 7.7, “End-to-end analysis of I/O performance problems” on page 292, there
are several possibilities for high average disk read response time:
DS8000 array contention
DS8000 port contention
SAN fabric contention
Host HBA saturation
Because the most probable reason for the elevated response times is the disk utilization on
the array, gather and analyze this metric first. Figure 7-56 on page 302 shows the disk
utilization on the DS8000.
The chart plots the disk utilization percentage at 5-minute intervals for the components 2107.75GB192-5, -6, -13, -14, -21, and -22.
Recommend changes
We recommend that you add volumes on additional disks. For environments where host
striping is configured, you might need to re-create the host volumes to spread the I/O from an
existing workload across the new volumes.
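As a sketch only, the additional volumes can be created in a less used extent pool with the DSCLI and added to the existing volume group; the extent pool ID, volume IDs, and volume group ID below are illustrative:
dscli> lsextpool                                    # identify an extent pool with free capacity and low utilization
dscli> mkfbvol -extpool P4 -cap 100 1200-1203       # create four new volumes of capacity 100 in extent pool P4
dscli> chvolgrp -action add -volume 1200-1203 V11   # make them visible to the host through its volume group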
Problem definition
The online transactions for a Windows Server 2008 SQL server appear to take longer than
normal and time out in certain cases.
Problem classification
After reviewing the hardware configuration and the error reports for all hardware components,
we determined that there are errors on the paths associated with one of the host HBAs, as
shown in Figure 7-57 on page 303. This output shows the errors on path 0 and path 1, which
are both on the same HBA (SCSI port 1). For a Windows Server 2008 server that runs
SDDDSM, additional information about the host adapters is available through the gethba.exe
command. The command that you use to identify errors depends on the multipathing software
installation.
Disabling a path: In the cases where there is a path with significant errors, you can
disable the path with the multipathing software, which allows the non-working paths to be
disabled without causing performance degradation to the working paths. With SDD, disable
the path by using datapath set device # path # offline.
The chart plots the throughput in KB/sec at 1-minute intervals for the Dev Disk4 through Dev Disk7 and Production Disk1, Disk2, and Disk5 volumes.
The DS8000 port data reveals a peak throughput of around 300 MBps per 4-Gbps port.
The chart plots the total MB/sec at 5-minute intervals for DS8000 I/O ports R1-I3-C4-T0 and R1-I3-C1-T0.
Before beginning the diagnostic process, you must understand your workload and your
physical configuration. You need to know how your system resources are allocated, as well as
understand your path and channel configuration for all attached servers.
Assume that you have an environment with a DS8000 attached to a z/OS host, an AIX Power
Systems host, and several Windows Server 2008 hosts. You noticed that your z/OS online
users experience a performance degradation between 07:30 and 08:00 hours each morning.
You might notice that there are 3390 volumes that indicate high disconnect times, or high
device busy delay time for several volumes in the RMF device activity reports. Unlike UNIX or Windows Server 2008, z/OS reports the response time and its breakdown into connect, disconnect, pending, and IOS queuing (IOSQ) time.
Device busy delay is an indication that another system has the volume reserved, or that an extent conflict occurs among z/OS hosts, or among applications in the same host, when Parallel Access Volumes (PAVs) are used. The DS8000 multiple allegiance and PAV capabilities allow it to process multiple I/Os against the same volume at the same time. However, if a read or write request against an extent is pending while another I/O is writing to the extent, or if a write request against an extent is pending while another I/O is reading or writing data from the extent, the DS8000 delays the I/O by queuing. This condition is referred to as an extent conflict. Queuing time due to extent conflicts is accumulated in the device busy (DB) delay time. An extent is a sphere of access; the unit of increment is a track. Usually, I/O drivers or system routines decide and declare the sphere.
To determine the possible cause of high disconnect times, check the read cache hit ratios,
read-to-write ratios, and bypass I/Os for those volumes. If you see that the cache hit ratio is
lower than usual and you did not add other workload to your System z environment, I/Os
against Open Systems FB volumes might be the cause of the problem. Possibly, FB volumes
that are defined on the same server have a cache-unfriendly workload, thus affecting the hit ratio of your System z volumes.
To get more information about cache usage, you can check the cache statistics of the FB
volumes that belong to the same server. You might be able to identify the FB volumes that
have a low read hit ratio and a short cache holding time. Moving the workload of these Open Systems logical disks, or the System z CKD volumes about which you are concerned, to the other processor complex (server) might improve the cache hit ratio.
The approaches that use IBM Tivoli Storage Productivity Center for Disk as described in this
chapter might not cover all the possible situations that you can encounter. You might need to
include more information, such as application and host operating system-based performance
statistics, the STAT reports, or other data collections to analyze and solve a specific
performance problem. But if you have a basic understanding of how to interpret the DS8000
performance reports and how the DS8000 works, you can develop your own ideas about how
to correlate the DS8000 performance reports with other performance measurement tools
when you approach specific situations in your production environment.
Part 3 Performance
considerations for host
systems
This part provides performance considerations for various host systems or appliances that
are attached to the IBM System Storage DS8000 system.
The DS8800 4-way standard model supports a maximum of 16 FCP/FICON host adapters
with four or eight ports each, so a maximum of 128 FCP/FICON ports are supported. The
DS8800 2-way standard and Business Class models support a maximum of four FCP/FICON
host adapters with four or eight ports each, so a maximum of 32 FCP/FICON ports are
supported. All the ports can be intermixed and independently configured. The DS8800 host
adapters support 2, 4, or 8 Gbps link speed, but 1 Gbps is no longer supported. Enterprise
Systems Connection (ESCON®) adapters are not supported in DS8800.
The DS8000 can support host systems and remote mirroring links by using Peer-to-Peer
Remote Copy (PPRC) on the same I/O port. However, we advise that you use dedicated I/O
ports for remote mirroring links.
Planning and sizing the host adapters for performance are not easy tasks and we strongly
suggest that you use modeling tools, such as Disk Magic (see 6.1, “Disk Magic” on
page 176). The factors that might affect the performance at the host adapter level are typically
the aggregate throughput and the workload mix that the adapter can handle. All connections
on a host adapter share bandwidth in a balanced manner. Therefore, host attachments that
require maximum I/O port performance must be connected to HAs that are not fully
populated. You must allocate host connections across I/O ports, host adapters, and I/O
enclosures in a balanced manner (workload spreading).
No SCSI: There is no direct Small Computer System Interface (SCSI) attachment support
for the DS8000.
We describe recommendations for implementing a switched fabric in more detail in the next
section.
If a host adapter fails and starts logging in and out of the switched fabric, or a server must be
rebooted several times, you do not want it to disturb the I/O to other hosts. Figure 8-1 on
page 315 shows zones that include only a single host adapter and multiple DS8000 ports
(single initiator zone). This approach is the suggested way to create zones to prevent
interaction between server host adapters.
Tip: Each zone contains a single host system adapter with the desired number of ports
attached to the DS8000.
By establishing zones, you reduce the possibility of interactions between system adapters in
switched configurations. You can establish the zones by using either of two zoning methods:
Port number
Worldwide port name (WWPN)
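As an illustration of WWPN-based single-initiator zoning, the following Brocade Fabric OS commands create and enable one zone for host adapter FC0. This is a sketch only: the WWPNs and configuration name are placeholders, the zone configuration must already exist (otherwise use cfgcreate), and other switch vendors use different commands:
zonecreate "Zone_A_FC0", "10:00:00:00:c9:2e:12:34; 50:05:07:63:0a:00:12:34; 50:05:07:63:0a:40:12:34"
cfgadd "PROD_CFG", "Zone_A_FC0"
cfgenable "PROD_CFG"
The first WWPN represents the host adapter, and the other two represent DS8000 host adapter ports.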
You can configure switch ports that are attached to the DS8000 in more than one zone, which
enables multiple host system adapters to share access to the DS8000 host adapter ports.
Shared access to a DS8000 host adapter port might be from host platforms that support a
combination of bus adapter types and operating systems.
LUN masking
In Fibre Channel attachment, logical unit number (LUN) affinity is based on the worldwide
port name (WWPN) of the adapter on the host, which is independent of the DS8000 host
adapter port to which the host is attached. This LUN masking function on the DS8000 is
provided through the definition of DS8000 volume groups. A volume group is defined by using
the DS Storage Manager or DS8000 command-line interface (DSCLI), and host WWPNs are
connected to the volume group. The LUNs to be accessed by the hosts that are connected to
the volume group are defined to reside in that volume group.
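A minimal DSCLI sketch of this volume group-based LUN masking follows; the volume IDs, WWPN, host type, and names are illustrative, and V20 stands for the volume group ID that mkvolgrp returns:
dscli> mkvolgrp -type scsimask -volume 1000-100F AIX_prod_vg
dscli> mkhostconnect -wwname 10000000C9123456 -hosttype pSeries -volgrp V20 AIX_prod_fc0
dscli> lshostconnect -l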
Although it is possible to limit through which DS8000 host adapter ports a certain WWPN
connects to volume groups, we suggest that you define the WWPNs to have access to all
available DS8000 host adapter ports. Then, by using the recommended process of creating
Fibre Channel zones as discussed in “Importance of establishing zones” on page 313, you
can limit the desired host adapter ports through the Fibre Channel zones. In a switched fabric
with multiple connections to the DS8000, this concept of LUN affinity enables the host to see
the same LUNs on different paths.
The number of times that a DS8000 logical disk is presented as a disk device to an open host
depends on the number of paths from each host adapter to the DS8000. The number of paths
from an open server to the DS8000 is determined by these factors:
The number of host adapters installed in the server
The number of connections between the SAN switches and the DS8000
The zone definitions created by the SAN switch software
Physical paths: Each physical path to a logical disk on the DS8000 is presented to the
host operating system as a disk device.
By cabling the SAN components and creating zones as shown in Figure 8-1 on page 315,
each logical disk on the DS8000 is presented to the host server four times, because there are
four unique physical paths from the host to the DS8000. As you can see in the figure, Zone A shows that FC0 has access through DS8000 host ports I0000 and I0130. Zone B shows that FC1 has access through DS8000 host ports I0230 and I0300.
Figure 8-1 Single-initiator zoning: FC0 zoned to DS8800 ports I0000 and I0130 through one SAN switch (Zone A), and FC1 zoned to ports I0230 and I0300 through the other SAN switch (Zone B)
You can see how the number of logical devices presented to a host can increase rapidly in a
SAN environment if you are not careful about selecting the size of logical disks and the
number of paths from the host to the DS8000.
Typically, we suggest that you cable the switches and create zones in the SAN switch
software for dual-attached hosts so that each server host adapter has two to four paths from
the switch to the DS8000. With hosts configured this way, you can allow the multipathing
module to balance the load across the four host adapters in the DS8000.
Zoning more paths, such as eight connections from the host to the DS8000, generally does
not improve SAN performance and causes twice as many devices to be presented to the
operating system.
8.2.3 Multipathing
Multipathing describes a technique to attach one host to an external storage device through
more than one path. Multipathing can improve fault-tolerance and the performance of the
overall system, because the fault of a single component in the environment can be tolerated
without an impact to the host. Also, you can increase the overall system bandwidth, which
positively influences the performance of the system.
As illustrated in Figure 8-2 on page 316, attaching a host system by using a single-path connection implements a solution that depends on several single points of failure. In this example, a single link failure, either between the host system and the switch or between the switch and the DS8000, interrupts the host system's access to its data.
Figure 8-2 Single-path attachment: the host adapter, the SAN switch, and the DS8000 host port (I0001) are each a single point of failure on the path to the logical disk
Adding additional paths requires you to use multipathing software (Figure 8-3 on page 317). Otherwise, the operating system handles the same LUN behind each path as a separate disk, which does not allow failover support.
Multipathing provides the DS8000 attached Open Systems hosts that run Windows, AIX,
HP-UX, Oracle Solaris, or Linux with these capabilities:
Support for several paths per LUN
Load balancing between multiple paths when there is more than one path from a host
server to the DS8000. This approach might eliminate I/O bottlenecks that occur when
many I/O operations are directed to common devices via the same I/O path, thus
improving the I/O performance.
Automatic path management, failover protection, and enhanced data availability for users
that have more than one path from a host server to the DS8000. It eliminates a potential
single point of failure by automatically rerouting I/O operations to the remaining active
paths from a failed data path.
Dynamic reconfiguration after changes to the configuration environment, including zoning, LUN masking, and adding or removing physical paths.
Figure 8-3 Multipath attachment: the host reaches the same LUN through DS8000 host ports I0001 and I0131
Important: Do not intermix several multipathing solutions within one host system. Usually,
the multipathing software solutions cannot coexist.
The Subsystem Device Driver can operate under different modes or configurations:
Concurrent data access mode: A system configuration where simultaneous access to data
on common LUNs by more than one host is controlled by system application software.
Examples are Oracle Parallel Server or file access software that can handle address
conflicts. The LUN is not involved in access resolution.
Non-concurrent data access mode: A system configuration where there is no inherent
system software control of simultaneous accesses to the data on a common LUN by more
than one host. Therefore, access conflicts must be controlled at the LUN level by a
hardware-locking facility, such as Small Computer System Interface (SCSI)
Reserve/Release.
The IBM Subsystem Device Driver does not support booting from or placing a system
primary paging device on an SDD pseudo device.
For certain servers that run AIX, booting off the DS8000 is supported. In that case, LUNs
used for booting are manually excluded from the SDD configuration by using the querysn
command to create an exclude file.
For more information about installing and using SDD, see IBM System Storage Multipath
Subsystem Device Driver User’s Guide, GC52-1309. This publication and other information
are available at this website:
http://www.ibm.com/servers/storage/support/
The policy that is specified for the device determines the path that is selected to use for an I/O
operation. The following policies are available:
Load balancing (default). The path to use for an I/O operation is chosen by estimating the
load on the adapter to which each path is attached. The load is a function of the number of
I/O operations currently in process. If multiple paths have the same load, a path is chosen
at random from those paths.
Round-robin. The path to use for each I/O operation is chosen at random from those paths
not used for the last I/O operation. If a device has only two paths, SDD alternates between
the two paths.
Failover only. All I/O operations for the device are sent to the same (preferred) path until
the path fails because of I/O errors. Then, an alternate path is chosen for later I/O
operations.
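The policy can be displayed and changed with the datapath command. The following lines are a sketch; the device number is illustrative:
datapath query device 3            # the POLICY field shows the current selection policy
datapath set device 3 policy lb    # load balancing (default)
datapath set device 3 policy rr    # round robin
datapath set device 3 policy fo    # failover only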
Normally, path selection is performed on a global rotating basis; however, the same path is
used when two sequential write operations are detected.
Single-path mode
SDD does not support concurrent download and installation of the Licensed Machine Code
(LMC) to the DS8000 if hosts use a single-path mode. However, SDD supports a single-path
Fibre Channel connection from your host system to a DS8000. It is possible to create a
volume group or a vpath device with only a single path.
Important: With a single-path connection, which we do not advise, the SDD cannot
provide failure protection and load balancing.
When a path failure occurs, the IBM SDD automatically reroutes the I/O operations from the
failed path to the other remaining paths. This action eliminates the possibility of a data path
becoming a single point of failure.
Multipath I/O
Multipath I/O (MPIO) summarizes native multipathing technologies that are available in
several operating systems, such as AIX, Linux, and Windows. Although the implementation
differs for each of the operating systems, the basic concept is almost the same:
The multipathing module is delivered with the operating system.
The multipathing module supports failover and load balancing for standard SCSI devices,
such as simple SCSI disks or SCSI arrays.
To add device-specific support and functions for a specific storage device, each storage
vendor might provide a device-specific module that implements advanced functions for
managing the specific storage device.
IBM currently provides a device-specific module for the DS8000 for AIX, Linux, and Windows
according to the information in Table 8-1.
For example, Symantec provides an alternative to the IBM provided multipathing software.
The Veritas Volume Manager (VxVM) relies on the Microsoft implementation of MPIO and
Device Specific Modules (DSMs) that rely on the Storport driver. The Storport driver is not
available for all versions of Windows.
Check the System Storage Interoperation Center (SSIC) website for your specific hardware
configuration:
http://www.ibm.com/systems/support/storage/config/ssic/
FCP: z/VM, z/VSE, and Linux for System z can also be attached to the DS8000 series with
FCP.
8.3.1 FICON
FICON is a Fibre Connection used with System z servers. Each storage unit host adapter has
either four or eight ports, and each port has a unique WWPN. You can configure the port to
operate with the FICON upper layer protocol. When configured for FICON, the storage unit
provides the following configurations:
Either fabric or point-to-point topology
A maximum of 128 host ports for DS8800 4-way model and a maximum of 32 host ports
for DS8800 2-way model (either Standard or Business Class)
A maximum of 2048 logical paths on each Fibre Channel port
Access to all 255 control unit images (65280 count key data (CKD) devices) over each
FICON port. The connection speeds are 200, 400, and 800 MB/s, which is similar to Fibre
Channel for Open Systems.
IBM introduced FICON channels in the IBM 9672 G5 and G6 servers with the capability to run
at 1 Gbps. Since that time, IBM introduced several generations of FICON channels. The
FICON Express8S channels make up the latest generation of FICON channels. They are
designed to support 8 Gbps link speeds and can also auto-negotiate to 2 or 4 Gbps link
speeds. These speeds depend on the capability of the director or control unit port at the other
end of the link. Operating at 8 Gbps speeds, FICON Express8S channels are designed to
achieve up to 620 MBps for a mix of large sequential read and write I/O operations, as
depicted in the following charts. Figure 8-4 on page 321 shows a comparison of the overall
throughput capabilities of various generations of channel technology.
The chart compares FICON (1 Gbps), FICON Express (1 - 2 Gbps), FICON Express2 (2 Gbps), FICON Express4 (4 Gbps), and later channels across zSeries, z9, z10, z114, and z196 servers.
The FICON Express8S channel on the IBM System zEnterprise 196 and z114 represents a
significant improvement in maximum bandwidth capability compared to FICON Express4
channels and previous FICON offerings. The response time improvements are expected to be
noticeable for large data transfers. The speed at which data moves across an 8 Gbps link is effectively 800 MBps, compared to 400 MBps with a 4 Gbps link.
As shown in Figure 8-5 on page 322, the maximum number of I/Os per second (IOPS)
measured on a FICON Express8S channel that runs an I/O driver benchmark with a 4 KB per
I/O workload is approximately 23000 IOPS. This maximum is more than 10% greater than the
maximum number of I/Os measured with a FICON Express8 channel. The greater
performance capabilities of the FICON Express8S channel make it a good match with the
performance characteristics of the new DS8000 host adapters.
The chart compares the maximum 4 KB I/O rates of FICON Express2 (2 Gbps), FICON Express4 (4 Gbps), FICON Express8, and FICON Express8S channels across server generations from G5/G6 through z10, z114, and z196.
Support: The FICON Express8S SX and LX are supported on zEnterprise 196 and z114
servers only. FICON Express8 SX and LX are available on z10, zEnterprise 196, and z114
servers.
The System zEnterprise 196 and System z114 servers offer FICON Express8S SX and LX
features that have two independent channels. Each feature occupies a single I/O slot and
uses one CHPID per channel. Each channel supports 2 Gbps, 4 Gbps, and 8 Gbps link data
rates with auto-negotiation to support existing switches, directors, and storage devices.
FICON Express4: FICON Express4 was the last feature to support 1 Gbps link data rates.
For any generation of FICON channels, you can attach directly to a DS8000 or you can attach
through a FICON capable Fibre Channel switch.
When you use a Fibre Channel/FICON host adapter to attach to FICON channels, either
directly or through a switch, the port is dedicated to FICON attachment and cannot be
simultaneously attached to FCP hosts. When you attach a DS8000 to FICON channels
through one or more switches, the maximum number of FICON logical paths is 2048 per
DS8000 host adapter port. The directors provide high availability with redundant components
and no single points of failure.
A one-to-many configuration is also possible, as shown in Figure 8-6, but, again, careful
planning is needed to avoid performance issues.
Sizing FICON connectivity is not an easy task. You must consider many factors. We strongly
advise that you create a detailed analysis of the specific environment. Use these guidelines
before you begin sizing the attachment environment:
For FICON Express CHPID utilization, the recommended maximum utilization level is
50%.
For the FICON Bus busy utilization, the recommended maximum utilization level is 40%.
For the FICON Express Link utilization with an estimated link throughput of 2 Gbps,
4 Gbps, or 8 Gbps, the recommended maximum utilization threshold level is 70%.
For more information about the DS8000 FICON support, see IBM System Storage DS8000
Host Systems Attachment Guide, SC26-7917, and FICON Native Implementation and
Reference Guide, SG24-6266.
The FICON features provide support of Fibre Channel and SCSI devices in z/VM, z/VSE, and
Linux on System z. FCP allows z/VM, z/VSE, and Linux on System z to access
industry-standard SCSI devices. For disk applications, these FCP storage devices use FB
512-byte sectors rather than Extended Count Key Data (ECKD™) format. Each FICON
For more information about Linux on System z connectivity, see the IBM developerWorks®
website:
http://www.ibm.com/developerworks/linux/linux390/development_documentation.html
Before planning for performance, validate the configuration of your environment. See the
System Storage Interoperation Center (SSIC):
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp
Also, for host bus adapter (HBA) interoperability, see this website:
http://www.ibm.com/systems/support/storage/config/hba/index.wss
Check the IBM Support website to download the latest version of firmware for the Fibre
Channel (FC) adapters for AIX servers:
http://www14.software.ibm.com/webapp/set2/firmware/gjsn
Download the latest fix packs for your AIX version from the following site:
http://www.ibm.com/eserver/support/fixes/fixcentral/main/pseries/aix
The Oracle Solaris software family requires patches to ensure that the host and the DS8000
function correctly. See the following website for the most current list of Oracle Solaris-SPARC
patches and the Oracle Solaris-x86 patch for recent and current versions of Oracle Solaris:
http://www.oracle.com/us/products/servers-storage/solaris/solaris11/overview/index
.html
Also, for detailed information about how to attach and configure a host system to a DS8000,
see the IBM System Storage DS8000 Host System Attachment Guide, GC27-2298:
http://www.ibm.com/support/docview.wss?rs=1114&context=HW2B2&uid=ssg1S7001161
Normally in each of these layers, there are performance indicators that help you to assess
how that particular layer affects performance.
Modern DS8000 systems have many improvements in data management that can change LVM usage. Easy Tier V3 functionality moves the method of data isolation from the rank level to the extent pool, volume, and application levels. It is important that a disk system today is planned from the application point of view, not from the hardware resource allocation point of view. Plan the logical volumes on an extent pool that is dedicated to one type of application or workload. This method of disk system layout eliminates the necessity of LVM usage. Moreover, by using LVM striping on Easy Tier managed volumes, you eliminate most of the technology benefits, because striping shadows the real skew factor and changes the real picture of the hot extent allocation. This method might lead to the generation of an improper extent migration plan, which leads to continuous massive extent migration. Performance analysis becomes complicated, and isolating I/O bottlenecks becomes a complicated task. In general, the basic approach for most applications is to use one or two hybrid extent pools with three tiers and Easy Tier in automated mode for groups of applications of the same kind or the same workload type. To prevent bandwidth consumption by one or several applications, use the I/O Priority Manager functionality.
The extended RAID functionality of the LVM (RAID 5, 6, and 10) must not be used at all, except for the RAID 1 (mirroring) function, which might be required in high availability (HA) and disaster recovery (DR) solutions.
Consider these points when you read the LVM description in the following sections. Also, see
Chapter 4, “Logical configuration performance considerations” on page 87.
There are two methods for implementation of queuing: tagged and untagged. Tagged queuing
allows a target to accept multiple I/O processes from each initiator for each logical unit.
Untagged queuing allows a target to accept one I/O process from each initiator for each
logical unit or target routine. Untagged queuing might be supported by SCSI-1 or SCSI-2
devices. Tagged queuing is new in SCSI-2. For more information, see this link:
http://ldkelley.com/SCSI2/SCSI2-07.html
The qdepth parameter might affect the disk I/O performance. Values that are too small can make the device inefficient to use. Values that are too high might lead to the QUEUE FULL status of the device, rejection of the next I/O, and data corruption or a system crash.
Another important reason why queue depth parameters must be set correctly is the queue
limits of the host ports of the disk system. The host port might be flooded with the SCSI
commands if there is no correct limit set in the operating system. When this situation
happens, a host port refuses to accept any I/Os, then resets, and then starts the loop
initialization primitive (LIP) procedure. This situation leads to the inactivity of the port for up to
several minutes and might initiate path failover or an I/O interruption. Moreover, in highly
loaded environments, this situation leads to the overload of the other paths and might lead to
the complete I/O interruption for the application or buffer overflow in the operating system,
which causes paging activity.
In addition to the settings specified in the DS8000 Host System Attachment Guide,
SC26-7917-02, there can be additional suggestions for multipathing software. For example,
the DM-Multipath driver in Linux requires that you have the following settings for QLogic FC
HBAs:
ql2xmaxqdepth. This parameter defines the maximum queue depth reported to SCSI
Mid-Level per device. The Queue depth specifies the number of outstanding requests per
LUN. The default is 32. The recommended number is 16.
qfull_retry_count. The number of retries to perform on Queue Full status on the device.
The default is 16. The recommended number is 32.
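On most current Linux distributions, such module parameters are persisted through a modprobe configuration file. The following sketch assumes that the qla2xxx driver is loaded as a module; the file name is illustrative, and an initramfs rebuild or driver reload is required for the change to take effect:
cat > /etc/modprobe.d/qla2xxx.conf <<'EOF'
options qla2xxx ql2xmaxqdepth=16 qfull_retry_count=32
EOF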
IBM Subsystem Device Driver (SDD) and SDD Path Control Module (SDDPCM) multipathing
drivers have their own queue depth settings to manage FC targets and the vpaths.
If qdepth_enable=yes (the default), I/Os that exceed the queue_depth are queued at SDD. If qdepth_enable=no, I/Os that exceed the queue_depth are queued in the hdisk wait queue. SDD with qdepth_enable=no, and SDDPCM, do not queue I/Os and instead merely pass them to the hdisk drivers.
The datapath command: With SDD 1.6, it is preferable to use the datapath command to
change qdepth_enable, rather than by using chdev, because then it is a dynamic change.
For example, datapath set qdepth disable sets it to no. Certain releases of SDD do not
include SDD queueing, and other releases include SDD queueing. Certain releases do not
show the qdepth_enable attribute. Either check the manual for your version of SDD or try
the datapath command to see whether it supports turning this feature off.
If you have used both SDD and SDDPCM, recall that with SDD, each LUN has a corresponding vpath and an hdisk for each path to the vpath or LUN, while with SDDPCM, you have only one hdisk per LUN. Thus, with SDD, you can submit queue_depth x # paths I/Os to a LUN, while with SDDPCM, you can submit only queue_depth I/Os to the LUN. If you switch from SDD that uses four paths to SDDPCM, you must set the SDDPCM hdisk queue_depth to 4x that of the SDD hdisks.
Both the hdisk and adapter drivers have “in process” and “wait” queues. After the queue limit
is reached, the I/Os wait until an I/O completes and opens a slot in the service queue. The
in-process queue is also sometimes called the “service” queue. Many applications do not
generate many in-flight I/Os, especially single-threaded applications that do not use
asynchronous I/O. Applications that use asynchronous I/O are likely to generate more in-flight
I/Os.
As shown in Figure 9-2 on page 334, you need to set the queue_depth attribute on the VIOC hdisk to match the queue_depth of the mapped hdisk on the VIOS. The maximum number of LUNs that can be mapped to one virtual SCSI adapter (vhost on the VIOS or vscsi on the VIOC) also depends on the queue depth that you choose.
Important: To change the queue_depth on an hdisk at the VIOS, you are required to
unmap the disk from the VIOC and remap it back.
If you use N_Port ID Virtualization (NPIV), if you increase num_cmd_elems on the virtual FC
(vFC) adapter, you must also increase the setting on the real FC adapter.
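The following commands sketch how the related attributes are checked and changed; the device names and values are illustrative, and the commonly documented limit of INT(510 / (queue_depth + 3)) LUNs per virtual SCSI adapter is an assumption that you should verify for your VIOS level:
# On the VIOS (as padmin); the disk must be unmapped from the VIOC before the change:
lsdev -dev hdisk10 -attr queue_depth
chdev -dev hdisk10 -attr queue_depth=20 -perm
# On the VIOC (AIX partition):
lsattr -El hdisk4 -a queue_depth
chdev -l hdisk4 -a queue_depth=20 -P          # -P defers the change to the next reboot
# For NPIV, raise num_cmd_elems on both the virtual and the physical FC adapter:
chdev -l fcs0 -a num_cmd_elems=1024 -P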
For more information about the queue depth settings for VIO Server, see IBM System
Storage DS8000 Host Attachment and Interoperability, SG24-8887-00:
http://www.redbooks.ibm.com/abstracts/sg248887.html
A good general rule for tuning queue_depths is to increase queue_depths until I/O service
times start exceeding 15 ms for small random reads or writes or you are not filling the queues.
After I/O service times start increasing, you have pushed the bottleneck from the AIX disk and adapter queues to the disk subsystem. There are two approaches to tuning queue depth:
Use your application and tune the queues from it.
Use a test tool to see what the disk subsystem can handle and tune the queues from that
information based on what the disk subsystem can handle.
We prefer to tune based on the application I/O requirements, especially when the disk system
is shared with other servers.
When you examine the devstats, if you see that the Maximum field = queue_depth x # paths and qdepth_enable=yes for SDD, then increasing the queue_depth for the hdisks might help performance; at a minimum, the I/Os then queue on the disk subsystem rather than in AIX. It is reasonable to increase queue depths by about 50% at a time.
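For example, assuming hdisk4 is one of the DS8000 hdisks in question (the device name and the new value are illustrative only), the queue depth can be checked and raised as follows; the -P flag defers the change until the device is reconfigured or the system is rebooted:
lsattr -El hdisk4 -a queue_depth         # display the current queue depth
lsattr -Rl hdisk4 -a queue_depth         # list the allowable values
chdev -l hdisk4 -a queue_depth=32 -P     # example: raise by about 50%, effective after reboot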
Regarding the qdepth_enable parameter, the default is yes, which essentially has SDD handle the I/Os beyond queue_depth for the underlying hdisks. Setting it to no results in the hdisk device driver handling them in its wait queue: with qdepth_enable=yes, SDD handles the wait queue; otherwise, the hdisk device driver handles it. There are error-handling benefits to letting SDD handle these I/Os, for example, when using LVM mirroring across two DS8000s. However, with heavy I/O loads and much queuing in SDD (when qdepth_enable=yes), it is more efficient to set qdepth_enable=no and let the hdisk device drivers handle relatively shorter wait queues rather than have SDD handle one long wait queue, because SDD queue handling is single threaded, whereas there is a thread for handling each hdisk queue. So, if error handling is of primary importance (for example, when LVM mirroring across disk subsystems), leave qdepth_enable=yes. Otherwise, qdepth_enable=no handles long wait queues more efficiently. Set the qdepth_enable parameter by using the datapath command, because the change is then dynamic (chdev is not dynamic for this parameter).
It is also reasonable to use the iostat -D or sar -d commands to get an indication of whether the queue_depths need to be increased (Example 9-1).
The avgwqsz is the average wait queue size, and avgsqsz is the average service queue size.
The average time spent in the wait queue is avgtime. The sqfull value changed from initially being a count of the times that an I/O was submitted to a full queue to now being the rate of I/Os submitted to a full queue. The example report shows the prior case (a count of I/Os
submitted to a full queue). Newer releases typically show decimal fractions that indicate a
rate. It is good that iostat -D separates reads and writes, because we expect the I/O service
times to differ when we have a disk subsystem with cache. The most useful report for tuning is iostat -D without an interval, which shows statistics since system boot, provided that the system is configured to continuously maintain disk I/O history (run lsattr -El sys0 or smitty chgsys to see whether the iostat attribute is set to true).
# sar -d 1 2
System configuration: lcpu=2 drives=1 ent=0.30
The avwait is the average time spent in the wait queue. The avserv is the average time spent
in the service queue. And, avserv corresponds to avgserv in the iostat output. The avque
value represents the average number of I/Os in the wait queue.
SDD provides the datapath query devstats and datapath query adaptstats commands to
show hdisk and adapter queue statistics. SDDPCM similarly has pcmpath query devstats
and pcmpath query adaptstats. You can refer to the SDD manual for syntax, options, and
explanations of all the fields. We show devstats output for a single LUN. See Example 9-2 on
page 336 for details.
Transfer Size: <= 512 <= 4k <= 16K <= 64K > 64K
118 20702 80403 12173 5245
We are interested in the Maximum field, which indicates the maximum number of I/Os
submitted to the device since system boot. The Maximum for devstats does not exceed
queue_depth x # paths for SDD when qdepth_enable=yes. But Maximum for adaptstats can
exceed num_cmd_elems, because it represents the maximum number of I/Os submitted to
the adapter driver and includes I/Os for both the service and wait queues. If, in this case, we
have two paths and use the default queue_depth of 20, the 40 indicates that we filled the
queue at least once and increasing queue_depth can help performance. For SDDPCM, if the
Maximum value equals the hdisk queue_depth, the hdisk driver queue filled during the
interval and increasing queue_depth is appropriate.
You can similarly monitor adapter queues and IOPS: for adapter IOPS, run iostat -at
<interval> <# of intervals> and for adapter queue information, run iostat -aD, optionally
with an interval and number of intervals.
The downside of setting queue depths too high is that the disk subsystem cannot handle the
I/O requests in a timely fashion and might even reject the I/O or ignore it. This situation can
result in an I/O timeout, and an I/O error recovery code is called. This situation is bad,
because the processor ends up performing more work to handle I/Os than necessary. If the
I/O eventually fails, this situation can lead to an application crash or worse.
Lower the queue depth per LUN when using multipathing. With multipathing, this default value
is magnified, because it equals the default queue depth of the adapter multiplied by the
number of active paths to the storage device. For example, because QLogic uses a default
queue depth of 32, the suggested queue depth value to use is 16 when using two active paths
and 8 when using four active paths. Directions for adjusting the queue depth are specific to
each HBA driver and are available in the documentation for the HBA.
For more information about AIX, see this technote, AIX disk queue depth tuning for
performance, available at this website:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
For AIX 5.3 or earlier, if you want additional information about the tunable commands and
their parameters for a specific configuration, see the following link:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix
.cmds/doc/aixcmds3/ioo.htm
9.2.1 AIX Journaled File System (JFS) and Journaled File System 2 (JFS2)
JFS and JFS2 are AIX standard filesystems. JFS was created for the 32-bit kernels and
implements the concept of a transactional filesystem where all of the I/O operations of the
metadata information are kept in a log. The practical impact is that in the recovery of a
filesystem, the fsck command looks at that log to see what I/O operations completed and
rolls back only those operations that were not completed. From a performance point of view,
there is overhead. However, it is generally an acceptable compromise to ensure the recovery
of a corrupted filesystem. Its file organization method is a linear algorithm. You can mount the
filesystems with the Direct I/O option. You can adjust the mechanisms of sequential read
ahead, sequential and random write behind, delayed write operations, and others. You can
tune its buffers to increase the performance. It also supports asynchronous I/O.
JFS2 was created for 64-bit kernels. Its file organization method is a B+ tree algorithm. It supports all the features that are described for JFS, with the exception of delayed write operations. It also supports concurrent I/O (CIO).
Read ahead
JFS and JFS2 have read-ahead algorithms that can be configured to buffer data for
sequential reads into the filesystem cache before the application requests it. Ideally, this
feature reduces the percent of I/O wait (%iowait) and increases I/O throughput as seen from
the operating system. Configuring the read ahead algorithms too aggressively results in
unnecessary I/O. The following VMM tunable parameters control read-ahead behavior:
For JFS:
– minpgahead = max(2, <application’s blocksize> / <filesystem’s blocksize>)
– maxpgahead = max(256, (<application’s blocksize> / <filesystem’s blocksize> *
<application’s read ahead block count>))
For JFS2:
– j2_minPgReadAhead = max(2, <application’s blocksize> / <filesystem’s blocksize>)
– j2_maxPgReadAhead = max(256, (<application’s blocksize> / <filesystem’s blocksize>
* <application’s read ahead block count>))
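As an illustration, assuming a JFS2 filesystem (the values below are examples derived from the formulas above, not fixed recommendations), these VMM tunables can be inspected and changed with the ioo command; the -p flag also makes the change persistent across reboots:
ioo -a | grep -i pgreadahead         # display the current JFS2 read-ahead settings
ioo -p -o j2_minPgReadAhead=8        # example minimum read-ahead, in 4 KB pages
ioo -p -o j2_maxPgReadAhead=256      # example maximum read-ahead, in 4 KB pages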
I/O pacing
I/O pacing manages the concurrency to files and segments by limiting the processor
resources for processes that exceed a specified number of pending write I/Os to a discrete
file or segment. When a process exceeds the maxpout limit (high-water mark), it is put to
sleep until the number of pending write I/Os to the file or segment is less than minpout
(low-water mark). This pacing allows another process to access the file or segment.
Disabling I/O pacing improves backup times and sequential throughput. Enabling I/O pacing
ensures that no single process dominates the access to a file or segment. AIX 6.1 and higher
enables I/O pacing by default. In AIX 5.3, you needed to explicitly enable this feature. The
feature is enabled by setting the sys0 settings of the minpout and maxpout parameters to
4096 and 8193 (lsattr -El sys0). To disable I/O pacing, simply set them both to zero. You
can also limit the effect of setting global parameters by mounting filesystems by using an
explicit 0 for minpout and maxpout: mount -o minpout=0,maxpout=0 /u. Tuning the minpout and maxpout values is a trade-off: enabling I/O pacing improves user response time at the expense of throughput. For more
information about I/O pacing, see the following link:
http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=%2Fcom.ibm.aix.p
rftungd%2Fdoc%2Fprftungd%2Fdisk_io_pacing.htm
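For reference, a minimal sketch of checking and changing the system-wide pacing settings dynamically (the non-zero values shown simply correspond to the AIX 6.1 defaults mentioned above):
lsattr -El sys0 -a minpout -a maxpout             # display the current low/high water marks
chdev -l sys0 -a minpout=4096 -a maxpout=8193     # AIX 6.1 default pacing values
chdev -l sys0 -a minpout=0 -a maxpout=0           # disable I/O pacing system-wide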
Write behind
This parameter enables the operating system to initiate I/O that is normally controlled by the
syncd. Writes are triggered when a specified number of sequential 16 KB clusters are
updated:
Sequential write behind:
– numclust for JFS
– j2_nPagesPerWriteBehindCluster for JFS2
Random write behind:
– maxrandwrt for JFS
– j2_maxRandomWrite and j2_nRandomCluster for JFS2
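These write-behind parameters are also ioo tunables; a short sketch of how they can be displayed and adjusted (the values are examples only):
ioo -a | egrep "numclust|maxrandwrt|j2_nPagesPerWriteBehindCluster|j2_maxRandomWrite|j2_nRandomCluster"
ioo -p -o j2_nPagesPerWriteBehindCluster=64    # example: larger sequential write-behind clusters
ioo -p -o j2_maxRandomWrite=32                 # example: enable JFS2 random write behind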
Mount options
Use release behind mount options when appropriate:
The release behind mount option can reduce syncd and lrud overhead. This option
modifies the filesystem behavior in such a way that it does not maintain data in JFS2
cache. You use these options if you know that data that goes into or out of certain
filesystems is not requested again by the application before the data is likely to be paged
out. Therefore, the lrud daemon has less work to do to free up cache and eliminates any
syncd overhead for this filesystem. One example of a situation where you can use these
options is if you have a Tivoli Storage Manager Server with disk storage pools in
filesystems and you configured the read ahead mechanism to increase the throughput of
data, especially when a migration takes place from disk storage pools to tape storage
pools:
– -rbr for release behind after a read
– -rbw for release behind after a write
– -rbrw for release behind after a read or a write
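For example, assuming a Tivoli Storage Manager disk storage pool filesystem mounted at /tsm/stgpool01 (a hypothetical mount point), release-behind for reads and writes could be enabled as follows:
mount -o rbrw /tsm/stgpool01          # enable release-behind for reads and writes at mount time
chfs -a options=rbrw /tsm/stgpool01   # make it persistent (note: this sets the options field in /etc/filesystems)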
Direct I/O (DIO):
Asynchronous I/O
Asynchronous I/O (AIO) is the AIX facility that allows an application to issue an I/O request
and continue processing without waiting for the I/O to finish:
With AIX Version 6, the tunables fastpath and fsfastpath are classified as restricted
tunables and now are set to a value of 1, by default. Therefore, all asynchronous I/O
requests to a raw logical volume are passed directly to the disk layer by using the
corresponding strategy routine (legacy AIO or POSIX-compliant AIO). Or, all
asynchronous I/O requests for files opened with cio are passed directly to LVM or disk by
using the corresponding strategy routine.
Also, there are no more AIO devices in Object Data Manager (ODM) and all their
parameters now become tunables using the ioo command. The newer aioo command is
removed. For additional information, see the IBM AIX Version 6.1 Differences Guide,
SG24-7559, and the IBM AIX Version 7.1 Differences Guide, SG24-7910:
– http://www.redbooks.ibm.com/abstracts/sg247559.html
– http://www.redbooks.ibm.com/abstracts/sg247910.html
The GPFS maxFilesToCache parameter sets the number of inodes to cache for recently used files. The default value is 1000. If the application works with many small files, consider increasing this value. There is a limit of 300000 tokens.
The maxStatCache parameter sets the number of inodes to keep in the stat cache. This value needs to be four times the maxFilesToCache value.
Number of threads
The workerThreads parameter controls the maximum number of concurrent file operations at
any instant. The suggested value is the same as the maxservers value in AIX 5.3. There is a
limit of 550 threads.
maxMBpS
Increase the maxMBpS to 80% of the total bandwidth for all HBAs in a single host. The
default value is 150 MB/s.
maxblocksize
Configure the GPFS blocksize (maxblocksize) to match the application I/O size, the RAID
stripe size, or a multiple of the RAID stripe size. For example, if you use an Oracle database,
it is better to adjust a value that matches the product of the value of the DB_BLOCK_SIZE
and DB_FILE_MULTIBLOCK_READ_COUNT parameters. If the application performs many
sequential I/Os, it is better to configure a blocksize from 8 to 16 MB to take advantage of the
sequential prefetching algorithm on the DS8000.
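A brief sketch of how these GPFS parameters can be displayed and changed with the standard GPFS administration commands (the values are examples only; depending on the attribute, a change might require the GPFS daemon to be restarted):
mmlsconfig                                            # display the current cluster configuration
mmchconfig maxFilesToCache=4000,maxStatCache=16000    # example inode/stat cache sizes
mmchconfig maxMBpS=800                                # example: about 80% of total HBA bandwidth in MB/s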
In Figure 9-4, the DS8000 LUNs that are under the control of the LVM are called physical
volumes (PVs). The LVM splits the disk space into smaller pieces, which are called physical
partitions (PPs). A logical volume (LV) consists of several logical partitions (LPs). A
filesystem can be mounted over an LV, or it can be used as a raw device. Each LP can point to
up to three corresponding PPs. The ability of the LV to point a single LP to multiple PPs is the
way that LVM implements mirroring (RAID 1).
To set up the volume layout with the DS8000 LUNs, you can adopt one of the following
strategies:
Storage pool striping: In this case, you are spreading the workload at the storage level. At
the operating system level, you need to create the LVs with the inter-policy attribute set to
minimum, which is the default option when creating an LV.
PP Striping: A set of LUNs are created in different ranks inside of the DS8000. When the
LUNs are recognized in AIX, a volume group (VG) is created. The LVs are spread evenly
over the LUNs by setting the inter-policy to maximum, which is the most common method
used to distribute the workload. The advantage of this method compared to storage pool
striping is the granularity of data spread over the LUNs. With storage pool striping, the
data is spread in chunks of 1 GB. In a VG, you can create PP sizes from
8 MB to 16 MB. The advantage of this method compared to LVM Striping is that you keep more flexibility to manage and migrate the physical partitions across the LUNs later.
PP Striping
Figure 9-5 on page 345 shows an example of PP Striping. The volume group contains four
LUNs and created 16 MB physical partitions on the LUNs. The logical volume in this example
consists of a group of 16 MB physical partitions from four logical disks: hdisk4, hdisk5, hdisk6,
and hdisk7.
Figure 9-5 annotations: vpath0, vpath1, vpath2, and vpath3 are hardware-striped LUNs on different DS8000 extent pools; 8 GB LUNs with 16 MB partitions give about 500 physical partitions per LUN (pp1 - pp500); /dev/inter-disk_lv is made up of eight logical partitions (lp1 - lp8), 8 x 16 MB = 128 MB in total.
The first step is to create a volume group. We suggest that you create a VG with a set of
DS8000 LUNs where each LUN is in a separate extent pool. If you plan to add a set of LUNs
to a host, define another VG. To create the data01vg volume group with a PP size of 16 MB, execute the following command:
mkvg -S -s 16 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
Commands: To create the volume group, if you use SDD, you use the mkvg4vp command.
If you use SDDPCM, you use the mkvg command. All the flags for the mkvg command apply
to the mkvg4vp command.
After you create the VG, the next step is to create the LVs. For a VG with four disks (LUNs), we suggest that you create the LVs as a multiple of the number of disks in the VG times the PP size. In our case, we create the LVs in multiples of 64 MB (4 disks x 16 MB). You can implement
the PP Striping by using the option -e x. By adding the -a e option, the Intra-Physical Volume
Allocation Policy changes the allocation policy from middle (default) to edge so that the LV
physical partitions are allocated beginning at the outer edge and continuing to the inner edge.
This method ensures that all physical partitions are sequentially allocated across the physical
disks. To create an LV of 1 GB, execute the following command:
mklv -e x -a e -t jfs2 -y inter-disk_lv data01vg 64 hdisk4 hdisk5 hdisk6 hdisk7
Preferably, use inline logs for JFS2 logical volumes, which results in one log for every
filesystem and it is automatically sized. Having one log per filesystem improves performance,
because it avoids the serialization of access when multiple filesystems make metadata
changes. The disadvantage of inline logs is that they cannot be monitored for I/O rates, which
can provide an indication of the rate of metadata changes for a filesystem.
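For illustration, a JFS2 filesystem with an inline log can be created on the LV from the previous step as follows (the mount point is an example):
crfs -v jfs2 -d inter-disk_lv -m /interdiskfs -A yes -a logname=INLINE
mount /interdiskfs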
LVM Striping
Figure: LVM Striping example. hdisk4, hdisk5, hdisk6, and hdisk7 are hardware-striped LUNs on different DS8000 extent pools; 8 GB LUNs with 256 MB partitions give about 32 physical partitions per LUN (pp1 - pp32).
Notice that /dev/striped_lv is also made up of eight 256 MB physical partitions, but each
partition is then subdivided into 32 chunks of 8 MB; only three of the 8 MB chunks are shown
per logical partition for space reasons.
Again, the first step is to create a VG. To create a VG for LVM Striping, execute the following
command:
mkvg -S -s 256 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
To create a striped LV, you need to combine the following options when you use LVM Striping (see the example command after this list):
Stripe width (-C): This option sets the maximum number of disks across which to spread the data. The default value is taken from the upperbound option (-u).
Copies (-c): This option is only required when you create mirrors. You can set from 1 to 3
copies. The default value is 1.
Strict allocation policy (-s): This option is required only when you create mirrors and it is
necessary to use the value “s” (superstrict).
Stripe size (-S): This option sets the size of a chunk of a sliced PP. Since AIX 5.3, the valid
values include 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M, 16M, 32M,
64M, and 128M.
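A minimal sketch that combines these options, assuming the data01vg created above with 256 MB PPs and an 8 MB stripe size (8 LPs x 256 MB = 2 GB; the names and sizes are examples):
mklv -t jfs2 -y striped_lv -C 4 -S 8M data01vg 8 hdisk4 hdisk5 hdisk6 hdisk7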
AIX 5.3 implemented a new feature, the striped column. With this feature, you can extend an LV onto a new set of disks after the disks that the LV currently spans are full.
Memory buffers
Adjust the LVM memory buffers (pv_min_pbuf) to increase performance. Set it to 2048 for AIX 7.1.
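A sketch of how the pbuf settings can be inspected and tuned, either system-wide with ioo or per volume group with lvmo (the volume group name and values are examples):
ioo -a | grep pv_min_pbuf                 # system-wide minimum pbufs per physical volume
ioo -p -o pv_min_pbuf=2048                # example system-wide value
lvmo -v data01vg -a                       # per-VG pbuf settings and blocked I/O counters
lvmo -v data01vg -o pv_pbuf_count=2048    # example per-VG override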
Scheduling policy
If you have a dual-site cluster solution that uses PowerHA with LVM Cross-Site, you can
reduce the link requirements among the sites by changing the scheduling policy of each LV to
parallel write/sequential read (ps). You must remember that the first copy of the mirror needs
to point to the local storage.
http://sfdoccentral.symantec.com/sf/5.1/aix/pdf/vxvm_admin.pdf
http://docs.oracle.com/cd/B28359_01/server.111/b31107/asmprepare.htm
A paper from Oracle titled ASM Overview and Technical Best Practices:
http://www.oracle.com/technology/products/database/asm/pdf/asm_10gr2_bestpracti
ces%209-07.pdf
Check the interoperability matrix of SDDPCM (MPIO) to see which version is supported:
http://www.ibm.com/support/docview.wss?rs=540&uid=ssg1S7001350#AIXSDDPCM
MPIO is only supported with PowerHA (HACMP) if you configure the VGs in Enhanced
Concurrent Mode. For additional information about PowerHA with MPIO, see this link:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FLASH10504
If you use a multipathing solution with Virtual I/O Server (VIOS), use MPIO. There are
several limitations when you use SDD with VIOS. See 9.2.10, “Virtual I/O Server (VIOS)”
on page 350 and the VIOS support site for additional information:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/datasheet.ht
ml#multipath
http://sfdoccentral.symantec.com/sf/5.1SP1PR1/aix/pdf/dmp_admin_51sp1pr1_aix.pdf
9.2.9 FC adapters
FC adapters or host bus adapters (HBAs) provide the connection between the host and the
storage devices. We suggest that you configure the following four important parameters:
num_cmd_elems: This parameter sets the maximum number of commands to queue to
the adapter. When many supported storage devices are configured, you can increase this
attribute to improve performance. The range of supported values depends on the FC
adapter. Check with lsattr -Rl fcs0 -a num_cmd_elems.
max_xfer_size: This attribute of the fcs adapter device, which controls the maximum I/O size that
the adapter device driver can handle, also controls a memory area used by the adapter for
data transfers. When the default value is used (max_xfer_size=0x100000), the memory
area is 16 MB in size. When setting this attribute to any other allowable value (for example,
0x200000), the memory area is 128 MB in size. For typical DS8000 environments, this
setting must remain unchanged and use its default value. Any other change might imply
risks and not lead to performance improvements.
The fcstat command can be used to examine whether increasing num_cmd_elems or
max_xfer_size can increase performance. In selected environments with heavy I/O and large transfer sizes, increasing these values can improve performance; a command sketch follows later in this section. However, note the following restriction first.
Changing the max_xfer_size: Changing max_xfer_size uses memory in the PCI Host
Bridge chips attached to the PCI slots. The sales manual, regarding the dual-port 4
Gbps PCI-X FC adapter states that “If placed in a PCI-X slot rated as SDR compatible
and/or has the slot speed of 133 MHz, the AIX value of the max_xfer_size must be kept
at the default setting of 0x100000 (1 megabyte) when both ports are in use. The
architecture of the DMA buffer for these slots does not accommodate larger
max_xfer_size settings.” Issues occur when configuring the LUNs if there are too many
FC adapters and too many LUNs attached to the adapter. Errors, such as DMA_ERR
might appear in the error report. If you get these errors, you need to change the
max_xfer_size back to the default value. Also if you boot from SAN, if you encounter
this error, you cannot boot, so be sure to have a back-out plan if you plan to change the
max_xfer_size and boot from SAN.
dyntrk: IBM AIX supports dynamic tracking of FC devices. Previous releases of AIX
required a user to unconfigure FC storage device and adapter device instances before changing the storage area network (SAN), which can result in an N_Port ID (SCSI ID)
change of any remote storage ports. If dynamic tracking of FC devices is enabled, the FC
adapter driver detects when the Fibre Channel N_Port ID of a device changes. The FC
adapter driver then reroutes traffic destined for that device to the new address while the
devices are still online. Events that can cause an N_Port ID to change include moving a
cable between a switch and storage device from one switch port to another, connecting
two separate switches that use an inter-switch link (ISL), and possibly rebooting a switch.
Dynamic tracking of FC devices is controlled by a new fscsi device attribute, dyntrk. The
default setting for this attribute is no. To enable dynamic tracking of FC devices, set this
attribute to dyntrk=yes, as shown in the example: chdev -l fscsi0 -a dyntrk=yes.
fc_err_recov: IBM AIX supports Fast I/O Failure for FC devices after link events in a
switched environment. If the FC adapter driver detects a link event, such as a lost link
between a storage device and a switch, the FC adapter driver waits a short time,
approximately 15 seconds, so that the fabric can stabilize. At that point, if the FC adapter
driver detects that the device is not on the fabric, it begins failing all I/Os at the adapter
driver. Any new I/O or future retries of the failed I/Os are failed immediately by the adapter
until the adapter driver detects that the device rejoins the fabric. Fast Failure of I/O is
controlled by a new fscsi device attribute, fc_err_recov. The default setting for this attribute
is delayed_fail, which is the I/O failure behavior seen in previous versions of AIX. To
enable Fast I/O Failure, set this attribute to fast_fail, as shown in the example: chdev -l
fscsi0 -a fc_err_recov=fast_fail.
Important: Change the attributes fc_err_recov to fast_fail and dyntrk to yes only if you
use a multipathing solution with more than one path.
Example 9-3 on page 350 is the output of the attributes of an fcs device.
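A short sketch of checking the adapter settings and the queue-full statistics (fcs0 is an example adapter; changing num_cmd_elems requires the device to be reconfigured, so -P defers the change to the next reboot):
fcstat fcs0 | grep -i "No Command Resource Count"    # how often num_cmd_elems was exhausted
lsattr -El fcs0 -a num_cmd_elems -a max_xfer_size    # current adapter settings
chdev -l fcs0 -a num_cmd_elems=1024 -P               # example value; effective after reboot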
Regarding the Fast I/O Failure (fc_err_recov) and Dynamic Tracking (dyntrk) options, for more
information, see the following links:
http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=%2Fcom.ibm.aix.p
rftungd%2Fdoc%2Fprftungd%2Ffast_io_failure.htm
http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=%2Fcom.ibm.aix.k
ernelext%2Fdoc%2Fkernextc%2Ffastfail_dyntracking.htm
Also, see the information about num_cmd_elems and max_xfer_size in AIX disk queue depth
tuning for performance:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
For additional information, see the SDD User’s Guide at the following link:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303
The VIOS allows a physical adapter with disk attached at the VIOS partition level to be shared
by one or more partitions and enables clients to consolidate and potentially minimize the
number of required physical adapters. See Figure 9-7 on page 351 for an illustration.
When you assign several LUNs from the DS8000 to the VIOS and then map those LUNs to
the LPAR clients, over time even trivial activities, such as upgrading the SDDPCM device driver,
can become challenging. For additional information about VIOS, see the following links:
IBM PowerVM Virtualization Introduction and configuration:
http://www.redbooks.ibm.com/redbooks/pdfs/sg247940.pdf
IBM PowerVM Virtualization Managing and Monitoring:
http://www.redbooks.ibm.com/abstracts/sg247590.html
Performance suggestions
We suggest these performance settings when configuring Virtual SCSI for performance:
Processor:
– Typical entitlement is 0.25
– Virtual processor of 2
– Always run uncapped
– Run at higher priority (weight factor >128)
– More processor power with high network loads
Memory:
– Typically >= 10 GB (at least 1 GB of memory is required. The minimum is 2 GB +
20 MB per hdisk.) This type of a configuration is needed to avoid any paging activity in
the VIOS, which might lead to performance degradation in all LPARs.
– Add more memory if there are high device (vscsi and hdisk) counts.
– Small LUNs drive up the memory requirements.
For multipathing with VIOS, check the configuration of the following parameters (a command sketch follows the note after this list):
– fscsi devices on VIOS:
• The attribute fc_err_recov is set to fast_fail
• The attribute dyntrk is set to yes with the command chdev -l fscsiX -a
dyntrk=yes
– hdisk devices on VIOS:
• The attribute algorithm is set to load_balance
• The attribute reserve_policy is set to no_reserve
• The attribute hcheck_mode is set to nonactive
• The attribute hcheck_interval is set to 20
– vscsi devices in client LPARs:
The attribute vscsi_path_to is set to 30
– hdisk devices in client:
• The attribute algorithm is set to fail_over
• The attribute reserve_policy is set to no_reserve
• The attribute hcheck_mode is set to nonactive
• The attribute hcheck_interval is set to 20
Important: Change the reserve_policy parameter to no_reserve only if you are going to
map the LUNs of the DS8000 directly to the client LPAR.
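A minimal sketch of these settings, assuming hdisk10 is a DS8000 LUN on the VIOS (commands run from the root shell reached with oem_setup_env) and vscsi0/hdisk0 are the corresponding client devices; the device names are illustrative and the valid algorithm values depend on the path-control module in use:
# On the VIOS
chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes -P
chdev -l hdisk10 -a algorithm=load_balance -a reserve_policy=no_reserve \
      -a hcheck_mode=nonactive -a hcheck_interval=20
# In the client LPAR
chdev -l vscsi0 -a vscsi_path_to=30 -P
chdev -l hdisk0 -a algorithm=fail_over -a reserve_policy=no_reserve \
      -a hcheck_mode=nonactive -a hcheck_interval=20 -P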
Example 9-5 shows how vmstat can help you monitor filesystem activity by using the
command vmstat -I.
Example 9-5 The vmstat -I utility output for filesystem activity analysis
[root@p520-tic-3]# vmstat -I 1 5
Example 9-6 on page 354 shows you another option that you can use, vmstat -v, from which
you can understand whether the blocked I/Os are due to a shortage of buffers.
For the preferred practice values, see the application papers listed under “AIX filesystem
caching” on page 338.
By using lvmo, you can also check whether contention is happening due to a lack of LVM
memory buffer, which is illustrated in Example 9-7.
As you can see in Example 9-7, there are two incremental counters: pervg_blocked_io_count
and global_blocked_io_count. The first counter indicates how many times an I/O block
happened because of a lack of LVM pinned memory buffer (pbufs) on that VG. The second
incremental counter counts how many times an I/O block happened due to the lack of LVM
pinned memory buffer (pbufs) in the whole OS. Other indicators of an I/O-bound system can be seen with the disk xfer part of the vmstat output when run against the physical disks, as shown in Example 9-8.
9.3.2 pstat
The pstat command counts how many legacy asynchronous I/O servers are used in the
server. There are two asynchronous I/O subsystems (AIOs):
Legacy AIO
Posix AIO
You can use the command pstat -a | grep aioserver | wc -l to get the number of legacy AIO servers that are running. You can use the command pstat -a | grep posix_aioserver | wc -l to see the number of Posix AIO servers.
Important: If you use raw devices, you have to use ps -k instead of pstat -a to measure
the legacy AIO activity.
Example 9-9 shows that the host does not have any AIO servers that are running. This
function is not enabled, by default. You can enable this function with mkdev -l aio0 or by
using SMIT. For Posix AIO, substitute posix_aio for aio0.
In AIX Version 6 and Version 7, both AIO subsystems are loaded by default but are activated
only when an AIO request is initiated by the application. Use the command pstat -a | grep
aio to see the AIO subsystems that are loaded, as shown in Example 9-10.
Example 9-10 pstat -a output to show the AIO subsystem defined in AIX 6
[root@p520-tic-3]# pstat -a | grep aio
18 a 1207c 1 1207c 0 0 1 aioLpool
33 a 2104c 1 2104c 0 0 1 aioPpool
In AIX Version 6 and Version 7, you can use the new ioo tunables to show whether the AIO is
used. An illustration is given in Example 9-11.
Example 9-11 ioo -a output to show the AIO subsystem activity in AIX 6
[root@p520-tic-3]# ioo -a | grep aio
aio_active = 0
aio_maxreqs = 65536
aio_maxservers = 30
aio_minservers = 3
aio_server_inactivity = 300
posix_aio_active = 0
posix_aio_maxreqs = 65536
posix_aio_maxservers = 30
From Example 9-11 on page 355, aio_active and posix_aio_active show whether the AIO
is used. The parameters aio_server_inactivity and posix_aio_server_inactivity show
how long an AIO server sleeps without servicing an I/O request.
To check the Asynchronous I/O configuration in AIX 5.3, type the following commands that
are shown in Example 9-12.
Example 9-12 lsattr -El aio0 output to list the configuration of legacy AIO
[root@p520-tic-3]# lsattr -El aio0
autoconfig defined STATE to be configured at system restart True
fastpath enable State of fast path True
kprocprio 39 Server PRIORITY True
maxreqs 4096 Maximum number of REQUESTS True
maxservers 10 MAXIMUM number of servers per cpu True
minservers 1 MINIMUM number of servers True
Notes: If you use AIX Version 6, there are no more Asynchronous I/O devices in the Object
Data Manager (ODM), and the command aioo is removed. You must use the ioo command
to change them.
If your AIX 5.3 is between TL05 and TL08, you can also use the aioo command to list and
increase the values of maxservers, minservers, and maxreqs.
The rule is to monitor the I/O wait by using the vmstat command. If the I/O wait is more than
25%, consider enabling AIO, which reduces the I/O wait but does not help disks that are busy.
You can monitor busy disks by using iostat, which we explain in the next section.
The lsattr -E -l sys0 -a iostat command option indicates whether the iostat statistic
collection is enabled. To enable the collection of iostat data, use chdev -l sys0 -a
iostat=true.
The disk and adapter-level system throughput can be observed by using the iostat -aDR
command.
The option a retrieves the adapter-level details, and the option D retrieves the disk-level
details. The option R resets the min* and max* values at each interval. See Example 9-12.
Vadapter:
vscsi0 xfer: Kbps tps bkread bkwrtn partition-id
29.7 3.6 2.8 0.8 0
read: rps avgserv minserv maxserv
0.0 48.2S 1.6 25.1
write: wps avgserv minserv maxserv
Paths/Disks:
hdisk0 xfer: %tm_act bps tps bread bwrtn
1.4 30.4K 3.6 23.7K 6.7K
read: rps avgserv minserv maxserv timeouts fails
2.8 5.7 1.6 25.1 0 0
write: wps avgserv minserv maxserv timeouts fails
0.8 9.0 2.1 52.8 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
11.5 0.0 34.4 0.0 0.0 0.9
Check for the following situations when analyzing the output of iostat:
Check whether the number of I/Os is balanced among the disks. If not, it might indicate
that you have problems in the distribution of PPs over the LUNs. With the information
provided by lvmstat or filemon, select the most active LV, and with the lslv -m command,
check whether the PPs are distributed evenly among the disks of the VG. If not, check the
inter-policy attribute on the LVs to see whether it is set to maximum. If the PPs are not distributed evenly and the LV inter-policy attribute is set to minimum, you need to change the attribute to maximum and reorganize the VG.
Check in the read section whether the avgserv is larger than 15 ms. If it is, it might indicate that your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage. Also,
check whether the same problem occurs with other disks of the same VG. If yes, you need
to add up the number of I/Os per second, add up the throughput by vpath (if it is the case),
rank, and host, and compare with the performance numbers from Tivoli Storage
Productivity Center for Disk.
Check in the write section whether the avgserv is larger than 3 ms. Writes that average
significantly and consistently higher indicate that write cache is full, and there is a
bottleneck in the disk.
Check in the queue section whether avgwqsz is larger than avgsqsz. Compare with other
disks in the storage. Check whether the PPs are distributed evenly in all disks in the VG. If
avgwqsz is smaller than avgsqsz, compare with other disks in the storage. If there are
differences and the PPs are distributed evenly in the VG, it might indicate that the
unbalanced situation is at the rank level.
The following example shows how multipathing needs to be considered when interpreting the iostat output.
In the example in Figure 9-8 on page 358, a server has two FC adapters and is zoned so that it uses four paths to the DS8000.
To determine the I/O statistics for the example in Figure 9-8 on page 358, you need to add up
the iostats for hdisk1 - hdisk4. One way to establish relationships between the hdisks and the
DS8000 LVs is to use the pcmpath query device command that is included with SDDPCM.
Figure 9-8 shows one DS8000 volume (LUN 1) that iostat reports as four separate disks (hdisk1 - hdisk4), one for each path.
Another way is shown in Example 9-14. The command, pcmpath query device 1, lists the
paths (hdisks). In this example, the logical disk on the DS8000 has LUN serial number
75065513000. The disk devices presented to the operating system are hdisk1, hdisk2, hdisk3, and hdisk4, so we can add up the iostats for these four hdisk devices.
The option shown in Example 9-15 on page 359 provides details in a record format, which
can be used to sum up the disk activity.
It is not unusual to see a device reported by iostat as 90% - 100% busy, because a DS8000
volume that is spread across an array of multiple disks can sustain a much higher I/O rate
than a single physical disk. A device that is 100% busy is generally a problem for a single
device, but it is probably not a problem for a RAID 5 device.
Further Asynchronous I/O can be monitored through iostat -A for legacy AIO and iostat
-P for Posix AIO.
Because the asynchronous I/O queues are assigned by filesystem, it is more interesting to
measure the queues per filesystem. If you have several instances of the same application
where each application uses a set of filesystems, you can see which instances consume
more resources. Execute the iostat -AQ command to see the legacy AIO, which is shown in
Example 9-16. Similarly for POSIX-compliant AIO statistics, use iostat -PQ.
Example 9-16 iostat -AQ output to measure legacy AIO activity by filesystem
[root@p520-tic-3]# iostat -AQ 1 2
aio: avgc avfc maxgc maxfc maxreqs avg-cpu: % user % sys % idle % iowait
0 0 0 0 16384 0.0 0.1 99.9 0.0
aio: avgc avfc maxgc maxfc maxreqs avg-cpu: % user % sys % idle % iowait
0 0 0 0 16384 0.0 0.1 99.9 0.0
If your AIX system is in a SAN environment, you might have so many hdisks that iostat does
not provide much information. We suggest that you use nmon, which can report iostats based
on vpaths or ranks, as discussed in “Interactive nmon options for DS8000 performance
monitoring” on page 363.
For detailed information about the enhancements of the iostat tool in AIX Version 7, see 6.4
“Iostat command enhancement” in the IBM AIX Version 7.1 Differences Guide, SG24-7910:
http://www.redbooks.ibm.com/abstracts/sg247910.html
9.3.4 lvmstat
The lvmstat command reports input and output statistics for logical partitions, logical
volumes, and volume groups. This command is useful in determining the I/O rates to LVM
volume groups, logical volumes, and logical partitions. This command is useful for dealing
with unbalanced I/O situations where the data layout was not considered initially.
To enable statistics collection for all logical volumes in a volume group (in this case, the rootvg
volume group), use the -e option together with the -v <volume group> flag as the following
example shows:
#lvmstat -v rootvg -e
When you do not need to continue to collect statistics with lvmstat, disable it, because it
affects the performance of the system. To disable the statistics collection for all logical
volumes in a volume group (in this case, the rootvg volume group), use the -d option together
with the -v <volume group> flag as the following example shows:
#lvmstat -v rootvg -d
This command disables the collection of statistics on all logical volumes in the volume group.
The first report section generated by lvmstat provides statistics that concern the time since
the statistical collection was enabled. Each later report section covers the time since the
previous report. All statistics are reported each time that lvmstat runs. The report consists of one line per logical volume or logical partition, showing the I/O count and the kilobytes read and written.
The lvmstat tool has powerful options, such as reporting on a specific logical volume or
reporting busy logical volumes in a volume group only. For additional information about
usage, see the following links:
IBM AIX Version 7.1 Differences Guide, SG24-7910-00:
http://www.redbooks.ibm.com/abstracts/sg247910.html
AIX 5L Performance Tools Handbook, SG24-6039
http://www.redbooks.ibm.com/abstracts/sg246039.html?Open
The man page of lvmstat:
http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.
cmds/doc/aixcmds3/lvmstat.htm&tocNode=int_215986
9.3.5 topas
The interactive AIX tool, topas, is convenient if you want to get a quick overall view of the
current activity of the system. A fast snapshot of memory usage or user activity can be a
helpful starting point for further investigation. Figure 9-9 on page 362 contains a sample
topas output.
With AIX 6.1, the topas monitor offers enhanced monitoring capabilities and now also
provides I/O statistics for filesystems:
Enter ff (first f turns it off, the next f expands it) to expand the filesystem I/O statistics.
Type F to get an exclusive and even more detailed view of the filesystem I/O statistics.
Expanded disk I/O statistics can be obtained by typing dd or D in the topas initial window.
9.3.6 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis
resource, and it is no charge. It was written by Nigel Griffiths who works for IBM in the UK. We
use this tool, among others, when we perform client benchmarks. It is available at this
website:
http://www.ibm.com/developerworks/aix/library/au-analyze_aix/
The nmon tool: The nmon tool is not supported. No warranty is given or implied, and you
cannot obtain help or maintenance from IBM.
The nmon tool currently is available in two versions to run on different levels of AIX:
The nmon Version 12e for AIX 6.1 and later
The nmon Version 9 for previous versions of AIX
The interactive nmon tool is similar to monitor or topas, which you perhaps used before to
monitor AIX, but it offers more features that are useful for monitoring the DS8000
performance. We explore these interactive options.
The different options you can select when you run nmon Version 12 are shown in
Example 9-18.
Then, type nmon with the -g flag to point to the map file:
nmon -g /tmp/vg-maps
When nmon starts, press the G key to view statistics for your disk groups. An example of the
output is shown in Example 9-22.
Use the SDDPCM command pcmpath query device to provide a view of your host system
logical configuration on the DS8000. You can, for example, create a nmon disk group of
storage type (DS8000), logical subsystem (LSS), rank, and port to show you unique views
into your storage performance.
Recording nmon information for import into the nmon analyzer tool
A great benefit that the nmon tool provides is the ability to collect data over time to a file and
then to import the file into the nmon analyzer tool:
http://www.ibm.com/developerworks/aix/library/au-nmon_analyser/
To collect nmon data in comma-separated value (csv) file format for easy spreadsheet import:
1. Run nmon with the -f flag. See nmon -h for the details, but as an example, to run nmon for
an hour to capture data snapshots every 30 seconds, use this command:
nmon -f -s 30 -c 120
2. This command creates the output file in the current directory called
<hostname>_date_time.nmon.
The nmon analyzer is a macro-customized Microsoft Excel spreadsheet. After transferring the
output file to the machine that runs the nmon analyzer, simply start the nmon analyzer, enabling
the macros, and click Analyze nmon data. You are prompted to select your spreadsheet and
then to save the results.
Many spreadsheets have fixed numbers of columns and rows. We suggest that you collect up
to a maximum of 300 snapshots to avoid experiencing these issues.
Tip: The use of the CHARTS setting instead of PICTURES for graph output simplifies the
analysis of the data, which makes it more flexible.
When you capture data to a file, the nmon tool disconnects from the shell to ensure that it
continues running even if you log out, which means that nmon can appear to fail, but it is still
running in the background until the end of the analysis period.
9.3.7 fcstat
The fcstat command displays statistics from a specific FC adapter. Example 9-23 shows
the output of the fcstat command.
The “No Command Resource Count” indicates how many times the num_cmd_elems value was
exceeded since AIX was booted. You can continue to take snapshots every 3 - 5 minutes and compare the counts; if the value keeps increasing, consider increasing num_cmd_elems.
9.3.8 filemon
The filemon command monitors a trace of filesystem and I/O system events, and reports
performance statistics for files, virtual memory segments, logical volumes, and physical
volumes. The filemon command is useful to individuals whose applications are thought to be
disk-bound, and who want to know where and why.
The filemon command provides a quick test to determine whether there is an I/O problem by
measuring the I/O service times for reads and writes at the disk and logical volume level.
The filemon command resides in /usr/bin and is part of the bos.perf.tools file set, which
can be installed from the AIX base installation media.
filemon measurements
To provide a complete understanding of filesystem performance for an application, the
filemon command monitors file and I/O activity at four levels:
Logical filesystem
The filemon command monitors logical I/O operations on logical files. The monitored
operations include all read, write, open, and seek system calls, which might result in actual
physical I/O, depending on whether the files are already buffered in memory. I/O statistics
are kept on a per-file basis.
Virtual memory system
The filemon command monitors physical I/O operations (that is, paging) between
segments and their images on disk. I/O statistics are kept on a per segment basis.
Logical volumes
The filemon command monitors I/O operations on logical volumes. I/O statistics are kept
on a per-logical volume basis.
Physical volumes
The filemon command monitors I/O operations on physical volumes. At this level, physical
resource utilizations are obtained. I/O statistics are kept on a per-physical volume basis.
filemon examples
A simple way to use filemon is to run the command that is shown in Example 9-24, which
performs these actions:
Run filemon for 2 minutes and stop the trace.
Store output in /tmp/fmon.out.
Collect only logical volume and physical volume output.
Tip: To set the size of the buffer of option -T, in general, start with 2 MB per logical CPU.
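Example 9-24 is not reproduced here, but a command sequence along the following lines (a sketch; the -T trace buffer size is only an example) achieves what it describes:
filemon -o /tmp/fmon.out -O lv,pv -T 2000000   # start tracing logical and physical volume activity
sleep 120                                      # let the workload run for 2 minutes
trcstop                                        # stop the trace; filemon writes its report to /tmp/fmon.out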
To produce sample output for filemon, we ran a sequential write test in the background, and
started a filemon trace, as shown in Example 9-25. We used the lmktemp command to create
a 2 GB file full of nulls while filemon gathered I/O statistics.
In Example 9-26 on page 368, we look at parts of the /tmp/fmon.out file. When analyzing the
output from filemon, focus on these areas:
Most active physical volume:
– Look for balanced I/O across disks.
– Lack of balance might be a data layout problem.
Look at I/O service times at the physical volume layer:
– Writes to cache that average less than 3 ms are good. Writes averaging significantly
and consistently longer times indicate that write cache is full, and there is a bottleneck
in the disk.
– Reads that average less than 10 ms - 20 ms are good. The disk subsystem read cache
hit rate affects this value considerably. Higher read cache hit rates result in lower I/O
service times, often near 5 ms or less. If reads average greater than 15 ms, it can
indicate a bottleneck somewhere between the host and the disk subsystem, although more often it indicates a bottleneck in the disk subsystem itself.
– Look for consistent I/O service times across physical volumes. Inconsistent I/O service
times can indicate unbalanced I/O or a data layout problem.
– Longer I/O service times can be expected for I/Os that average greater than 64 KB in
size.
– Look at the difference between the I/O service times between the logical volume and
the physical volume layers. A significant difference indicates queuing or serialization in
the AIX I/O stack.
The following fields appear in the filemon report:
util Utilization of the volume (fraction of time busy). The rows are sorted by
this field, in decreasing order. The first number, 1.00, means 100
percent.
skipping...........
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
skipping...........
------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
skipping to end.....................
In the filemon output in Example 9-26 on page 368, we notice these characteristics:
The most active logical volume is /dev/305glv (/interdiskfs); it is the busiest logical volume
with an average data rate of 87 MBps.
The Detailed Logical Volume Status field shows an average write time of 1.816 ms for
/dev/305glv.
The Detailed Physical Volume Stats show an average write time of 1.934 ms for the
busiest disk, /dev/hdisk39, and 1.473 ms for /dev/hdisk55, which is the next busiest disk.
The filemon command is a useful tool to determine where a host spends I/O. More details
about the filemon options and reports are available in the publication AIX 5L Performance
Tools Handbook, SG24-6039:
http://www.redbooks.ibm.com/abstracts/sg246039.html?Open
9.4.1 UFS
UFS is the standard filesystem of Solaris. You can configure a journaling feature, adjust
cache filesystem parameters, implement Direct I/O, and adjust the mechanism of sequential
read ahead.
http://www.solarisinternals.com/si/reading/fs2/fs2.html
http://www.solarisinternals.com/si/reading/sunworldonline/swol-07-1999/swol-07-
filesystem3.html
You can obtain the Oracle Solaris Tunable Parameters Reference Manual at this website:
http://docs.oracle.com/cd/E23824_01/html/821-1450/index.html
For more information about Oracle Solaris commands and tuning options, see the
following website:
http://www.solarisinternals.com
With the Veritas File System, you can adjust the mechanisms of sequential read ahead and
sequential and random write behind. You can tune its buffers to increase performance, and it
supports asynchronous I/O.
Filesystem blocksize
The smallest allocation unit of a filesystem is the blocksize. In VxFS, you can choose from
512 bytes to 8192 bytes. To decide which size is best for your application, consider the
average size of the application files. If the application is a file server and the average size is
about 1 KB, choose a blocksize of 1 KB. But if the application is a database with a few large
files, choose the maximum size of 8 KB. The default blocksize is 2 KB. In addition, when
creating and allocating file space inside of VxFS and by using standard tools, such as mkfile
(Solaris only) or database commands, you might see performance degradations. For
additional information, see these websites:
http://www.symantec.com/business/support/index?page=answerlink&url=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttp%2Fww
w.symantec.com%2Fbusiness%2Fsupport%2Fresources%2Fsites%2FBUSINESS%2Fcontent%2Fsta
ging%2FTECHNICAL_SOLUTION%2F10000%2FTECH10174%2Fen_US%2F2.0%2Ffsed_sag_sol_50_2352
03.pdf&answerid=16777220&searchid=1322121458096
http://www.symantec.com/business/support/resources/sites/BUSINESS/content/staging/
TECHNICAL_SOLUTION/61000/TECH61155/en_US/3.0/305161.doc
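The VxFS blocksize is chosen when the filesystem is created; a sketch (the device path is hypothetical) of creating a filesystem with an 8 KB blocksize:
mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/datadg/vol01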
For the latest updates in ZFS and the administrator guide, see this link:
http://docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html
Storage pools
ZFS implements built-in features of volume management. You define a storage pool and add
disks for that storage pool. It is not necessary to partition the disks and create filesystems on
top of those partitions. Instead, you simply define the filesystems, and ZFS allocates disk
space dynamically. ZFS abstracts the physical storage through a virtual address space, much as virtual memory abstracts physical memory. You can also implement quotas and reserve space for a specific
filesystem.
For detailed information about how to check and configure the checksumming method, see
this link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Checksums
Important: Every time that ZFS issues an I/O write to the disk, it does not know whether
the storage has a nonvolatile random access memory (NVRAM). Therefore, it requests to
flush the data from cache to disk in the storage. If your Solaris Release is 10 or later, you
can disable the flush by setting the following parameter in the /etc/system file: set
zfs:zfs_nocacheflush = 1. Additional details are provided at the following link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH
Dynamic striping
With the storage pool model, after you add more disks to the storage pool, ZFS automatically
redistributes the data among the disks.
Multiple blocksize
In ZFS, there is no need to define a blocksize for each filesystem. ZFS tries to match the
blocksize with the application I/O size. However, if your application is a database, we suggest
that you enforce the blocksize to match the database blocksize. The parameter is recordsize
and can range from 512 bytes to 128 KB. For example, to configure a blocksize of 8 KB, type the command zfs set recordsize=8k <pool name>/<filesystem>.
Cache management
Cache management is implemented by a modified version of an Adaptive Replacement
Cache (ARC) algorithm. By default, it tries to use the real memory of the system while the
utilization increases and there is free memory in the system. Therefore, if your application
also uses a large amount of memory, such as a database, you might need to limit the amount
of memory available for the ZFS ARC. For detailed information and instructions about how to
limit the ZFS ARC, see the following link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
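On Solaris 10, the ARC can be capped with an /etc/system entry; a sketch assuming a limit of about 4 GB is wanted (the value is an example and must suit your memory configuration; a reboot is required):
* /etc/system entry: cap the ZFS ARC at about 4 GB (example value)
set zfs:zfs_arc_max = 0x100000000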
The DS8000 LUNs that are under the control of VxVM are called VM Disks. You can split
those VM Disks in smaller pieces that are called subdisks. A plex looks like a mirror and
consists of a set of subdisks. It is at the plex level that you can configure RAID 0, RAID 5, or
simply concatenate the subdisks. A volume consists of one or more plexes. When you add
more than one plex to a volume, you can implement RAID 1. On top of that volume, you can
create a filesystem or simply use it as a raw device.
To set up the volume layout with the DS8000 LUNs, you can adopt one of the following
strategies:
Storage pool striping: In this case, you spread the workload at the storage level. At the
operating system level, you need to create the plexes with the layout attribute set to
concat, which is the default option when creating a plex.
Striped Plex: A set of LUNs is created in different ranks inside of the DS8000. After the
LUNs are recognized in Solaris, a Disk Group (DG) is created, the plexes are spread
evenly over the LUNs, and the stripe size of a plex is set from 8 MB to 16 MB.
RAID considerations
When using VxVM with the DS8000 LUNs, spread the workload over the several DS8000
LUNs by creating RAID 0 plexes. The stripe size is based on the I/O size of your application.
If your application has I/O sizes of 1 MB, define the stripe sizes as 1 MB. If your application
performs many sequential I/Os, it is better to configure stripe sizes of 4 MB or more to take
advantage of the DS8000 prefetch algorithm. See Chapter 9, “Performance considerations for
UNIX servers” on page 327 for details about RAID configuration.
vxio:vol_maxio
When you use VxVM on the DS8000 LUNs, you must set the VxVM maximum I/O size
parameter (vol_maxio) to match the I/O size of your application or the stripe size of VxVM
RAID 0. If the I/O size of your application is 1 MB and you use the Veritas Volume Manager on
your DS8000 LUNs, edit the /etc/system and add the entry set vxio:vol_maxio=2048. The
value is in blocks of 512 bytes.
9.4.7 MPxIO
MPxIO is the multipathing device driver that comes with Oracle Solaris and is required when
implementing Oracle Solaris Clusters. For additional information, see the following links:
A presentation providing an overview of MPxIO:
http://opensolaris.org/os/project/mpxio/files/mpxio_toi_sio.pdf
The Solaris SAN Configuration and Multipathing Guide:
http://docs.oracle.com/cd/E19253-01/820-1931/820-1931.pdf
Important: For Solaris 10 SPARC, the sd and ssd transfer size setting defaults to maxphys.
Certain Fibre Channel HBAs do not support requests greater than 8 MB. Do not forget to
test the new values before you put them in production.
9.4.10 FC adapter
AMCC (formerly JNI), Emulex, QLogic, and SUN FC adapters are described in the DS8000
Host System Attachment Guide, GC27-2298-02, with suggested performance parameters.
For more information, see the following link:
http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ib
m.storage.ssic.help.doc/f2c_agrs62105inst_1atzyy.html
fcachestat
The fcachestat command and its output are illustrated in Example 9-27.
You can check the actual buffer size with the sysdef command that is shown in Example 9-28.
To change bufhwm, you need to edit the /etc/system and look for the parameter bufhwm.
For additional information and to download this tool, see the following link:
http://www.brendangregg.com/cachekit.html
You can also use sar -a, sar -b, and sar -v to check the DNLC and inode cache utilizations.
Check the following links for more details about how to use sar in Solaris:
The sar -a and sar -b command:
http://dennis_caparas.tripod.com/Configuring_sar_for_your_system.html
directiostat
The directiostat command and its output are illustrated in Example 9-29.
With this tool, you can measure the I/O requests that are executed on filesystems that are mounted with the Direct I/O option enabled.
The vmstat output has five major columns (memory, page, executable, anonymous, and
filesystem). The filesystem column contains three subcolumns:
fpi: File pages in. It tells how many file pages were copied from disk to memory.
fpo: File pages out. It tells how many file pages were copied from memory to disk.
fpf: File pages free. It tells how many file pages were freed during each sample interval.
If you see no anonymous page activity (api/apo) and activity only in the file pages (fpi/fpo), you do not have memory constraints, but too much file page activity occurs and you might need to optimize it. One way is to enable Direct I/O in the filesystems of your application. Another way is to adjust the read-ahead mechanism if that mechanism is enabled, or to adjust the scanner parameters of virtual memory. We suggest the following values:
fastscan: This parameter sets how many memory pages are scanned per second.
Configure it for 1/4 of real memory with a limit of 1 GB.
handspreadpage: This parameter sets the distance between the two hands of the clock algorithm that looks for candidate memory pages to be reclaimed when memory is low. Configure it with the same value set for the fastscan parameter.
maxpgio: This parameter sets the maximum number of pages that can be queued by the
Virtual Memory Manager. Configure it for 1024 if you use eight or more ranks in the
DS8000 and if you use a high-end server.
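A minimal /etc/system sketch for these three parameters follows, assuming a host with 8 GB of memory and an 8 KB page size, so that 1/4 of real memory is capped at the 1 GB limit (131072 pages); note that the tunable is named handspreadpages in /etc/system, and your values depend on your memory size and the number of ranks:
set fastscan=131072
set handspreadpages=131072
set maxpgio=1024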
For detailed information about the options of iostat, see the following link:
http://docs.oracle.com/cd/E23824_01/html/821-1451/spmonitor-4.html
With the DS8000, we suggest that you collect performance information from VM disks and subdisks. To display 10 sets of disk statistics at one-second intervals, use vxstat -i 1 -c 10 -d. To display 10 sets of subdisk statistics at one-second intervals, use vxstat -i 1 -c 10 -s. Discard the first sample, because it reports statistics accumulated since the server was booted.
9.5.5 dtrace
DTrace is not only a trace tool. It is also a framework for dynamically tracing the operating system kernel and the applications that run on top of Solaris 10. You can write your own tools for performance analysis by using the D programming language. The syntax is based on the C programming language with several specific commands for tracing instrumentation. Many scripts are already developed that you can use for performance analysis. You can start by downloading the DTrace Toolkit from the following link:
http://www.brendangregg.com/dtrace.html#DTraceToolkit
Follow the instructions at the website to install the DTrace Toolkit. When installed, set your
PATH environment variable to avoid having to type the full path every time, as shown in
Example 9-32.
One example is a large sequential I/O that might be reaching a limit. See the script in
Example 9-33 on page 381.
(The output of the script in Example 9-33 lists the PID and CMD of the processes that issue the I/O, for example, PID 0 sched and PID 3 fsflush.)
In the previous example, we executed a dd command with a blocksize of 2 MB, but when we measure the I/O activity, we can see that the maximum I/O size is in fact 1 MB and not 2 MB. The maxphys parameter is not set in the /etc/system configuration file, which means that Solaris 10 is using the default value of 1 MB. You can increase the value of maxphys to increase the size of I/O requests.
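For example, to allow physical I/Os up to the 2 MB blocksize that is used in the dd test, you can add the following entry to /etc/system and reboot; the value is in bytes, and you must verify that your HBAs support the larger transfer size, as noted earlier:
set maxphys=2097152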
For additional information about how to use DTrace, check the following links:
Introduction about DTrace:
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Intro
The DTrace Toolkit:
http://www.solarisinternals.com/wiki/index.php/DTraceToolkit
When you plan the DS8000 volume layout for HP-UX host systems, be aware that, due to limitations in the host operating system, DS8000 LUN IDs greater than x'3FFF' are not supported. When you create or assign LUNs and volumes, only LUN and volume IDs less than x'3FFF' can be used, which limits the maximum number of volumes that can be used on a single DS8000 system for HP-UX hosts to 16384.
Select Configuring → Attaching hosts → HP-UX host attachment for initial information about configuring HP-UX hosts for attachment to the DS8000 systems.
Asynchronous I/O
Asynchronous I/O is a feature of HP-UX that is not enabled by default. It allows the application to keep processing while issuing I/O requests, without waiting for a reply, which reduces the application response time. Database applications typically take advantage of this feature. If your application supports asynchronous I/O, enable it in the
operating system as well. For detailed information about how to configure asynchronous I/O,
see the appropriate application documentation:
For Oracle 11g with HP-UX using asynchronous I/O:
http://download.oracle.com/docs/cd/B28359_01/server.111/b32009/appb_hpux.htm#BA
BBFDCI
For Sybase Adaptive Server Enterprise (ASE) 15.0 using asynchronous I/O:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc35823_1500/html
/uconfig/BBCBEAGF.htm
The number of pages of memory allocated for buffer cache use at any specific time is
determined by the system needs, but the two parameters ensure that allocated memory never
drops under dbc_min_pct and does not exceed dbc_max_pct of the total system memory.
The default value for dbc_max_pct is 50%, which is usually too much. If you want to use a
dynamic buffer cache, set the dbc_max_pct value to 25%. If you have 4 GB of memory or
more, start with an even smaller value.
With a large buffer cache, the system is likely to need to page out or shrink the buffer cache to meet the application memory needs. These actions cause additional I/Os to the paging space. You need to prevent that situation and set memory buffers to favor applications over cached files.
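As a sketch, assuming an HP-UX 11i release where dbc_max_pct is still the buffer cache tunable and the kctune utility is available (use kmtune or SAM on older releases), you can check and lower the limit this way:
kctune dbc_max_pct              # display the current setting
kctune dbc_max_pct=25           # limit the dynamic buffer cache to 25% of memory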
The DS8000 LUNs that are under control of LVM are called physical volumes (PVs). The LVM
splits the disk space in smaller pieces that are called physical extents (PEs). A logical volume
(LV) is composed of several logical extents (LEs). A filesystem is created on top of an LV or
simply used as a raw device. Each LE can point to up to two corresponding PEs in LVM
Version 1.0 and up to five corresponding PEs in LVM Version 2.0/2.1, which is how LVM
implements mirroring (RAID 1).
To set up the volume layout with the DS8000 LUNs, you can adopt one of the following
strategies:
Storage pool striping: In this case, you spread the workload at the storage level. At the
operating system level, you need to create the LVs with the inter-policy attribute set to
minimum, which is the default option when creating an LV.
Distributed Allocation Policy: A set of LUNs is created in different ranks inside the DS8000.
When the LUNs are recognized in HP-UX, a VG is created and the LVs are spread
evenly over the LUNs with the option -D. The advantage of this method compared to
storage pool striping is the granularity of data spread over the LUNs. With storage pool
striping, the data is spread in chunks of 1 GB. With the Distributed Allocation Policy, you
can create PE sizes from 8 MB to 16 MB in a VG.
LVM Striping: As with the Distributed Allocation Policy, a set of LUNs is created in different ranks inside the DS8000. When the LUNs are recognized in HP-UX, a VG is created with larger PE sizes, such as 128 MB or 256 MB, and the LVs are spread evenly over the LUNs by setting the LV stripe size from 8 MB to 16 MB. From a performance standpoint, LVM Striping and the Distributed Allocation Policy provide the same results.
In this example, disk4, disk5, disk6, and disk7 are hardware-striped LUNs on different DS8000 extent pools. With 8 GB LUNs and a 16 MB PE size, each LUN provides about 500 physical extents (pe1 - pe500). The logical volume /dev/inter-disk_lv is made up of eight logical extents: (le1 + le2 + le3 + le4 + le5 + le6 + le7 + le8) = 8 x 16 MB = 128 MB.
The first step is to initialize the PVs with the following commands:
pvcreate /dev/rdisk/disk4
pvcreate /dev/rdisk/disk5
pvcreate /dev/rdisk/disk6
pvcreate /dev/rdisk/disk7
The next step is to create the VG. We suggest that you create a VG with a set of DS8000
LUNs, and each LUN is in a different extent pool. If you add a set of LUNs to a host, define
another VG, and so on. Follow these steps to create the VG data01vg with a PE size of 16 MB:
1. Create the directory /dev/data01vg with a special character file called group:
mkdir /dev/data01vg
mknod /dev/data01vg/group c 70 0x020000
2. Create the VG with the following command:
vgcreate -g data01pvg01 -s 16 /dev/data01vg /dev/disk/disk4 /dev/disk/disk5 /dev/disk/disk6 /dev/disk/disk7
Then, you can create the LVs with the option -D, which stripes the logical volume from one LUN to the next LUN in chunks the size of the physical extent (PE) size of the volume group, for instance:
lvcreate -D y -l 16 -m 1 -n inter-disk_lv -s g /dev/data01vg
Notice that /dev/striped_lv is also made up of eight 256 MB physical extents, but each extent is then subdivided into 64 chunks of 4 MB (only three of the 4 MB chunks are shown per logical extent for space reasons).
(The figure at this point shows how the logical extents of /dev/striped_lv map to the 256 MB physical extents pe1 - pe32 on the 8 GB LUNs disk4, disk5, disk6, and disk7. The LUNs are hardware-striped LUNs on different DS8000 extent pools; with 8 GB LUNs and a 256 MB PE size, each LUN provides about 32 physical extents.)
As with LVM Distributed Allocation Policy, the first step is to initialize the PVs with the following
commands:
pvcreate /dev/rdisk/disk4
pvcreate /dev/rdisk/disk5
pvcreate /dev/rdisk/disk6
pvcreate /dev/rdisk/disk7
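The next steps mirror the Distributed Allocation Policy example: create the VG group file as shown earlier (with a unique minor number), create the VG with a large PE size, and create the striped LVs. The following sketch is illustrative only; the VG name data02vg, the 2048 MB LV size, and the 4 MB stripe size (-I is specified in KB) are assumptions based on the preceding description:
vgcreate -s 256 /dev/data02vg /dev/disk/disk4 /dev/disk/disk5 /dev/disk/disk6 /dev/disk/disk7
lvcreate -i 4 -I 4096 -L 2048 -n striped_lv /dev/data02vg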
9.6.5 PV Links
PV Links is the multipathing solution that comes with HP-UX. It primarily provides a failover
capability, but if the storage allows it, you can use the alternate path for load balancing.
9.6.10 FC adapter
The FC adapter provides the connection between the host and the storage devices.
The output in Example 9-35 (sar -u 1 5 on host rx4640-1) shows CPU information every 1 second, five times; the Average line reports %usr 1, %sys 10, %wio 39, and %idle 50. Not all sar options are the same for AIX, HP-UX, and Solaris, but the sar -u output is the same.
To check whether a system is I/O-bound, the important column to check is the %wio column.
The %wio column includes the time that is spent waiting on I/O from all drives, including
internal and DS8000 logical disks. If %wio values exceed 40, you need to investigate to
understand the storage I/O performance. The next action is to look at I/O service times
reported by the sar -Rd command (Example 9-36).
The avwait and avserv columns show the average times spent in the wait queue and service
queue. The avque column represents the average number of I/Os in the queue of that device.
With HP-UX 11i v3, the sar command has new options to monitor the performance:
-H reports I/O activity by HBA
-L reports I/O activity by lunpath
-R with option -d splits the number of I/Os per second between reads and writes
-t reports I/O activity by tape device
For additional information about the HP-UX sar command, see this website:
http://docs.hp.com/en/B2355-60130/sar.1M.html
To configure a system to collect data for sar, you can run the sadc command or the modified
sa1 and sa2 commands. The following list includes more information about the sa commands
and how to configure sar data collection:
The sa1 and sa2 commands are shell procedure variants of the sadc command.
The sa1 command collects and stores binary data in the /var/adm/sa/sadd file, where dd
is the day of the month.
The sa2 command is designed to be run automatically by the cron command and run concurrently with the sa1 command. The sa2 command generates a daily report called /var/adm/sa/sardd. It also removes reports that are more than one week old.
The /var/adm/sa/sadd file contains the daily data file, and dd represents the day of the
month. And, /var/adm/sa/sardd contains the daily report file, and dd represents the day
of the month. Note the r in /var/adm/sa/sardd for sa2 output.
To configure a system to collect data, edit the root crontab file. For our example, if we want to
run sa1 every 15 minutes every day, and we want the sa2 program to generate ASCII versions
of the data immediately before midnight, we change the cron schedule to look like this
example:
0,15,30,45 * * * 0-6 /usr/lib/sa/sa1
55 23 * * 0-6 /usr/lib/sa/sa2 -A
You can view the collected performance information from these files with this command:
sar -f /var/adm/sa/sadd
You can also focus on a certain period, for example, 8 a.m. to 5:15 p.m., with this command:
sar -s 8:00 -e 17:15 -f /var/adm/sa/sadd
Remember, sa2 removes the data collection files that are over a week old as scheduled in
cron.
You can save sar information to view later with this command:
sar -A -o data.file interval count > /dev/null & (SAR data saved to data.file)
All data is captured in binary form and saved to a file (data.file). The data can then be
selectively displayed with the sar command by using the -f option.
sar summary
The sar tool helps you to tell quickly if a system is I/O-bound. Remember though that a busy
system can mask I/O issues, because io_wait counters are not increased if the CPUs are
busy.
The sar tool can help you to save a history of I/O performance so that you have a baseline
measurement for each host. You can then verify whether tuning changes make a difference.
You might want, for example, to collect sar data for a week and create reports: 8 a.m. - 5 p.m.
Monday - Friday if that time is the prime time for random I/O, and 6 p.m. - 6 a.m. Saturday -
Sunday if those times are batch/backup windows.
9.7.2 vxstat
The vxstat tool is a performance tool that comes with VxVM. For additional information, see
9.5.4, “vxstat” on page 380.
9.7.3 HP Perfview/MeasureWare
HP Perfview/MeasureWare is good for recording performance measurements and maintaining a baseline of system performance data for reference. The HP Perfview/MeasureWare tool can show statistics for each physical disk in graphical format, and you can change the time scale easily.
3. Next, run sequential reads and writes (using the dd command, for example) to all of the
vpath devices (raw or block) for about an hour. Then, look at your SAN infrastructure to
see how it performs.
Look at the UNIX error report. Problems show up as storage errors, disk errors, or adapter errors. If there are problems, they are not too hard to identify in the error report, because there are many errors. The source of the problem can be hardware problems on the storage side of the SAN, Fibre Channel cables or connections, down-level device drivers, or device (HBA) microcode. If you see errors similar to the errors shown in Example 9-38, stop and fix them.
Ensure that after you run dd commands on all your vpaths for one hour, there are no
storage errors in the UNIX error report.
4. Next, issue the datapath query device command to see whether SDD correctly balances the load across paths to the LUNs. In our example, the output starts with the following line:
Total Devices : 16
Check to ensure that for every LUN, the counters under the Select column are the same
and that there are no errors.
5. Next, randomly check the sequential read speed of the raw vpath device. The following
command is an example of the command run against a LUN called vpath0. For the LUNs
that you test, ensure that they each yield the same results:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
Tip: For this dd command, for the first time that it is run against rvpath0, the I/O must
be read from disk and staged to the DS8000 cache. The second time that this dd
command is run, the I/O is already in cache. Notice the shorter read time when we get
an I/O cache hit.
If any of these LUNs are on ranks that are also used by another application, you see a variation in the throughput. If there is a large variation in the throughput, you need to give further consideration to the workloads that share those ranks.
If everything looks good, continue with the configuration of volume groups and logical
volumes.
For HP-UX, use the prealloc command instead of lmktemp for AIX to create large files. For
Oracle Solaris, use the mkfile command.
Tip: The prealloc command for HP-UX and the lmktemp command for AIX have a 2 GB
size limitation. Those commands are not able to create a file greater than 2 GB in size. If
you want a file larger than 2 GB for a sequential read test, concatenate a couple of 2 GB
files.
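For example, to create a 2 GB test file for the sequential read test (the filesystem path is hypothetical), you might use one of the following commands:
prealloc /testfs/seqread.tst 2147483647     # HP-UX: size in bytes, limited to 2 GB
mkfile 2g /testfs/seqread.tst               # Oracle Solaris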
Disk throughput and I/O response time for any server connected to a DS8000 are affected by
the workload and configuration of the server and the DS8000, data layout and volume
placement, connectivity characteristics, and the performance characteristics of the DS8000.
While the health and tuning of all of the system components affect the overall performance
management and tuning of a Windows server, this chapter limits its discussion to the
following topics:
General Windows performance tuning
I/O architecture overview
Filesystem
Volume management
Multipathing and the port layer
Host bus adapter (HBA) settings
Windows Server 2008 I/O enhancements
I/O performance measurement
Problem determination
Load testing
For detailed instructions about these tuning suggestions, see the following publications:
Tuning IBM System x Servers for Performance, SG24-5287
Tuning Windows Server 2003 on IBM System x Servers, REDP-3943
http://msdn.microsoft.com/en-us/windows/hardware/gg463394
http://msdn.microsoft.com/en-us/windows/hardware/gg463392
To initiate an I/O request, an application issues an I/O request by using one of the supported
I/O request calls. The I/O manager receives the application I/O request and passes the I/O
request packet (IRP) from the application to each of the lower layers that route the IRP to the
appropriate device driver, port driver, and adapter-specific driver.
Windows server filesystems can be configured as file allocation table (FAT), FAT32, or NTFS.
The file structure is specified for a particular partition or logical volume. A logical volume can
contain one or more physical disks. All Windows volumes are managed by the Windows
Logical Disk Management utility.
For additional information relating to the Windows Server 2003 and Windows Server 2008 I/O
stacks and performance, see the following documents:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb/Storport.doc
http://download.microsoft.com/download/5/b/9/5b97017b-e28a-4bae-ba48-174cf47d23cd/STO089_WH06.ppt
I/O priorities
The Windows Server 2008 I/O subsystem provides a mechanism to specify I/O processing
priorities. Windows primarily uses this mechanism to prioritize critical I/O requests over
background I/O requests. API extensions exist to provide application vendors with file-level I/O prioritization. For more information, see this website:
http://blogs.technet.com/b/askperf/archive/2008/02/07/ws2008-memory-management-dynamic-kernel-addressing-memory-priorities-and-i-o-handling.aspx
10.4 Filesystem
A filesystem is a part of the operating system that determines how files are named, stored,
and organized on a volume. A filesystem manages files, folders, and the information needed
to locate and access these files and folders for local or remote users.
Important: NTFS filesystem compression might seem to be the easiest way to increase the amount of available capacity. However, it is strongly discouraged in enterprise environments. Filesystem compression consumes considerable disk and processor resources and dramatically increases read and write response times. For better capacity utilization, consider the DS8000 Thin Provisioning technology and IBM Data Deduplication technologies.
Start sector offset: The start sector offset must be 256 KB due to the stripe size on the
DS8000. Workloads with small, random I/Os (<16 KB) are unlikely to experience any
significant performance improvement from sector alignment on the DS8000 logical unit
numbers (LUNs).
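If you need to create an aligned partition manually, a hedged sketch with diskpart follows (Windows Server 2003 SP1 and later; disk 2 is the DS8000 disk from our example, and the align value is in KB):
diskpart
DISKPART> select disk 2
DISKPART> create partition primary align=256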
For additional information about the paging file, see the following links:
http://support.microsoft.com/kb/889654
http://technet.microsoft.com/en-us/magazine/ff382717.aspx
For additional information that relates to VxVM, see Veritas Storage Foundation High
Availability for Windows:
http://www.symantec.com/business/storage-foundation-for-windows
Table 10-1 on page 404 demonstrates examples of typical Windows server workloads,
categorizes them as potential candidates for hybrid pool usage, and describes the goals of
priority management.
Suggestions: The example applications listed in Table 10-1 are examples only and not
specific rules.
The prior approach of workload isolation at the rank level might work for workloads with a low skew factor or for some specific workloads. You can also use this approach if you are confident in planning the workload and volume layout.
It also provides I/O load balancing. For each I/O request, SDD dynamically selects one of the
available paths to balance the load across all possible paths.
To receive the benefits of path balancing, ensure that the disk drive subsystem is configured
so that there are multiple paths to each LUN. By using multiple paths to each LUN, you can
benefit from the performance improvements from SDD path balancing. This approach also
prevents the loss of access to data in the event of a path failure.
We describe the Subsystem Device Driver in further detail in “Subsystem Device Driver” on
page 317.
You can obtain additional information about SDDDSM in the SDD User’s Guide:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303
In Windows Server 2003, the MPIO drivers are provided as part of the SDDDSM package. On
Windows Server 2008, they ship with the OS.
SDDDSM: For non-clustered environments, we suggest that you use SDDDSM for its
performance and scalability improvements.
To configure the HBA, see the IBM System Storage DS8700 and DS8800 Introduction and
Planning Guide, GC27-2297-07. This guide contains detailed procedures and recommended
settings. You also need to read the readme file and manuals for the driver, BIOS, and HBA.
Obtain a list of supported HBAs, firmware, and device driver information at this website:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.w
ss?start_over=yes
Newer versions: When configuring the HBA, we strongly advise that you install the newest version of the driver and the BIOS. The newer version includes functional enhancements and problem fixes so that performance and reliability, availability, and serviceability (RAS) can improve.
The application layer is monitored and tuned with the application-specific tools and metrics
available for monitoring and analyzing application performance on Windows servers.
Application-specific objects and counters are outside the scope of this book.
The I/O Manager and filesystem levels are controlled with the built-in Windows tool, which
is available in Windows Performance Console (perfmon).
The volume manager level can be also monitored with perfmon. However, we do not
suggest that you use any logical volume management in Windows servers.
Fibre Channel port level multipathing is monitored with the tools provided by the
multipathing software: IBM SDD, SDDPCM, or Veritas DMP drivers.
http://www.emulex.com/support.html
http://solutions.qlogic.com/KanisaSupportSite/supportcentral/supportcentral.do?id=m1
The SAN fabric level and the DS8000 level are monitored with the IBM Tivoli Storage
Productivity Center and the DS8000 built-in tools. Because Tivoli Storage Productivity
Center provides more functions to monitor the DS8000 systems, we suggest that you use
it for monitoring and analysis.
Table 10-2 describes the key I/O-related metrics that are reported by perfmon.
Table 10-2 Performance monitoring counters for PhysicalDisk and other objects
For each of the following counters, the normal values depend on the workload; values are critical when they approach the limits of the volume, rank, and extent pool:
Disk Transfers/sec: The momentary number of disk transfers per second during the collection interval.
Disk Reads/sec: The momentary number of disk reads per second during the collection interval.
Disk Bytes/sec: The momentary number of bytes per second during the collection interval.
Disk Read Bytes/sec: The momentary number of bytes read per second during the collection interval.
Rules
We provide the following rules based on our field experience. Before using these rules for
anything specific, such as a contractual service-level agreement (SLA), you must carefully
analyze and consider these technical requirements: disk speeds, RAID format, workload
variance, workload growth, measurement intervals, and acceptance of response time and
throughput variance. We suggest these rules:
Write and read response times in general must be as specified in Table 10-2 on page 409.
There must be a definite correlation between the counter values; therefore, an increase of one counter value needs to lead to an increase of the others connected to it. For example, an increase of the Transfers/sec counter leads to an increase of the Average sec/Transfer counter.
The Performance console is a snap-in tool for the Microsoft Management Console (MMC).
You use the Performance console to configure the System Monitor and Performance Logs
and Alerts tools.
You can open the Performance console by clicking Start → Programs → Administrative Tools → Performance or by typing perfmon on the command line.
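Alternatively, you can create a disk counter log from the command line. The following logman sketch is illustrative only; the collection name, the 15-second interval, and the output path are assumptions:
logman create counter ds8k_disk -c "\PhysicalDisk(*)\*" -si 00:00:15 -o C:\PerfLogs\ds8k_disk
logman start ds8k_disk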
Figure 10-5 shows several disks. To identify them, right-click the name and click the
Properties option on the right-click menu on each of them. Disks from the DS8000 show IBM
2107 in the properties, which is the definition of the DS8000 machine-type. So, in this
example, our disk from the DS8000 is Disk 2. See Figure 10-6 on page 415.
Multi-Path Disk Device means that you are running SDD. You can also check for SDD from
the Device Manager option of the Computer Management snap-in. See Figure 10-7.
In Figure 10-7, you see several devices and one SDD that is running.
Use the datapath query device command to show the disk information in the SDDDSM
console (Example 10-1).
Example 10-1 shows the disk information. Disk 2 has serial number 75V1818601. The last four digits are the volume ID, which is 8601 in this example. This disk is connected with two FC ports of the FC adapter.
List the worldwide port names (WWPNs) of the ports with the datapath query wwpn command (Example 10-2).
In Example 10-2, there are two WWPNs, for Port 2 and Port 4.
Identify the disk in the DS8000 with the DSCLI console (Example 10-3).
Example 10-3 Listing the volumes in the DS8800 DSCLI console (output truncated)
dscli> lsfbvol
Name ID accstate datastate configstate deviceMTM datatype extpool cap (2^30B) cap (10^9B) cap (blocks)
===========================================================================================================
- 8600 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8601 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8603 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8604 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8605 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8606 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8607 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
Example 10-3 shows the output of the DSCLI command lsfbvol that lists all the fixed block (FB) volumes in the system. Our volume has ID 8601. It is shown in bold. It is created on extent pool number P4.
Next, list the ranks allocated with this volume with command showfbvol -rank Vol_ID
(Example 10-4).
Example 10-4 shows the output of the command. Volume 8601 is the extent
space-efficient (ESE) volume, with virtual capacity of 100 GB, that occupies four extents (two
from each) from ranks R17 and R18. We have 4 GB of occupied space. Extent pool P4 is
under the Easy Tier automatic management.
Check for the arrays, RAID type, and the DA pair allocation (Example 10-5).
dscli> lsarray
Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B)
======================================================================
A0 Assigned Normal 5 (6+P+S) S1 R0 0 300.0
A1 Assigned Normal 5 (6+P+S) S2 R1 0 300.0
A2 Assigned Normal 5 (6+P+S) S3 R17 2 600.0
A3 Assigned Normal 5 (6+P+S) S4 R18 2 600.0
A4 Unassigned Normal 5 (6+P+S) S5 - 2 600.0
Example 10-5 shows the array and DA pair allocation with the lsarray and lsrank commands. Ranks R17 and R18 relate to arrays A2 and A3 and to the 600 GB Enterprise drives on DA pair 2.
From Example 10-4 on page 416, we know that Volume 8601 is in Volume Group V8.
We can list the host connection properties (Example 10-6).
With the command lshostconnect -volgrp VolumeGroup_ID, we can list the ports to which
this volume group is connected. See Example 10-6 on page 417. This volume group uses
host connections with IDs 0008 and 0009 and the worldwide port names (WWPNs) that are
shown in bold. These WWPNs are the same as the WWNs in Example 10-2 on page 416.
List the ports that are used in the disk system for the host connections (Example 10-7).
dscli> lsioport
ID WWPN State Type topo portgrp
===============================================================
I0000 500507630A00029F Online Fibre Channel-SW SCSI-FCP 0
I0001 500507630A00429F Online Fibre Channel-SW FC-AL 0
I0002 500507630A00829F Online Fibre Channel-SW SCSI-FCP 0
I0003 500507630A00C29F Online Fibre Channel-SW SCSI-FCP 0
I0004 500507630A40029F Online Fibre Channel-SW FICON 0
I0005 500507630A40429F Online Fibre Channel-SW SCSI-FCP 0
I0006 500507630A40829F Online Fibre Channel-SW SCSI-FCP 0
I0007 500507630A40C29F Online Fibre Channel-SW SCSI-FCP 0
Example 10-7 shows how to obtain the WWPNs and port IDs with the showhostconnect and
lsioport commands. All of the information that we need is in bold.
After these steps, we have all the configuration information for a single disk in the system.
After the performance data is correlated to the DS8000 LUNs and reformatted, open the
performance data file in Microsoft Excel. It looks similar to Figure 10-8.
DATE TIME Subsystem LUN Serial Disk Disk Reads/sec Avg Read RT(ms) Avg Total Time Avg Read Queue Length Read KB/sec
11/3/2008 13:44:48 75GB192 75GB1924 Disk6 1,035.77 0.612 633.59 0.63 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk2 1,035.75 0.613 634.49 0.63 66,288.07
11/3/2008 13:44:48 75GB192 75GB1924 Disk3 1,035.77 0.612 633.87 0.63 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk5 1,035.77 0.615 637.11 0.64 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk4 1,035.75 0.612 634.38 0.63 66,288.07
11/3/2008 13:44:48 75GB192 75GB1924 Disk1 1,035.77 0.612 633.88 0.63 66,289.14
11/3/2008 14:29:48 75GB192 75GB1924 Disk6 1,047.24 5.076 5,315.42 5.32 67,023.08
11/3/2008 14:29:48 75GB192 75GB1924 Disk2 1,047.27 5.058 5,296.86 5.30 67,025.21
11/3/2008 14:29:48 75GB192 75GB1924 Disk3 1,047.29 5.036 5,274.30 5.27 67,026.28
11/3/2008 14:29:48 75GB192 75GB1924 Disk5 1,047.25 5.052 5,291.01 5.29 67,024.14
11/3/2008 14:29:48 75GB192 75GB1924 Disk4 1,047.29 5.064 5,303.36 5.30 67,026.28
11/3/2008 14:29:48 75GB192 75GB1924 Disk1 1,047.29 5.052 5,290.89 5.29 67,026.28
11/3/2008 13:43:48 75GB192 75GB1924 Disk6 1,035.61 0.612 634.16 0.63 66,279.00
11/3/2008 13:43:48 75GB192 75GB1924 Disk2 1,035.61 0.612 633.88 0.63 66,279.00
11/3/2008 13:43:48 75GB192 75GB1924 Disk3 1,035.61 0.615 636.72 0.64 66,279.00
A quick look at the compiled data in Figure 10-8 shows a large increase in response time without a corresponding increase in the number of IOPS (Disk Reads/sec). This counter shows that a problem occurred, which we confirm with the increased Queue Length value. We must look at the drives that show the response time increase and collect additional data for those drives.
Because we have a possible reason for the read response time increase, we can specify the
further steps to confirm it:
Gather additional performance data for the volumes from the Windows server, including
write activity.
Gather performance data from the back end of the disk system on those volumes for any
background activity or secondary operations.
Examine the load balancing policy on the disk paths (in our case, it looks good).
Examine the periodic processes initiated in the application. There might be activity on the
log files.
For database applications, separate the log files from the main data and indexes.
Check for any other activity on that drive that can cause an increase of the write I/Os.
You can see how even a small amount of collected performance data can help you detect performance problems early and quickly identify the further steps.
At the disk subsystem level, there can be bottlenecks on the rank level, extent pool level,
device adapter pair level, and cache level that lead to problems on the volume level.
Table 10-3 on page 421 describes the reasons for the problems on different levels.
Rank level
Possible problems:
1. Rank IOPS capability exceeded or rank bandwidth capability exceeded.
2. The RAID type does not fit the workload type.
3. The disk type is wrong.
4. Physical problems with the disks in the rank.
Possible solutions:
1. Split the workload between several ranks organized into one extent pool with the rotate extents feature. If already organized this way, manually rebalance the ranks.
2. Change the RAID level to a better performing RAID level (RAID 5 to RAID 10, for example) or migrate extents to another extent pool with better conditions.
3. Migrate extents to the better performing disks.
4. Fix the problems with the disks.
Extent pool level
Possible problems:
1. The extent pool capability reached its maximum.
2. Conflicting workloads are mixed in the same extent pool.
3. No Easy Tier management for this pool.
4. One of the ranks in the pool is overloaded.
5. Physical problems with the disks in the rank.
Possible solutions:
1. Add more ranks to the pool; examine the STAT reports for the recommendations; add more tiers in the pool; or benefit from hot promotion and cold demotion.
2. Split workloads to separate pools; split one pool into two dedicated to both processor complexes; examine the STAT data and add the required tier; or set priorities for the workloads and enable IOPM.
3. Start Easy Tier for this pool by following the recommendations from the STAT tool.
4. Perform the extent redistribution for the pool; start Easy Tier and follow the recommendations from the STAT tool; or use the rotate extents method.
5. Fix the problems with the disks or remove the rank from the pool.
Cache level
Possible problems:
1. Cache-memory limits are reached.
2. The workload is not “cache-friendly”.
3. A large number of write requests to a single volume (rank or extent pool).
Possible solutions:
1. Upgrade the cache memory; add more ranks to the extent pool; enable Easy Tier; or split extent pools evenly between CECs.
2. Add more disks; benefit from micro-tiering to unload the 15K rpm drives; or tune the application if possible.
3. Split the pools and ranks evenly between CECs to be able to use all the cache memory.
At the application level, there can be bottlenecks in the application, multipathing drivers,
device drivers, zoning misconfiguration, or adapter settings. However, a Microsoft
environment is self-tuning and many problems might be fixed without any indication. Windows
can cache many I/Os and serve them from cache. It is important to maintain a large amount
of free Windows Server memory for peak usage. Also, the paging file must be set up based
on our suggestions in 10.4.3, “Paging file” on page 401.
For the other Microsoft applications, follow the suggestions in Table 10-1 on page 404. Also, consider the following points:
To avoid bottlenecks on the SDDDSM side, maintain a balanced use of all the paths and
keep them active always. See Example 10-8. You can see the numbers for reads and
writes on each adapter, which are nearly the same.
SAN zoning, cabling, and FC-adapter settings must be done according to the IBM System
Storage DS8700 and DS8800 Introduction and Planning Guide, GC27-2297-07, but
remember not to have more than four paths per logical volume.
After you detect a disk bottleneck, you might perform several of these actions:
If the disk bottleneck is a result of another application in the shared environment that
causes disk contention, request a LUN on a less utilized rank and migrate the data from
the current rank to the new rank. Start by using Priority Groups.
If the disk bottleneck is caused by too much load that is generated from the Windows
Server to a single DS8000 LUN, spread the I/O activity across more DS8000 ranks, which
might require the allocation of additional LUNs. Start Easy Tier for the volumes and
migrate to hybrid pools.
For more information about Windows Server disk subsystem tuning, see the following
document:
http://www.microsoft.com/whdc/archive/subsys_perf.mspx
This chapter introduces the relevant logical configuration concepts needed to attach VMware
ESX/ESXi Server to a DS8800/DS8700 and focuses on performance-relevant configuration
and measuring options. For further information about how to set up ESX/ESXi Server with the
DS8000, see IBM System Storage DS8000 Architecture and Implementation, SG24-8886.
You can obtain general suggestions about how to set up VMware with IBM hardware from
Tuning IBM System x Servers for Performance, SG24-5287.
The main difference between ESX and ESXi is that ESXi is an embedded version of ESX that is installed on a hardware component and acts like firmware, without the traditional service console. Among other differences, ESXi does not support SAN boot. You can obtain more
information about a comparison of the versions at this website:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC
&externalId=1015000
VMware ESX/ESXi Server supports the use of external storage that can reside on a DS8000
system. The DS8000 storage is typically connected by Fibre Channel (FC) and accessed
over a SAN. Each logical volume that is accessed by a ESX/ESXi Server is configured in a
specific way, and this storage can be presented to the virtual machines (VMs) as virtual disks.
To understand how storage is configured in ESX Server, you must understand the layers of
abstraction that are shown in Figure 11-1.
(Figure 11-1 shows these layers of abstraction: the virtual disks (vmdk files) presented to the virtual machine, the VMFS volume managed by ESX Server, and the external storage below it.)
For VMware to use external storage, VMware needs to be configured with logical volumes that are defined in accordance with the expectations of the users, which might include the use of RAID or striping at a storage hardware level. Striping at a storage hardware level is preferred, because the DS8800/DS8700 can combine the Easy Tier and IOPM mechanisms.
At the ESX Server layer, these logical volumes can be addressed as a VMware ESX/ESXi
Server File System (VMFS) volume or as a raw disk that uses Raw Device Mapping (RDM). A
VMFS volume is a storage resource that can serve several VMs as well as several ESX
Servers as consolidated storage. An RDM volume, however, is intended for usage as isolated
storage by a single VM.
Two options exist to use these logical drives within vSphere Server:
Formatting these disks with the VMFS: This option is the most common option, because a
number of features require that the virtual disks are stored on VMFS volumes.
Passing the disk through to the guest OS as a raw disk. No further virtualization occurs.
Instead, the OS writes its own filesystem onto that disk directly as though it is in a
stand-alone environment without an underlying VMFS structure.
The VMFS volumes house the virtual disks that the guest OS sees as its real disks. These virtual disks are in the form of a file with the extension .vmdk. The guest OS either reads from and writes to the virtual disk file (.vmdk) or writes through the VMware ESX/ESXi Server abstraction layer to a raw disk. In either case, the guest OS considers the disk to be real.
Next, in Figure 11-2, we compare VMware VMFS volumes to logical volumes, so that you can
understand the logical volumes for a DS8800/DS8700 as references to volume IDs, for
example, 1000, 1001, and 1002.
On the Virtual Machine layer, you can configure one or several Virtual Disks (VMDKs) out of a
single VMFS volume. These Virtual Disks can be configured for use by several VMs.
The virtual machine disks are stored as files within a VMFS. When a guest operating system
issues a Small Computer System Interface (SCSI) command to its virtual disks, the VMware
virtualization layer converts this command to VMFS file operations. From the standpoint of the
virtual machine operating system, each virtual disk is recognized as a direct-attached SCSI
drive connected to a SCSI adapter. Device drivers in the virtual machine operating system
communicate with the VMware virtual SCSI controllers. Figure 11-3 on page 428 illustrates
the virtual disk mapping within VMFS.
(Figure 11-3 shows this mapping: several virtual disks, for example disk1, disk2, and disk3, are stored as files within a VMFS volume on LUN0.)
VMFS is optimized to run multiple virtual machines as one workload to minimize disk I/O
overhead. A VMFS volume can be spanned across several logical volumes, but there is no
striping available to improve disk throughput in these configurations. Each VMFS volume can
be extended by adding additional logical volumes while the virtual machines use this volume.
Important: A VMFS volume can be spanned across several logical volumes, but there is
no striping available to improve disk throughput in these configurations. With Easy Tier,
hot/cold extents can be promoted or demoted, and you can achieve superior performance
versus economics on VMware ESX/ESXi hosts as well.
An RDM is implemented as a special file in a VMFS volume that acts as a proxy for a raw
device. An RDM combines the advantages of direct access to physical devices with the
advantages of virtual disks in the VMFS. In special configurations, you must use RDM raw
devices, such as in Microsoft Cluster Services (MSCS) clustering, by using virtual machine
snapshots, or by using VMotion, which enables the migration of virtual machines from one
datastore to another with zero downtime.
With RDM volumes, ESX/ESXi Server supports the use of N_Port ID Virtualization (NPIV).
This host bus adapter (HBA) virtualization technology allows a single physical HBA port to
function as multiple logical ports, each with its own worldwide port name (WWPN). This
function can be helpful when you migrate virtual machines between ESX/ESXi Servers.
(The figure at this point shows two virtual machines on an ESX Server with HBA1 and HBA2: one accesses its .vmdk virtual disk on a VMFS volume that uses LUN0 and LUN1, and the other accesses LUN4 directly through RDM.)
Example 11-1 Creating volume groups and host connections for VMware hosts
dscli> mkvolgrp -type scsimap256 VMware_Host_1_volgrp_1
CMUC00030I mkvolgrp: Volume group V19 successfully created.
dscli> mkhostconnect -wwpn 21000024FF2D0F8D -hosttype VMware -volgrp V19 -desc "Vmware host1 hba1"
Vmware_host_1_hba_1
CMUC00012I mkhostconnect: Host connection 0036 successfully created.
The abbreviation vmhba refers to the physical HBA types: either a Fibre Channel HBA, a
SCSI adapter, or even an iSCSI initiator. The SCSI target number and SCSI LUN are
assigned during the scanning of the HBAs for available storage and usually do not change
later. The fourth number indicates a partition on a disk that a VMFS datastore occupies and
must never change for a selected disk. Thus, this example, vmhba2:0:1:1, refers to the first
partition on SCSI LUN1, SCSI target 1, and is accessed through HBA 2.
Figure 11-5 Storage Adapters properties view in the Virtual Infrastructure Client (VI Client)
ESX/ESXi Server provides built-in multipathing support, which means that it is not necessary
to install any additional failover driver. Any external failover drivers, such as subsystem device
drivers (SDD), are not supported for VMware ESX/ESXi. Since ESX/ESXi 4.0, it supports
path failover and the round-robin algorithm.
The default multipath policy for ALUA devices since ESX/ESXi 5 is MRU.
The multipathing policy and the preferred path can be configured from the VI Client or by
using the command-line tool esxcfg-mpath or esxcli (newer versions). See Table 11-1 for
command differences among the ESX/ESXi versions.
Figure 11-6 shows how the preferred path is changed from the VI Client.
By using the Fixed multipathing policy, you can implement static load balancing if several LUNs are attached to the VMware ESX/ESXi Server. The multipathing policy is set on a per-LUN basis.
Important: Remember that before zoning your VMware host to a DS8000, you must ensure that each LUN has a minimum of two available paths (for redundancy). Due to limitations on VMware, the maximum number of paths on a host is 1024, the number of paths to a LUN is limited to 32, and the maximum number of available LUNs per host is 256. Also, the maximum size of a LUN is 2 TB minus 256 bytes. So, plan the LUN size and the number of paths available to each VMware host carefully to avoid future problems with provisioning to your VMware hosts.
For example, when you want to configure four LUNs, assign the preferred path of LUN0
through the first path, the one for LUN1 through the second path, the preferred path for LUN2
through the third path, and the one for LUN3 through the fourth path. With this method, you
can spread the throughput over all physical paths in the SAN fabric. Thus, this method results
in optimized performance for the physical connections between the ESX/ESXi Server and the
DS8000.
If the workload varies greatly between the accessed LUNs, it might be a good approach to
monitor the performance on the paths and adjust the configuration according to the workload.
It might be necessary to assign one path as preferred to only one LUN with a high workload
but to share another path as preferred between five separate LUNs that show moderate
workloads. This static load balancing works only if all paths are available. As soon as one
path fails, all LUNs that selected this failing path as preferred fail over to another path and put
additional workload onto those paths. Furthermore, there is no capability to influence the
failover algorithm to which path the failover occurs.
When the active path fails, for example, due to a physical path failure, I/O might pause for
about 30 - 60 seconds until the FC driver determines that the link is down and fails over to one
of the remaining paths. This behavior can cause the virtual disks used by the operating
systems of the virtual machines to appear unresponsive. After failover is complete, I/O
resumes normally. The timeout value for detecting a failed link can be adjusted; it is set in the
HBA BIOS or driver and the way to set this option depends on the HBA hardware and vendor.
The typical failover timeout value is 30 seconds. With VMware ESX/ESXi, you can adjust this
value by editing the device driver options for the installed HBAs in /etc/vmware/esx.conf.
Additionally, you can increase the standard disk timeout value in the virtual machine operating system to ensure that the operating system is not extensively disrupted and does not log permanent errors during the failover phase. Adjusting this timeout again depends on the operating system that is used and the amount of queued I/O that is expected on a path when it fails; see the appropriate technical documentation for details.
VC includes real-time performance counters that display the past hour (which is not archived),
as well as archived statistics that are stored in a database. The real-time statistics are
collected every 20 seconds and presented in the VI Client for the past 60 minutes
(Figure 11-7).
These real-time counters are also the basis for the archived statistics, but to avoid too much
performance database expansion, the granularity is recalculated according to the age of the
performance counters. Virtual Center collects those real-time counters, aggregates them for a
data point every 5 minutes and stores them as past-day statistics in the database. After one
day, these counters are aggregated once more to a 30-minute interval for the past week
statistics. For the past month, a data point is available every 2 hours, and for the last year, one
datapoint is stored per day.
In general, the Virtual Center statistics are a good basis to get an overview about the actual
performance statistics and to further analyze performance counters over a longer period, for
example, several days or weeks. If a granularity of 20-second intervals is sufficient for your
individual performance monitoring perspective, VC can be a good data source after
configuration. You can obtain more information about how to use the Virtual Center
Performance Statistics at this website:
http://communities.vmware.com/docs/DOC-5230
After this initial configuration, the performance counters are displayed as shown in
Example 11-3.
ADAPTR CID TID LID NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s
vmhba1 - - - 2 1 1 32 238 0 0 - - - 4.11 0.20 3.91 0.00
vmhba2 0 0 0 1 1 1 10 4096 32 0 8 25 0.25 25369.69 25369.30 0.39 198.19
vmhba2 0 0 1 1 1 1 10 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 2 1 1 1 10 4096 32 0 0 0 0.00 0.39 0.00 0.39 0.00
vmhba2 0 0 3 1 1 1 9 4096 32 0 0 0 0.00 0.39 0.00 0.39 0.00
vmhba2 0 0 4 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 5 1 1 1 17 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 6 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 7 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 8 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 9 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 1 - 1 1 10 76 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba3 - - - 1 2 4 16 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba4 - - - 1 2 4 16 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba5 - - - 1 2 20 152 4096 0 0 - - - 0.78 0.00 0.78 0.00
Additionally, you can change the field order and select or clear various performance counters
in the view. The minimum refresh rate is 2 seconds, and the default setting is 5 seconds.
When you use esxtop in Batch Mode, always include all of the counters by using the option
-a. To collect the performance counters every 10 seconds for 100 iterations and save them to
a file, run esxtop this way:
esxtop -b -a -d 10 -n 100 > perf_counters.csv
For additional information about how to use esxtop and other tools, see vSphere Resource
Management Guide:
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_resource_mgmt.pdf
The guest operating system is unaware of the underlying VMware ESX/ESXi virtualization
layer, so any performance data captured inside the VMs can be misleading and must be
analyzed and interpreted only in conjunction with the actual configuration and performance
data gathered in ESX/ESXi Server or on a disk or SAN layer.
There is one additional benefit of the Windows Performance Monitor perfmon (see 10.8.2,
“Windows Performance console (perfmon)” on page 411). When you use esxtop in Batch
Mode with option -a, it collects all available performance counters and thus the collected
comma-separated values (csv) data gets large and cannot be easily parsed. Perfmon can
help you to quickly analyze results or to reduce the amount of csv data to a subset of counters
that can be analyzed more easily by using other utilities. You can obtain more information
about importing the esxtop csv output into perfmon:
http://communities.vmware.com/docs/DOC-5100
It is also important to identify and separate specific workloads, because they can negatively
influence other workloads that might be more business critical.
Within ESX/ESXi Server, it is not possible to configure striping over several LUNs for one
datastore. It is possible to add more than one LUN to a datastore, but adding more than one
LUN to a datastore only extends the available amount of storage by concatenating one or
more additional LUNs without balancing the data over the available logical volumes.
The easiest way to implement striping over several hardware resources is to use storage pool
striping in extent pools (see 4.8, “Planning extent pools” on page 115 for further information)
of the attached DS8000.
The only other possibility to achieve striping at the virtual machine level is to configure several virtual disks for a VM that are on different hardware resources, such as different HBAs, device adapters, or servers, and then configure striping of those virtual disks within the guest operating system layer.
For performance monitoring purposes, be careful with spanned volumes or even avoid these
configurations. When configuring more than one LUN to a VMFS datastore, the volume space
is spanned across multiple LUNs, which can cause an imbalance in the utilization of those
LUNs. If several virtual disks are initially configured within a datastore and the disks are
mapped to different virtual machines, it is no longer possible to identify in which area of the
configured LUNs the data of each VM is allocated. Thus, it is no longer possible to pinpoint
which host workload causes a possible performance problem.
In summary, avoid using spanned volumes and configure your systems with only one LUN per
datastore.
If a virtual machine (VM) generates more commands to a LUN than the LUN queue depth,
these additional commands are queued in the ESX/ESXi kernel, which increases the latency.
The queue depth is defined on a per LUN basis, not per initiator. An HBA (SCSI initiator)
supports many more outstanding commands.
For ESX/ESXi Server, if two virtual machines access their virtual disks on two different LUNs,
each VM can generate as many active commands as the LUN queue depth. But if those two
virtual machines have their virtual disks on the same LUN (within the same VMFS volume),
the total number of active commands that the two VMs combined can generate without
queuing I/Os in the ESX/ESXi kernel is equal to the LUN queue depth. Therefore, when
several virtual machines share a LUN, the maximum number of outstanding commands to
that LUN from all those VMs together must not exceed the LUN queue depth.
To reduce latency, it is important to ensure that the sum of active commands from all virtual
machines of an ESX/ESXi Server does not frequently exceed the LUN queue depth. If the
LUN queue depth is exceeded regularly, you might either increase the queue depth or move the virtual disks of a few virtual machines to different VMFS volumes, which lowers the number of virtual machines that access a single LUN. The maximum LUN queue depth
per ESX/ESXi Server must not exceed 64. The maximum LUN queue depth per ESX/ESXi
Server can be up to 128 only when a server has exclusive access to a LUN.
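As an illustration only (module names, parameters, and command syntax vary by HBA vendor and ESX/ESXi version, so check the VMware documentation for your release), increasing the LUN queue depth on an ESX/ESXi 4.x host with a QLogic HBA might look like this, followed by a reboot:
esxcfg-module -s ql2xmaxqdepth=64 qla2xxx           # set the QLogic HBA LUN queue depth to 64
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding    # match the outstanding requests limit per LUN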
VMFS is a filesystem for clustered environments, and it uses SCSI reservations during
administrative operations, such as creating or deleting virtual disks or extending VMFS
volumes. A reservation ensures that at a specific time, a LUN is only available to one ESX/ESXi host.
The maximum number of virtual machines that can share the LUN depends on several
conditions. In general, virtual machines with heavy I/O activity result in a smaller number of
possible VMs per LUN. Additionally, you must consider the already discussed LUN queue
depth limits per ESX/ESXi Server and the storage system-specific limits.
RDM offers two configuration modes: virtual compatibility mode and physical compatibility
mode. When you use physical compatibility mode, all SCSI commands toward the virtual disk
are passed directly to the device, which means that all physical characteristics of the
underlying hardware become apparent. Within virtual compatibility mode, the virtual disk is
mapped as a file within a VMFS volume, which allows advanced file locking support and the
use of snapshots. Figure 11-8 compares both possible RDM configuration modes and VMFS.
Figure 11-9 Result of random workload test for VMFS, RDM physical, and RDM virtual
Performance data varies: The performance data contained in Figure 11-9 and
Figure 11-10 was obtained in a controlled, isolated environment at a specific point in time
by using the configurations, hardware, and software levels available at that time. Actual
results that might be obtained in other operating environments can vary. There is no
guarantee that the same or similar results can be obtained elsewhere. The data is intended
to help illustrate only how different technologies behave in relation to each other.
(Chart: throughput versus transfer size in KB (4 - 128 KB) for VMFS write, VMFS read, RDM physical read, and RDM virtual read)
Figure 11-10 Result of sequential workload test for VMFS, RDM physical, and RDM virtual
The choice between the available filesystems, VMFS and RDM, has a limited influence on the
performance of the virtual machines. These tests showed a performance difference of only
about 2 - 3%.
When using VMware ESX/ESXi, each VMFS datastore segments the allocated LUN into
blocks, which can be between 1 - 8 MB in size. The filesystem used by the virtual machine
operating system optimizes I/O by grouping several sectors into one cluster. The cluster size
usually is in the range of several KB.
If the VM operating system reads a single cluster from its virtual disk, at least one block
(within VMFS) and all the corresponding stripes on physical disk need to be read. Depending
on the sizes and the starting sector of the clusters, blocks, and stripes, reading one cluster
might require reading two blocks and all of the corresponding stripes. Figure 11-11 on
page 440 illustrates that in an unaligned structure, a single I/O request can cause additional
I/O operations. Thus, an unaligned partition setup results in additional I/O that incurs a
penalty on throughput and latency and leads to lower performance for the host data traffic.
Figure 11-11 Processing of a data request in an unaligned structure (cluster, VMFS block, and DS8000 LUN stripe)
An aligned partition setup ensures that a single I/O request results in a minimum number of
physical disk I/Os, eliminating the additional disk operations, which, in fact, results in an
overall performance improvement.
Operating systems using the x86 architecture create partitions with a master boot record
(MBR) of 63 sectors. This design is a relic of older BIOS code from personal computers
that used cylinder, head, and sector addressing instead of Logical Block Addressing (LBA).
The first track is always reserved for the MBR, and the first partition starts at the second track
(cylinder 0, head 1, and sector 1), which is sector 63 in LBA. Also, in current operating
systems, the first 63 sectors cannot be used for data partitions. The first possible start sector
for a partition is 63.
Partition alignment is a known issue in filesystems, but its effect on performance is somewhat
controversial. In performance lab tests, all workloads generally showed a slight increase in
throughput when the partitions were aligned. A significant effect can be verified only for
sequential workloads. Starting with transfer sizes of 32 KB and larger, we observed
performance improvements of up to 15%.
In general, aligning partitions can improve the overall performance. For random workloads,
we only identified a slight effect. For sequential workloads, a possible performance gain of
about 10% seems to be realistic. So, we can suggest that you align partitions especially for
sequential workload characteristics.
Aligning partitions within an ESX/ESXi Server environment requires two steps. First, the
VMFS partition needs to be aligned. And then, the partitions within the VMware guest system
filesystems must be aligned as well for maximum effectiveness.
You can align the VMFS partition only when configuring a new datastore. When using the VI
Client, the new partition is automatically configured to an offset of 128 sectors = 64 KB. But,
in fact, this configuration is not ideal when using the DS8000 disk storage. Because the
DS8000 uses larger stripe sizes, the offset must be configured to at least the stripe size. For
RAID 5 and RAID 10 in Open Systems attachments, the stripe size is 256 KB, and it is a good
approach to set the offset to 256 KB (or 512 sectors). You can configure an individual offset
only from the ESX/ESXi Server command line.
Example 11-4 shows how to create an aligned partition with an offset of 512 sectors by using
fdisk. Then, you must create a VMFS filesystem within the aligned partition by using the
vmkfstools command, as shown in Example 11-5.
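The following is a minimal sketch of these steps from the ESX service console, assuming an ESX 3.x release, an example device name, and a datastore label of myDatastore:
# Create an aligned VMFS partition (device name is an example)
fdisk /vmfs/devices/disks/vmhba1:0:0:0
#   n  -> new primary partition 1, accept the defaults
#   t  -> change the partition type to fb (VMware VMFS)
#   x  -> enter expert mode
#   b  -> move the beginning of data of partition 1 to sector 512 (256 KB offset)
#   w  -> write the partition table and exit
# Create the VMFS filesystem on the aligned partition
vmkfstools -C vmfs3 -b 1m -S myDatastore /vmfs/devices/disks/vmhba1:0:0:0:1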
You can obtain additional information about aligning VMFS partitions and the performance
effects from the document VMware Infrastructure 3: Recommendations for aligning VMFS
partitions:
http://www.vmware.com/pdf/esx3_partition_align.pdf
We also describe the supported distributions of Linux when you use the DS8000, as well as
the tools that can be helpful for the monitoring and tuning activities:
Linux disk I/O architecture
Host bus adapter (HBA) considerations
Multipathing
Software RAID functions
Logical Volume Manager (LVM)
Disk I/O schedulers
Filesystem considerations
For further clarification and the most current information about supported Linux distributions
and hardware prerequisites, see the System Storage Interoperation Center (SSIC) website:
http://www.ibm.com/systems/support/storage/config/ssic
Further information about supported kernel versions and additional restrictions can be
obtained from the IBM Subsystem Device Driver for Linux website:
http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S4000107
This chapter introduces the relevant logical configuration concepts needed to attach Linux
operating systems to a DS8000 and focuses on performance relevant configuration and
measuring options. For further information about hardware-specific Linux implementation and
general performance considerations about the hardware setup, see the following
documentation:
For a general Linux implementation overview:
Linux Handbook A Guide to IBM Linux Solutions and Resources, SG24-7000
For x86-based architectures:
– Tuning IBM System x Servers for Performance, SG24-5287
– Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers, REDP-3861
– Tuning SUSE LINUX Enterprise Server on IBM eServer xSeries Servers, REDP-3862
For System p hardware:
– Virtualizing an Infrastructure with System p and Linux, SG24-7499
– Tuning Linux OS on System p The POWER Of Innovation, SG24-7338
For System z hardware:
– Linux on IBM System z: Performance Measurement and Tuning, SG24-6926
– Linux for IBM System z9 and IBM zSeries, SG24-6694
– z/VM and Linux on IBM System z, SG24-7492
The architecture that we describe applies to Open Systems servers attached to the DS8000
by using the Fibre Channel Protocol (FCP). For Linux on System z with extended count key
data (ECKD) attachment, different considerations apply.
(Figure 12-1: Linux disk I/O subsystem layers, including the page cache flushed by pdflush, the block layer with the I/O scheduler, the device drivers, and the disk device with its sectors)
For a quick overview of overall I/O subsystem operations, we use the example of writing data
to a disk. The following sequence outlines the fundamental operations that occur when a
disk-write operation is performed, assuming that the file data was already read from the disk
sectors and resides in the page cache:
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
3. A pdflush kernel thread takes care of flushing the page cache to disk.
4. The filesystem layer combines the block buffers into a bio structure (see 12.2.3, “Block
layer” on page 449) and submits a write request to the block device layer.
5. The block device layer gets requests from upper layers and performs an I/O elevator
operation and puts the requests into the I/O request queue.
This sequence is simplified, because it reflects only I/Os to local physical disks (a SCSI disk
attached via a native SCSI adapter). Storage configurations that use additional virtualization
layers and SAN attachment require additional operations and layers, such as in the DS8000
storage system.
(Figure: data is staged between the disk, the cache, and the CPU registers of each processor)
Linux uses this principle in many components, such as page cache, file object cache (i-node
cache and directory entry cache), and read ahead buffer.
The synchronization process for a dirty buffer is called flush. In the Linux kernel 2.6
implementation, the pdflush kernel thread is responsible for flushing data to the disk. The
flush occurs on a regular basis (kupdate) and when the proportion of dirty buffers in memory
exceeds a certain threshold (bdflush). The threshold is configurable in the
/proc/sys/vm/dirty_background_ratio file.
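A minimal sketch of checking and tuning this threshold (the value 5 is only an example):
# Display the current threshold (percentage of memory that may be dirty
# before background writeback starts)
cat /proc/sys/vm/dirty_background_ratio
# Lower the threshold so that pdflush starts writing back earlier
sysctl -w vm.dirty_background_ratio=5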
The operating system synchronizes the data regularly. But with large amounts of system
memory, it might keep updated data in memory for several days. This delay puts the data at
risk in the event of a failure. To avoid this situation, we advise that you use the sync command.
When the sync command is invoked, all changes and records are updated on the disks and all
buffers are cleared.
Periodic usage of sync is necessary in transaction processing environments that frequently
update the same dataset file, which is intended to stay in memory. Data synchronization can
be set to frequent updates automatically, but the sync command is useful in situations when
large data movements are required and copy functions are involved.
Important: Remember to invoke the sync command after the application completes its data
synchronization and before issuing a FlashCopy or starting the initial data synchronization of
a remote mirror.
When a write is performed, the filesystem layer tries to write to the page cache, which is
made up of block buffers. It makes up a bio structure by putting the contiguous blocks
together and then sends the bio to the block layer (see Figure 12-1 on page 447).
The block layer handles the bio request and links these requests into a queue called the I/O
request queue. This linking operation is called I/O elevator or I/O scheduler. In Linux kernel
2.6 implementations, four types of I/O elevator algorithms are available.
The Linux kernel 2.6 employs a new I/O elevator model. The Linux kernel 2.4 used a single,
general-purpose I/O elevator, but the Linux kernel 2.6 offers a choice of four elevators.
Because the Linux operating system can be used for a wide range of tasks, both I/O devices
and workload characteristics change significantly. A notebook computer probably has
different I/O requirements than a 10000 user database system. To accommodate these
differences, four I/O elevators are available. I/O elevator implementation and tuning are
discussed further in 12.3.4, “Tuning the disk I/O scheduler” on page 456.
SCSI
The Small Computer System Interface (SCSI) is the most commonly used I/O device
technology, especially in the enterprise server environment. In Linux kernel implementations,
SCSI devices are controlled by device driver modules. They consist of the following types of
modules (Figure 12-3):
Upper level drivers: sd_mod, sr_mod (SCSI-CDROM), st (SCSI tape), and sg (SCSI
generic device)
They provide functionality to support several types of SCSI devices, such as SCSI
CD-ROM, and SCSI tape.
Middle level driver: scsi_mod
It implements SCSI protocol and common SCSI functionality.
Low-level drivers
They provide lower-level access to each device. A low-level driver is specific to a hardware
device and is provided for each device, for example, ips for the IBM ServeRAID controller,
qla2300 for the QLogic HBA, and mptscsih for the LSI Logic SCSI controller.
Pseudo driver: ide-scsi
It is used for IDE-SCSI emulation.
Figure 12-3 Structure of SCSI drivers
For further general performance and tuning recommendations, see Linux Performance and
Tuning Guidelines, REDP-4285.
For each HBA, there are BIOS levels and driver versions available. The supported versions
for each Linux kernel level, distribution, and related information are available from the
following link:
http://www.ibm.com/systems/support/storage/config/hba/index.wss
To configure the HBA correctly, see the IBM TotalStorage DS8000: Host Systems Attachment
Guide, SC26-7625, which includes detailed procedures and suggested settings. Also, read
the readme files and manuals of the driver, BIOS, and HBA.
With each HBA driver, you can configure several parameters. The list of available parameters
depends on the specific HBA type and driver implementation. If these settings are not
configured correctly, it might affect performance or the system might not work correctly.
You can configure each parameter as either temporary or persistent. For temporary
configurations, you can use the modprobe command. Persistent configuration is performed by
editing the following file (based on distribution):
/etc/modprobe.conf for RHEL
/etc/modprobe.conf.local for SLES
To set the queue depth of an Emulex HBA to 20, add the following line to
modprobe.conf(.local):
options lpfc lpfc_lun_queue_depth=20
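For a temporary (non-persistent) change, a minimal sketch might look like the following, assuming the Emulex lpfc driver is not already loaded and that the distribution exposes the queue depth through sysfs:
# Load the driver with the changed queue depth (the setting is lost after reboot)
modprobe lpfc lpfc_lun_queue_depth=20
# Verify the effective queue depth of the attached SCSI devices
cat /sys/class/scsi_device/*/device/queue_depth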
Specific HBA types support a failover on the HBA level, for example, QLogic HBAs. HBA level
multipathing is generally not supported for the DS8000. When using Device Mapper -
Multipath I/O (DM-MP) or Subsystem Device Driver (SDD) for multipathing, this failover on the
HBA level needs to be disabled. HBA level path failover is disabled by default in new Linux
distributions, although it is enabled in SLES9. To disable failover on a QLogic qla2xxx adapter,
add the following line to modprobe.conf(.local):
options qla2xxx ql2xfailover=0
From a performance perspective, the queue depth parameter and the various timeout and
retry parameters that apply to path errors are of interest.
By changing the queue depth, you can queue more outstanding I/Os at the adapter level,
which can, in certain configurations, have a positive effect on throughput. However, increasing
the queue depth cannot be generally advised, because it can slow performance or cause
delays, depending on the actual configuration. Thus, the complete setup needs to be checked
carefully before adjusting the queue depth. Increasing the queue depth helps mostly with
sequential large block write workloads and with some sequential read workloads. Random
workloads do not benefit much from increased queue depth values. Indeed, you must first
verify that the queue is actually a limiting factor before you change it.
Example 12-1 shows the output of the iostat -kx command. The average queue size value
might look high enough to suggest increasing the queue depth parameter. If you look closer at
the statistics, you can see that the service time is low and that the counter for merged write
requests is high: many write requests are merged into fewer write requests before they are
sent to the adapter. Service times of less than 1 ms for write requests indicate that writes are
cached. Taking all these observations into consideration, the queue depth setting in this
example is fine.
High queue depth parameter values might lead to adapter overload situations, which can
cause adapter resets and loss of paths. In turn, this situation might cause adapter failover and
overload the remaining paths. It might lead to situations where I/O is stuck for a period of
time. When you use DM-MP, we suggest that you decrease those values to allow the
multipathing module to react faster to path or adapter problems and to avoid potential failures.
Subsystem Device Driver (SDD) is a generic device driver that is designed to support
multipath configurations with the DS8000. SDD is provided and maintained by IBM for
several operating systems, including Linux. SDD is the older multipathing approach.
Starting with kernel Version 2.6, newer, smarter multipathing support is available for
Linux.
The Multipath I/O support included in Linux 2.6 kernel versions is based on Device Mapper
(DM), a layer for block device virtualization that supports logical volume management,
multipathing, and software RAID.
In general, IBM SDD is supported only on older releases, such as SLES 8 and 9 and RHEL 3
and 4. For the newer distribution releases SLES 10 and 11 and RHEL 5 and 6, DM-MP is the
only supported multipathing solution. Only SLES 9 SP3 and later and RHEL 4 U6 and later
support both SDD and DM-MP, but DM-MP is generally the suggested solution.
We advise that you use DM-MP if possible for your system configuration. DM-MP already is
the preferred multipathing solution for most Linux 2.6 kernels. It is also available for 2.4
kernels, but it needs to be manually included and configured during kernel compilation.
DM-MP is the required multipathing setup for LVM2.
Further information about supported distribution releases, kernel versions, and multipathing
software is documented at the IBM Subsystem Device Driver for Linux website:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S4000107
DM-MP provides round-robin load balancing for multiple paths per LUN. The userspace
component is responsible for automated path discovery and grouping, as well as path
handling and retesting of previously failed paths. The framework is extensible for
hardware-specific functions and additional load balancing or failover algorithms. For more
information about DM-MP, see IBM System Storage DS8000 Host Attachment and
Interoperability, SG24-8887.
IBM provides a device-specific configuration file for the DS8000 for the supported levels of
RHEL and SLES. You must copy the device-specific section of the file to
/etc/multipath.conf before the multipath driver and multipath tools are started.
Example 12-2 sets default parameters for the scanned logical unit numbers (LUNs) and
creates user friendly names for the multipath devices that are managed by DM-MP. Further
configuration, adding aliases for certain LUNs, or blacklisting specific devices can be
manually configured by editing this file.
You can download a sample configuration file for SLES 11 from:
ftp://ftp.software.ibm.com/storage/subsystem/linux/dm-multipath/3.01/SLES11/multipath.conf
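As an illustration only, a device-specific section for the DS8000 typically resembles the following sketch; the exact parameters and values in the IBM-provided file can differ, so always use the downloaded file:
defaults {
        user_friendly_names yes
}
devices {
        device {
                vendor                  "IBM"
                product                 "2107900"
                path_grouping_policy    multibus
                path_checker            tur
                failback                immediate
                no_path_retry           queue
        }
}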
Using DM-MP, you can configure various path failover policies, path priorities, and failover
priorities. This type of configuration can be done for each device in the /etc/multipath.conf
setup.
Timeout value: Do not set low timeout and retry count settings on the HBA. They might
lead to the unnecessary failover of paths. For shortwave distances, the default timeout
setting of about 70 seconds is sufficient. However, heavy online transaction processing
(OLTP) environments might require a lower setting of about 30 seconds for data protection
reasons.
For further configuration and setup information, see the following publications:
For SLES:
http://www.suse.com/documentation/sles11/pdfdoc/stor_admin/stor_admin.pdf
For RHEL:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6-Beta/html/DM_Multi
path/index.html
Considerations and comparisons between IBM SDD for Linux and DM-MP:
http://www.ibm.com/support/docview.wss?uid=ssg1S7001664&rs=555
With the use of LVM, you can configure logical extents on multiple physical drives or LUNs.
Each LUN mapped from the DS8000 is divided into one or more physical volumes (PVs).
Several of those PVs can be added to a logical volume group (VG), and later on, logical
volumes (LVs) are configured out of a volume group. Each physical volume (PV) consists of a
number of fixed-size physical extents (PEs). Similarly, each logical volume (LV) consists of a
number of fixed-size logical extents (LEs). A logical volume (LV) is created by mapping logical extents to physical extents.
With LVM2, you can influence the way that LEs (for a logical volume) are mapped to the
available PEs. With LVM linear mapping, the extents of several PVs are concatenated to build
a larger logical volume. Figure 12-4 illustrates a logical volume spread across several physical
volumes. With striped mapping, consecutive logical extents are distributed across several
physical volumes in groups of contiguous physical extents, which are called stripes. With this
functionality, it is possible to configure striping between several LUNs within LVM, which
provides approximately the same performance benefits as the software RAID functions.
Figure 12-4 LVM striped mapping of three LUNs to a single logical volume
You can obtain more information about LVM from the LVM HOWTO:
http://www.tldp.org/HOWTO/LVM-HOWTO/index.html
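As a sketch of the striped mapping described above, assuming three DS8000 multipath devices (mpathb, mpathc, and mpathd are example names) and an example stripe size of 256 KB:
pvcreate /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd
vgcreate vg_ds8000 /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd
# Stripe the logical volume across the three PVs (-i 3) with a 256 KB stripe size (-I 256)
lvcreate -n lv_data -i 3 -I 256 -L 200G vg_ds8000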
Using EVMS does not influence the performance of a Linux system directly, but with EVMS
and the LVM, filesystem and Software RAID functions can be configured to influence the
storage performance of a Linux system.
You can obtain more information about using EVMS from this website:
http://evms.sourceforge.net/user_guide/
EVMS used to be included with older SLES releases but was never officially supported for
RHEL. It is not shipped with SLES11 anymore. The latest stable version of EVMS is 2.5.5,
which was released on February 26, 2006. Development on EVMS was discontinued and
LVM is the standard logical volume management solution for Linux.
You can obtain further documentation about how to use the command-line RAID tools in Linux
from this website:
http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html
Modern practice
The role of LVM changed with the DS8000 microcode Release 6.2, Easy Tier, and I/O Priority
Manager (IOPM). Earlier approaches with rank-based volume separation are not necessarily
reasonable configurations anymore. With the DS8000 features, such as Easy Tier and IOPM,
the use of hybrid or multi-tier pools with automated cross-tier and intra-tier management, as
well as micro-tiering capabilities for optimum data relocation might offer excellent
performance. These methods require less storage management effort than the use of many
single-tier volumes striped with LVM. The use of LVM in these configurations might even
result in decreasing real skew factors and inefficient Easy Tier optimization due to diluted heat
distributions.
The preferred way to use LVM, Easy Tier, and IOPM is to use LVM concatenated logical
volumes. This method might be useful when it is not possible to use volumes larger than
2 TB in DS8700/DS8800 (there are still some copy function limitations) or when implementing
disaster recovery solutions that require LVM involvement. In other cases, follow the preferred
practices described in Chapter 4, “Logical configuration performance considerations” on
page 87 and Chapter 3, “Logical configuration concepts and terminology” on page 51.
In general, configure LVM with one logical volume per DS8000 volume. With this approach,
you can fully use the latest DS8000 Easy Tier and IOPM capabilities.
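A minimal sketch of this one-to-one approach, assuming a single DS8000 multipath device named mpathb:
pvcreate /dev/mapper/mpathb
vgcreate vg_easytier /dev/mapper/mpathb
# Linear (concatenated) logical volume that uses all extents of the single DS8000 volume
lvcreate -n lv_app -l 100%FREE vg_easytier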
You can obtain additional details about configuring and setting up I/O schedulers in Tuning
Linux OS on System p The POWER Of Innovation, SG24-7338.
12.3.5 Filesystems
The filesystems that are available for Linux are designed with different workload and
availability characteristics. If your Linux distribution and the application allow the selection of a
different filesystem, it might be worthwhile to investigate whether Ext, Journaled File System
(JFS), XFS, or ReiserFS better matches the workload pattern. JFS and XFS are best suited
for high-end data warehouses, scientific workloads, large symmetric multiprocessor (SMP)
servers, or streaming media servers.
ReiserFS and Ext3 are typically used for file, web, or mail serving. For write-intense
workloads that create smaller I/Os up to 64 KB, ReiserFS might have an edge over Ext3 with
default journaling mode as seen in Figure 12-5. However, this advantage is only true for
synchronous file operations.
An option to consider is the Ext2 filesystem. Due to its lack of journaling abilities, Ext2
outperforms ReiserFS and Ext3 for synchronous filesystem access regardless of the access
pattern and I/O size. So, Ext2 might be an option when performance is more important than
data integrity. But, the lack of journaling can result in an unrecoverable FS in certain
circumstances. The Ext3 filesystem can also be configured without journaling in which case it
is equal to Ext2 from a performance point of view.
(Chart: throughput in kB/sec versus I/O size in kB/op (4 - 2048 kB) for Ext2, Ext3, Ext3 writeback, and ReiserFS)
Figure 12-5 Random write throughput comparison between Ext and ReiserFS (synchronous); see note 1
In the most common scenario of an asynchronous filesystem, ReiserFS most often delivers
solid performance and outperforms Ext3 with the default journaling mode (data=ordered).
However, Ext3 is equal to ReiserFS as soon as the default journaling mode is switched to
writeback.
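A minimal sketch of switching an existing ext3 filesystem to writeback journaling (the device and mount point names are examples):
# Make data=writeback the default mount option of the filesystem
tune2fs -o journal_data_writeback /dev/mapper/mpathb-part1
# Or request it explicitly at mount time (or in /etc/fstab)
mount -o data=writeback /dev/mapper/mpathb-part1 /data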
1. The performance data contained in this figure was obtained in a controlled, isolated environment at a specific point
in time by using the configurations, hardware, and software levels available at that time. Actual results that might be
obtained in other operating environments can vary. There is no guarantee that the same or similar results can be
obtained elsewhere. The data is intended only to help illustrate how different technologies behave in relation to each
other.
Ionice only has an effect when several processes compete for I/O. If you use ionice to favor
certain processes, the I/Os of other, possibly even essential, processes can suffer.
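As a brief illustration (the process ID and script name are examples), ionice assigns an I/O scheduling class and priority to a process, which is honored by the CFQ scheduler:
# Best-effort class (2) with the highest priority (0) for an existing process
ionice -c2 -n0 -p 1234
# Run a batch job in the idle class so it gets I/O only when the disks are otherwise idle
ionice -c3 /usr/local/bin/nightly_batch.sh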
With the DS8000 I/O Priority Manager (IOPM) feature on the DS8700/DS8800 storage
system, you must decide where to apply I/O priority management. IOPM has the following
advantages:
Provides the flexibility of many levels of priorities to be set
Does not consume the resources on the server
Sets real priority at the disk level
Manages internal bandwidth access contention between several servers
With ionice, you can control the priorities for the use of server resources, that is, to manage
access to the HBA, which is not possible with the DS8000 IOPM. However, this capability is
limited to a single server only, and you cannot manage priorities at the disk system back end.
We suggest that you use IBM DS8000 IOPM in most cases for priority management. The
operating system priority management can be combined with IOPM. This combination
provides the highest level of flexibility.
Mounting filesystems with the noatime option prevents inode access times from being
updated. If file and directory update times are not critical to your implementation, as in a web
serving environment, an administrator might choose to mount filesystems with the noatime
flag in the /etc/fstab file as shown in Example 12-4. The performance benefit of disabling
access time updates to be written to the filesystem ranges from 0 - 10% with an average of
3% for file server workloads.
Example 12-4 Update /etc/fstab file with noatime option set on mounted filesystems
/dev/sdb1 /mountlocation ext3 defaults,noatime 1 2
(Chart: throughput in kB/sec versus I/O size in kB/op (4 - 2048 kB) for Ext3 with data=ordered and data=writeback journaling modes; see note 2)
2. The performance data contained in this figure was obtained in a controlled, isolated environment at a specific point in
time by using the configurations, hardware, and software levels available at that time. Actual results that might be
obtained in other operating environments can vary. There is no guarantee that the same or similar results can be
obtained elsewhere. The data is intended to help illustrate how different technologies behave in relation to each
other.
Blocksizes
The blocksize, the smallest amount of data that can be read or written to a drive, can have a
direct impact on server performance. As a guideline, if your server handles many small files, a
smaller blocksize is more efficient. If your server is dedicated to handling large files, a larger
blocksize might improve performance. Blocksizes cannot be changed dynamically on existing
filesystems; only a reformat modifies the current blocksize. Most Linux distributions allow
blocksizes of 1 KB, 2 KB, or 4 KB. As benchmarks demonstrate, there is hardly any
performance improvement to gain from changing the blocksize of a filesystem, so it is better
to leave it at the default of 4 KB. Also consider the suggestions of the application vendor.
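As an example (the device name is an assumption), the blocksize is set when the filesystem is created and can be checked afterward:
# Create an ext3 filesystem with an explicit 4 KB blocksize
mkfs.ext3 -b 4096 /dev/mapper/mpathb-part1
# Check the blocksize of an existing ext2/ext3 filesystem
tune2fs -l /dev/mapper/mpathb-part1 | grep "Block size"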
Example 12-6 Checking the sda disk with the dmesg|grep sda command
#dmesg|grep sda
[ 1.836698] sda: sda1 sda2
[ 1.839013] sd 0:2:0:0: [sda] Attached SCSI disk
[ 12.169482] EXT3 FS on sda2, internal journal
[ 29.832183] Adding 2104472k swap on /dev/sda1. Priority:-1 extents:1 across:2104472k
# mount
/dev/sda2 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
In Example 12-6, you see that disk sda contains several partitions and the swap file. This disk
is the root disk, which is proven by the output of the mount command.
The other disks in the system are likely the DS8000 disks. You can verify this with the
multipath -ll command, as shown in Example 12-7. The multipath -ll command is
considered deprecated; in current Linux versions, the multipathd -k interactive prompt is
used to communicate with DM-MP. For more information, see IBM System Storage DS8000
Host Attachment and Interoperability, SG24-8887.
Example 12-7 shows that the device is a DS8000 volume with an active-active configuration.
The LUNs have user-friendly names of mpathb and mpathc and device names of dm-0 and
dm-1, which appear in the performance statistics. The LUN IDs in parentheses,
3600507630affc29f0000000000008603 and 3600507630affc29f0000000000008607, contain
the ID of the logical volume in the DS8000 in the last four digits: 8603 and 8607. The output
also indicates that the size of each LUN is 100 GB, that no hardware handler is assigned to
the device, and that I/O is queued forever in the event that no paths are available. The path
group is in the active state, which means that all paths in this group carry I/Os to the storage.
All paths to the device (LUN) are in active ready mode. There are four paths per LUN,
presented in the system as sdX devices, where X is the index letter of the disk.
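Because the exact layout depends on the device-mapper-multipath version, the following is only an illustrative approximation of what multipath -ll reports for the first of these LUNs; the host:channel:target:LUN numbers are assumptions:
mpathb (3600507630affc29f0000000000008603) dm-0 IBM,2107900
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 3:0:0:0 sdc 8:32  active ready running
  |- 3:0:1:0 sde 8:64  active ready running
  |- 4:0:0:0 sdg 8:96  active ready running
  `- 4:0:1:0 sdi 8:128 active ready running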
Example 12-8 shows the contents of the /sys/class/fc_host folder. This system has 10 FC
ports. To get the information about each port, use the following script, as shown in
Example 12-9.
This script simplifies the information gathering. It uses 18 as the maximum index for the FC
port, which is true for Example 12-8. You might have your own indexing. See Example 12-10
for the script output.
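A minimal sketch of such a script, assuming the sysfs attribute names shown later in Example 12-11 and a maximum host index of 18:
#!/bin/bash
# Print WWPN, link state, port type, speed, and supported speeds for every FC host port
for i in $(seq 0 18); do
    d=/sys/class/fc_host/host$i
    [ -d "$d" ] || continue
    cat "$d/port_name" "$d/port_state" "$d/port_type" "$d/speed" "$d/supported_speeds"
done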
0x100000051eb2887c
Linkdown
Unknown
unknown
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
0x100000051e8c23b0
Linkdown
Unknown
unknown
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
0x100000051e8c23b1
Linkdown
Unknown
unknown
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
0x210000c0dd17f1a9
Online
Unknown
unknown
10 Gbit
0x210000c0dd17f1ab
Online
Unknown
unknown
0x21000024ff2d0f8c
Online
NPort (fabric via point-to-point)
8 Gbit
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
0x21000024ff2d0f8d
Online
NPort (fabric via point-to-point)
8 Gbit
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
0x21000024ff2d0ed4
Online
NPort (fabric via point-to-point)
8 Gbit
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
0x21000024ff2d0ed5
Online
NPort (fabric via point-to-point)
8 Gbit
1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit
In Example 12-10 on page 464, there are four HBA FC ports that are offline and four FC ports
that are online. There are also two 10 Gb ports on the system.
Another way to discover the connection configuration is to use the systool -av -c fc_host
command as shown in Example 12-11. This command displays extended output and
information about the FC ports. However, this command might not be available in all Linux
distributions.
Example 12-11 Output of the port information with systool -av -c fc_host (only one port shown)
Class Device = "host18"
Class Device path =
"/sys/devices/pci0000:05/0000:05:00.0/0000:06:00.1/host18/fc_host/host18"
fabric_name = "0x100000051e470807"
issue_lip = <store method only>
max_npiv_vports = "254"
node_name = "0x20000024ff2d0ed5"
npiv_vports_inuse = "0"
port_id = "0x35e080"
port_name = "0x21000024ff2d0ed5"
port_state = "Online"
port_type = "NPort (fabric via point-to-point)"
speed = "8 Gbit"
supported_classes = "Class 3"
supported_speeds = "1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit"
symbolic_name = "QLE2562 FW:v5.03.02 DVR:v8.03.01.06.11.1-k8"
system_hostname = ""
tgtid_bind_type = "wwpn (world wide port name)"
uevent =
vport_create = <store method only>
vport_delete = <store method only>
Device = "host18"
Device path = "/sys/devices/pci0000:05/0000:05:00.0/0000:06:00.1/host18"
edc = <store method only>
Example 12-11 on page 465 shows the output for one FC port:
The device file for this port is host18.
This port has a worldwide port name (WWPN) of 21000024ff2d0ed5, which appears in the
fabric.
This port is connected at 8 Gb/sec.
It is a QLogic card with firmware version 5.03.02.
Example 12-10 on page 464 shows this information for all FC ports on the system. The four
QLogic HBA FC ports that are connected to the SAN and their WWPNs are shown in bold.
The rank configuration for the disk can be shown with the showfbvol -rank VOL_ID
command, where VOL_ID is 8607 in this example (Example 12-12).
Example 12-12 on page 466 shows the properties of the logical volume. The following
information can be discovered for the volume:
Occupies three ranks (R1, R17, and R18)
Belongs to volume group V17
Is 100 GB in size
Uses an extent allocation method (EAM) that is managed by Easy Tier
Uses a standard storage allocation method (SAM)
Is a regular, non-thin provisioned volume
dscli> showarray a1
Array A1
SN ATV1814D95C074K
State Assigned
Example 12-13 on page 467 shows how to reveal the physical disk and array information. The
properties of the rank provide the array number, and the array properties provide the disk
information and the RAID type. In this case, rank R1 is located on array A1, which consists of
300 GB solid-state drives (SSDs) in a RAID 5 configuration. The same procedure can be
used for the other ranks of the volume.
Example 12-14 shows that volume group V17 participates in two host connections for two
WWPNs: 21000024FF2D0ED4 and 21000024FF2D0F8C. These WWPNs are the same
WWPNs in Example 12-10 on page 464. Now, we have all the information for a specific
volume.
The symptoms that show that the server might be suffering from a disk bottleneck (or a
hidden memory problem) are shown in Table 12-1.
Disk I/O numbers and wait time: Analyze the number of I/Os to the LUN. This data can be
used to discover whether reads or writes are the cause of the problem. Use iostat to get the
disk I/Os. Use stap ioblock.stp to get the read/write blocks. Also, use scsi.stp to get the
SCSI wait times and the requests submitted and completed. Long wait times can also mean
that the I/O goes to specific disks and is not spread out.
Disk I/O size: The memory buffer available for the block I/O request might not be sufficient,
and the page cache size can be smaller than the maximum disk I/O size. Use stap
ioblock.stp to get the request sizes. Use iostat to get the blocksizes.
Disk I/O to physical device: If all the disk I/Os are directed to the same physical disk, it might
cause a disk I/O bottleneck. Directing the disk I/O to different physical disks increases the
performance.
Remember to gather statistics in extended mode with timestamps and with kilobyte or
megabyte values. This information is easier to understand, and you capture all necessary
information at one time. Use the iostat -kxt or iostat -mxt command.
Look at the statistics from the iostat tool to help you understand the situation. You can use
the following suggestions as shown in the examples.
Good situations
Good situations have the following characteristics:
High tps value, high %user value, low %iowait, and low svctm: Very good condition, as
expected
High tps value, high %user value, medium %iowait, and medium svctm: Situation is still
good, but requires attention. Probably write activity is a little higher than expected. Check
write block size and queue size.
Low tps value, low %user value, medium-high %iowait value, low %idle value, high MB/sec
value, and high avgrq-sz value: System performs well with large block write or read
activity.
Bad situations
Bad situations have the following characteristics:
Low tps value, low %user value, low %iowait, and low svctm: System is not handling disk
I/O. If the application still suffers from disk I/O, look first at the application, not at the
disk subsystem.
Low tps value, low %user value, high %system value, high or low svctm, and 0 %idle
value: System is stuck with disk I/O. This situation can happen because of a failed path, a
device adapter problem, or application errors.
High tps value, medium %user value, medium %system value, high %iowait value, and
high svctm: System has consumed the disk resources, and you need to consider an upgrade.
Increase the number of disks first.
These situations are examples for your understanding. Plenty of similar situations might
occur. Remember to analyze not one or two values of the collected data, but try to obtain a full
picture by combining all the available data.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdc 0,00 0,00 978,22 11,88 124198,02 6083,17 263,17 1,75 1,76 0,51 50,30
sdd 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sde 0,00 0,00 20,79 0,00 2538,61 0,00 244,19 0,03 0,95 0,76 1,58
dm-0 59369,31 1309,90 1967,33 15,84 244356,44 8110,89 254,61 154,74 44,90 0,50 99,41
dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdf 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdg 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdh 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdi 0,00 0,00 968,32 3,96 117619,80 2027,72 246,12 1,67 1,71 0,49 47,52
Example 12-16 shows that disk dm-0 is running the workload. It has four paths: sdc, sde, sdg,
and sdi. The workload is currently distributed to three paths for reading (sdc, sde, and sdi)
and to two paths for writing (sdc and sdi).
Example 12-17 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows
average wait times (await) of about 2.7 seconds and service times (svctm) of 270 ms.
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
/dev/sdb1 441.00 3030.00 7.00 30.50 3584.00 24480.00 1792.00 12240.00 748.37 101.70 2717.33 266.67 100.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
/dev/sdb1 441.00 3030.00 7.00 30.00 3584.00 24480.00 1792.00 12240.00 758.49 101.65 2739.19 270.27 100.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
/dev/sdb1 438.81 3165.67 6.97 30.35 3566.17 25576.12 1783.08 12788.06 781.01 101.69 2728.00 268.00 100.00
The output in Example 12-17 is a typical example of how a large block write workload can
affect the response time of a small block read workload. There are about 30 write IOPS with
12240 KB/s of bandwidth and 7 read IOPS with 1792 KB/s. This output shows a large I/O size,
which is why the queue sizes and service times are high. However, the I/O bottleneck is
not necessarily the disk system in this case. The application level needs changes to avoid
this type of contention.
Example 12-18 shows the output of the iostat command on a logical partition (LPAR)
configured with 1.2 CPUs running Red Hat Enterprise Linux AS 4 while issuing server writes
to the disks sda and dm-2. The disk transfers per second are 130 for sda and 692 for dm-2.
The %iowait is 6.37%, which might seem high for this workload, but it is not. It is normal for a
mix of write and read workloads. However, it might grow rapidly in the future, so pay attention
to it.
Example 12-19 shows the output of the iostat -k command on an LPAR configured with a
1.2 CPU running Red Hat Enterprise Linux AS 4 that issues server writes to the sda and dm-2
disks. The disk transfers per second are 428 for sda and 4024 for dm-2. The %iowait
increased to 12.42%. Our prediction from the previous example has come true. The workload
became higher and the %iowait value grew, but the %user value remained the same. The disk
system can now hardly manage the workload and requires tuning or an upgrade.
Although the workload grew, the performance of the user processes did not improve. The
application might issue more requests, but they must wait in the queue instead of being
serviced. Gather the extended iostat statistics.
Changes made to the elevator algorithm as described in 12.3.4, “Tuning the disk I/O
scheduler” on page 456 are displayed in avgrq-sz (average size of request) and avgqu-sz
(average queue length), as illustrated in Example 12-16 on page 472. As the latencies are
lowered by manipulating the elevator settings, avgrq-sz decreases. You can also monitor the
rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disk
can manage.
The system must be configured to collect the information and log it; therefore, a cron job
must be set up. Add the lines shown in Example 12-20 to /etc/crontab for automatic log
reporting with cron.
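A minimal sketch of such crontab entries, assuming the sysstat package installs its collection scripts under /usr/lib64/sa (the path varies by distribution):
# Collect system activity data every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# Create a daily summary report shortly before midnight
53 23 * * * root /usr/lib64/sa/sa2 -A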
You get a detailed overview of your CPU utilization (%user, %nice, %system, and %idle),
memory paging, network I/O and transfer statistics, process creation activity, activity for block
devices, and interrupts/second over time.
The sar -A command (the -A option is equivalent to -bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL,
which selects the most relevant counters of the system) is the most effective way to gather all
relevant performance counters. We suggest that you use the sar command to analyze
whether a system is disk I/O-bound and waits too much, which results in filled-up memory
buffers and low CPU usage. Furthermore, this method is useful to monitor the overall system
performance over a longer time period, for example, days or weeks, to understand at which
times a claimed performance bottleneck occurs.
A variety of additional performance data collection utilities are available for Linux. Most of
them are transferred from UNIX systems. You can obtain more details about those additional
tools in Chapter 9, “Performance considerations for UNIX servers” on page 327.
For a precise performance analysis, you must have statistics from the operating system and
the disk subsystem. For a complete picture to understand the problem, you must analyze both
of them. Statistics from the disk system can be gathered with Tivoli Productivity Center for
Disk. For more information, see Chapter 7, “Practical performance management” on
page 235.
The following IBM i specific features are important for the performance of external storage:
Single-level storage
Object-based architecture
Storage management
Types of disk pools
We describe these features and explain how they relate to the performance of a connected
DS8000.
With the read request, the virtual addresses of the needed record are resolved, and for each
needed page, storage management first looks to see whether it is in main memory. If the
page is there, it is used to resolve the read request. However, if the corresponding page is not
in main memory, a page fault is encountered and the page must be retrieved from disk. When
a page is retrieved, it replaces another page in memory that was not recently used; the
replaced page is paged out to disk.
Similarly writing a new record or updating an existing record is done in main memory, and the
affected pages are marked as changed. A changed page normally remains in main memory
until it is written to disk as a result of a page fault. Pages are also written to disk when a file is
closed or when write-to-disk is forced by a user through commands and parameters.
The handling of I/O operations is shown in Figure 13-1 on page 477.
When resolving virtual addresses for I/O operations, storage management directories map
the disk and sector to a virtual address. For a read operation, a directory lookup is performed
to get the needed information for mapping. For a write operation, the information is retrieved
from the page tables. Figure 13-2 illustrates resolving addresses to sectors in storage
volumes.
An I/O operation is done for a block of 8 sectors, which equals one 4 KB page. The exception
is when the storage is connected through the Virtual I/O Server (VIOS). In this case, the eight
520-byte sectors from IBM i are mapped into nine 512-byte sectors on the storage system.
System ASP
The system ASP is the basic disk pool for IBM i. This ASP contains the IBM i boot disk (load
source), system libraries, indexes, user profiles, and other system objects. The system ASP is
always present in IBM i and is needed for IBM i. IBM i does not IPL (boot) if the system ASP is
inaccessible.
User ASP
A user ASP separates the storage for different objects for easier management. For example,
the libraries and database objects that belong to one application are in one user ASP, and the
objects of another application are in a different user ASP. If user ASPs are defined in the
IBM i system, they are needed for IBM i to IPL.
The DS8000 can connect to an IBM i system in one of the following ways:
Native: FC adapters in IBM i are connected through a storage area network (SAN) to the
host bus adapters (HBAs) in the DS8000.
With Virtual I/O Server Node Port ID Virtualization (VIOS NPIV): FC adapters in the VIOS
are connected through a SAN to the HBAs in the DS8000. IBM i is a client of the VIOS and
uses virtual FC adapters; each virtual FC adapter is mapped to a port in an FC adapter in
the VIOS.
For more information about connecting the DS8000 to IBM i with VIOS_NPIV, see DS8000
Copy Services for IBM i with VIOS, REDP-4584, and IBM System Storage DS8000: Host
Attachment and Interoperability, SG24-8887.
With Virtual I/O Server (VIOS): FC adapters in the VIOS are connected through a SAN to
the HBAs in the DS8000. IBM i is a client of the VIOS, and virtual SCSI adapters in VIOS
are connected to the virtual SCSI adapters in IBM i.
For more information about connecting storage systems to IBM i with the VIOS, see IBM i
and Midrange External Storage, SG24-7668, and DS8000 Copy Services for IBM i with
VIOS, REDP-4584.
IOPs: The information provided in this section refers to connection with IBM i I/O processor
(IOP)-less adapters. For similar information regarding older IOP-based adapters, see IBM i
and IBM System Storage: A Guide to Implementing External Disks on IBM i,
SG24-7120-01.
All listed adapters are IOP-less adapters. They do not require an I/O processor card to offload
the data management. Instead, the CPU manages the I/O and communicates directly with the
FC adapter. Thus, the IOP-less FC technology takes full advantage of the performance
potential in IBM i.
IOP-less FC architecture enables two technology functions that are important for the
performance of the DS8000 with IBM i: Tag Command Queuing and Header Strip Merge.
13.2.3 Multipath
IBM i allows multiple connections from different ports on a single IBM i partition to the same
logical volumes in the DS8000. This multipath support provides an extra level of availability
and error recovery between IBM i and the DS8000. If one IBM i adapter fails, or one
connection to the DS8000 is lost, the system continues to use the other connections and to
communicate with the disk unit. IBM i supports up to eight active connections (paths) to a
single LUN in the DS8000.
In addition to high availability, multiple paths to the same LUN provide load balancing. A
Round-Robin algorithm is used to select the path for sending the I/O requests. This algorithm
enhances the performance of IBM i with the DS8000 LUNs connected in Multipath.
Multipath is part of the IBM i operating system. This Multipath differs from other platforms that
have a specific software component to support multipathing, such as the Subsystem Device
Driver (SDD).
When the DS8000 connects to IBM i through the VIOS, the Multipath in IBM i is implemented
so that each path to a LUN uses a different VIOS. Therefore, at least two VIOSs are required
to implement Multipath for an IBM i client. This way of multipathing provides additional
resiliency in case one VIOS fails. In addition to the IBM i Multipath with two or more VIOS, the
FC adapters in each VIOS can multipath to the connected DS8000 to provide additional
resiliency and enhance performance.
We suggest that you use RAID 10 for IBM i systems, especially for the following types of
workloads:
Workloads with large I/O rates
Workloads with many write operations (low read/write ratio)
Workloads with many random writes
Workloads with low write-cache efficiency
When an IBM i page or a block of data is written to disk space, storage management spreads
it over multiple disks. By spreading data over multiple disks, multiple disk arms work in
parallel for any request to this piece of data, so writes and reads are faster.
When using external storage with IBM i, Storage management sees a logical volume (LUN) in
the DS8000 as a “physical” disk unit. If a LUN is created with the rotate volumes extent
allocation method (EAM), it occupies multiple stripes of a rank. If a LUN is created with the
rotate extents EAM, it is composed of multiple stripes of different ranks. Figure 13-3 shows
the use of the DS8000 disk with IBM i LUNs created with the rotate extents EAM.
Figure 13-3 Use of disk arms with LUNs created in rotate extents method (LUN 1 and LUN 2 spread across a 6+P+S array and a 7+P array)
We suggest that you use the Disk Magic tool when you plan the number of ranks in the
DS8000 for an IBM i workload. To provide a good starting point for Disk Magic modeling,
consider the number of ranks that is needed to keep disk utilization under 60% for your IBM i
workload. Table 13-1 shows the maximum number of IBM i I/Os per second for one rank that
keeps the disk utilization under 60%, for workloads with read/write ratios of 70/30 and 50/50.
Use the following steps to calculate the necessary number of ranks for your workload by using
Table 13-1:
1. Decide which read/write ratio (70/30 or 50/50) is appropriate for your workload.
2. Decide which RAID level to use for the workload.
3. Look for the corresponding number in Table 13-1.
4. Divide the I/O/sec of your workload by the number from the table to get the number of
ranks.
For example, we show a calculation for a medium IBM i workload with a read/write ratio of
50/50. The workload experiences 8500 I/O per second, and 15K RPM disk drives in RAID 10
are used. The number of needed ranks is 8500 I/O per second divided by the corresponding
per-rank value from Table 13-1, rounded up. Therefore, we suggest that you use 8 ranks of
15K RPM disk drives in RAID 10 for this workload.
1. The calculations for the values in Table 13-1 are based on the measurements of how many I/O operations one rank
can handle in a certain RAID level, assuming 20% read cache hit and 30% write cache efficiency for the IBM i
workload. We assume that half of the used ranks have a spare and half are without a spare.
Number of disk drives in the DS8000: In addition to the suggestion for many LUNs, we
also suggest a sufficient number of disk drives in the DS8000 to achieve good IBM i
performance, as described in 13.3.2, “Number of ranks” on page 481.
Another reason why we suggest that you define smaller LUNs for IBM i is the queue depth in
Tag Command Queuing. With a natively connected DS8000, IBM i manages a queue depth
of 6 concurrent I/O operations to a LUN. With the DS8000 connected through VIOS, the
queue depth for a LUN is 32 concurrent I/O operations. Both of these queue depths are
modest numbers compared to other operating systems. Therefore, you need to define
sufficiently small LUNs so that IBM i does not exceed the queue depth with its I/O operations.
Also, by considering the manageability and limitations of external storage and IBM i, we
currently suggest that you define LUN sizes of about 70 - 140 GB.
You might think that the rotate volumes EAM for creating IBM i LUNs provides sufficient disk
arms for I/O operations and that the use of the rotate extents EAM is “overvirtualizing”.
However, based on the performance measurements and preferred practices, the rotate
extents EAM of defining LUNs for IBM i still provides the best performance, so we advise that
you use it.
Sharing the ranks among the IBM i systems enables the efficient use of the DS8000
resources. However, the performance of each LPAR is influenced by the workloads in the
other LPARs.
For example, two extent pools are shared among IBM i LPARs A, B, and C. LPAR A
experiences a long peak with large blocksizes that causes a high I/O load on the DS8000
ranks. During that time, the performance of B and the performance of C decrease. But, when
the workload in A is low, B and C experience good response times, because they can use
most of the disk arms in the shared extent pool. In these periods, the response times in B and
C are possibly better than if they use dedicated ranks.
Many IBM i data centers successfully share the ranks with little unpredictability in
performance, because the disk arms and cache in the DS8000 are used more efficiently this way.
Other IBM i data centers prefer the stable and predictable performance of each system even
at the cost of more DS8000 resources. These data centers dedicate extent pools to each of
the IBM i LPARs.
Many IBM i installations have one or two LPARs with important workloads and several smaller,
less important LPARs. These data centers dedicate ranks to the large systems and share the
ranks among the smaller ones.
For more information about the use of Disk Magic with IBM i, see 6.1, “Disk Magic” on
page 176 and IBM i and IBM System Storage: A Guide to Implementing External Disks on
IBM i, SG24-7120.
To help you better understand the tool functions, we divide them into two groups. We divide
them into performance data collectors (the tools that collect performance data) and
performance data investigators (the tools to analyze the collected data).
Collectors can be managed by IBM System Director Navigator for i, IBM i Operations
Navigator, or IBM i commands.
The following tools are or contain the IBM i performance data investigators:
IBM Performance Tools for i
IBM System Director Navigator for i
iDoctor
Collection Services
The major IBM System i performance data collector is called Collection Services. It is
designed to run all the time to provide data for performance health checks, for analysis of a
sudden performance problem, or for planning new hardware and software upgrades. The tool
is documented in detail in the IBM i Information Center:
http://publib.boulder.ibm.com/eserver/ibmi.html
The following tools can be used to manage the data collection and report creation of
Collection Services:
IBM i Operations Navigator
IBM Systems Director Navigator for i
IBM Performance Tools for i
iDoctor Collection Service Investigator can be used to create graphs and reports based on
Collection Services data. For more information about iDoctor, see the IBM i iDoctor online
documentation at the following link:
https://www-912.ibm.com/i_dir/idoctor.nsf/documentation.html
With IBM i level V7R1, the Collection Services tool offers additional data collection categories,
including a category for external storage. This category supports the collection of
non-standard data that is associated with certain external storage subsystems that are
attached to an IBM i partition. This data can be viewed within iDoctor, which is described in
“iDoctor” on page 487.
Job Watcher
Job Watcher is an advanced tool for collecting and analyzing performance information to help
you effectively monitor your system or to analyze a performance issue. It is job-centric and
thread-centric and can collect data at intervals of seconds. The collection contains vital
information, such as job CPU and wait statistics, call stacks, SQL statements, objects waited
on, sockets, and TCP. For more information about Job Watcher, see the IBM i Information
Center:
http://publib.boulder.ibm.com/eserver/ibmi.html
Or, see “Web Power - New browser-based Job Watcher tasks help manage your IBM i
performance” in the IBM Systems Magazine (on the IBM Systems magazine page, search for
the title of the article):
http://www.ibmsystemsmag.com/ibmi
Disk Watcher
Disk Watcher is a function of IBM i that provides disk data to help identify the source of
disk-related performance problems on the IBM i platform. It can either collect information
about every I/O in trace mode or collect information in buckets in statistics mode. In statistics
mode, the collected data is summarized, which keeps the collection much smaller than in
trace mode.
For more information about the use of Disk Watcher, see “A New Way to Look at Disk
Performance” and “Analyzing Disk Watcher Data” in the IBM Systems Magazine (on the IBM
Systems magazine page, search for the title of the article):
http://www.ibmsystemsmag.com/ibmi
Disk Watcher gathers detailed information associated with I/O operations to disk units, and
provides data beyond the data that is available in other IBM i integrated tools, such as Work
with Disk Status (WRKDSKSTS), Work with System Status (WRKSYSSTS), and Work with
System Activity (WRKSYSACT).
Performance Explorer
Performance Explorer (PEX) is a data collection tool in IBM i that collects information about a
specific system process or resource to provide detailed insight. PEX complements IBM i
Collection Services.
An example of PEX, IBM i, and connection with external storage is identifying the IBM i
objects that are most suitable to relocate to solid-state drives (SSDs). We often use PEX to
collect IBM i disk events, such as synchronous and asynchronous reads, synchronous and
asynchronous writes, page faults, and page-outs. The collected data is then analyzed by the
iDoctor tool PEX-Analyzer to observe the I/O rates and disk service times of different objects.
The objects that experience the highest accumulated read service time, the highest read rate,
and a modest write rate at the same time are good candidates to relocate to SSDs.
For a better understanding of IBM i architecture and I/O rates, see 13.1, “IBM i storage
architecture” on page 476. For more details about using SSDs with IBM i, see 13.5, “Easy
Tier with IBM i” on page 489.
The Job Watcher part of Performance Tools analyzes the Job Watcher data through the IBM
Systems Director Navigator for i Performance Data Visualizer.
Collection Services reports about disk utilization and activity, which are created with IBM
Performance Tools for i, are used for sizing and Disk Magic modeling of the DS8000 for IBM i:
The Disk Utilization section of the System report
The Disk Utilization section of the Resource report
The Disk Activity section of the Component report
iDoctor
iDoctor is a suite of tools used to manage the collection of data, investigate performance
data, and analyze performance data on IBM i. The goals of iDoctor are to broaden the user
base for performance investigation, simplify and automate processes of collecting and
investigating the performance data, provide immediate access to collected data, and offer
more analysis options.
The iDoctor tools are used to monitor the overall system health at a high level or to drill down
to the performance details within jobs, disk units, and programs. Use iDoctor to analyze data
collected during performance situations. iDoctor is frequently used by IBM, clients, and
consultants to help solve complex performance issues quickly.
One example of using iDoctor PEX-Analyzer is to determine the IBM i objects that are
candidates to move to SSD. In iDoctor, you launch the tool PEX-Analyzer as shown in
Figure 13-5 on page 488.
Select the PEX collection on which you want to work and select the type of graph that you
want to create, as shown in Figure 13-6.
Figure 13-7 on page 489 illustrates an example of the graph that shows the accumulated read
disk service time on IBM i objects. The objects with the highest accumulated read service
time are good candidates to relocate to SSD. For more information about relocating IBM i
data to SSD, see 13.5.2, “IBM i methods for hot-spot management” on page 490.
IBM i Storage Manager spreads the IBM i data across the available disk units (LUNs) so that
each disk drive is about equally occupied. The data is spread in extents that range from 4 KB to 1 MB, or even 16 MB. The extents of each object usually span as many LUNs as possible to
provide many volumes to serve the particular object. Therefore, if an object experiences a
high I/O rate, this rate is evenly split among the LUNs. The extents that belong to the
particular object on each LUN are I/O-intense.
Many of the IBM i performance tools work on the object level; they show different types of read and write rates on each object and disk service times on the objects.
In contrast, the Easy Tier tool monitors and relocates data on the 1 GB extent level, and IBM i ASP balancing, which is used to relocate data to SSDs, works on the 1 MB extent level. Monitoring extents and relocating extents do not depend on the object to which the extents belong; they occur on the sub-object level.
In certain cases, queries must be created to run on the PEX collection to provide specific
information, for example, the query that provides information about which jobs and threads
use the objects with the highest read service time. You might also need to run a query to
provide the blocksizes of the read operations, because we expect that the reads with
smaller blocksizes profit the most from SSDs. If these queries are needed, contact IBM
Lab Services to create them.
3. Based on the PEX analysis, decide which database objects to relocate to the SSD in the
DS8000. Then, use IBM i commands, such as Change Physical File (CHGPF) with
parameter UNIT(*SSD), or use the SQL command ALTER TABLE with the UNIT SSD clause, as shown in the example that follows. These commands set a preferred media attribute on the file that invokes dynamic data movement. The preferred
media attribute can be set on database tables and indexes, as well as on User-Defined
File Systems (UDFS).
For more information about the UDFS, see the IBM i Information Center:
http://publib.boulder.ibm.com/eserver/ibmi.html
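For illustration only, with hypothetical library, file, and table names, the preferred media attribute can be set as follows:
CHGPF FILE(APPLIB/ORDERS) UNIT(*SSD)
ALTER TABLE APPLIB.ORDERS UNIT SSD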
ASP balancing
This IBM i method is similar to DS8000 Easy Tier, because it is based on the data movement
within an ASP by IBM i ASP balancing. The ASP balancing function is designed to improve
IBM i system performance by balancing disk utilization across all of the disk units (or LUNs) in
an auxiliary storage pool. It provides four ways to balance an ASP. Two of these ways relate to
data relocation to SSDs:
Hierarchical Storage Management (HSM) balancing
Media preference balancing
The HSM balancer function, which traditionally supports data migration between
high-performance and low-performance internal disk drives, is extended for the support of
data migration between SSDs and hard disk drives (HDDs). The disk drives can be internal or
reside on the DS8000. The data movement is based on the weighted read I/O count statistics
for each 1 MB extent of an ASP. Data monitoring and relocation is achieved by the following
two steps:
1. Run the ASP balancer tracing function during the important period by using the
TRCASPBAL command. This function collects the relevant data statistics.
2. By using the STRASPBAL TYPE(*HSM) command, you move the data to SSD and HDD
based on the statistics that you collected in the previous step.
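As an illustration of these two steps (the ASP number and time limits are hypothetical, and the exact parameters depend on the IBM i release), the trace can be started for a defined period and the migration started afterward:
TRCASPBAL SET(*ON) ASP(1) TIMLMT(600)
STRASPBAL TYPE(*HSM) ASP(1) TIMLMT(*NOMAX)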
The Media preference balancer function is the ASP balancing function that helps to correct
any issues with Media preference-flagged database objects or UDFS files not on their
preferred media type, which is either SSD or HDD, based on the specified subtype parameter.
ASP balancer migration priority is an option in the ASP balancer so that you can specify the
migration priority for certain balancing operations, including *HSM or *MP in levels of either
*LOW, *MEDIUM, or *HIGH, thus influencing the speed of data migration.
Location: For data relocation with Media preference or ASP balancing, the LUNs defined
on SSD and on HDD need to reside in the same IBM i ASP. It is not necessary that they are
in the same extent pool in the DS8000.
The method requires that you create a separate ASP that contains LUNs that reside on the
DS8000 SSD and, then, save the relevant IBM i libraries and restore them to the ASP with
SSD. All the files in the libraries then reside on SSDs, and the performance of the applications
that use these files improves.
Additional information
For more information about the IBM i methods for SSD hot-spot management, including the
information about IBM i prerequisites, see the following documents:
IBM i 7.1 Technical Overview, SG24-7858
“Performance Value of Solid State Drives using IBM i”:
http://www.ibm.com/systems/resources/ssd_ibmi.pdf
IBM i Two-Tiered Hybrid Storage with IBM System Storage DS8700 Solid State Drives
Hot-Spot Data Analysis and Migration in an IBM i DS8700 Environment:
http://w3.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP101868
Before deciding on a mixed SSD and HDD environment or deciding to obtain additional
SSDs, consider these questions:
How many SSDs do you need to install to get the optimal balance between the
performance improvement and the cost?
What is the estimated performance improvement after you install the SSDs?
The clients that use IBM i Media preference get at least a partial answer to these questions
from the collected PEX data by using queries and calculations. The clients that decide on
DS8000 Easy Tier or even IBM i ASP balancing get the key information to answer these
questions by the skew level of their workloads. The skew level describes how the I/O activity is distributed across the occupied disk capacity.
To provide an example of the skew level of a typical IBM i installation, we use the IBM i
benchmark workload, which is based on the workload TPC-E. TPC-E is an online transaction processing (OLTP) benchmark developed by the Transaction Processing Performance Council.
It uses a database to model a brokerage firm with customers who generate transactions
related to trades, account inquiries, and market research. The brokerage firm in turn interacts
with financial markets to execute orders on behalf of the customers and updates relevant
account information. The benchmark workload is scalable, which means that the number of
customers defined for the brokerage firm can be varied to represent the workloads of
different-sized businesses. The workload runs with a configurable number of job sets. Each
job set executes independently, generating its own brokerage firm next transaction. By
increasing the number of job sets, we increased the throughput and CPU utilization of the run.
We used the following configuration for Easy Tier monitoring for which we obtained the skew
level:
IBM i LPAR with 8 processing units and 60 GB memory in POWER7 model 770
Disk space for IBM i provided from an extent pool with 4 ranks of HDDs in a DS8800 at code level R6.2
48 LUNs of 70 GB each used for IBM i (the LUNs are defined with the rotate extents EAM from the extent pool)
The LUNs are connected to IBM i with multipath through 2 ports in separate 4 Gb FC adapters
In the IBM i LPAR, we ran the following two workloads in turn:
– The workload with 6 database instances and 6 job sets
– The workload with 6 database instances and 3 job sets
The workload with 6 database instances was used to achieve 35% occupation of the disk
space. During the run with 6 job sets, the access density was about 2.7 IO/sec/GB. During the
run with 3 job sets, the access density was about 0.3 IO/sec/GB.
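To put these access densities in context, and assuming that they are expressed per gigabyte of occupied capacity: the configuration provides 48 x 70 GB = 3360 GB, of which about 35% (roughly 1176 GB) was occupied, so 2.7 IO/sec/GB corresponds to an overall rate in the order of 3000 IO/sec, and 0.3 IO/sec/GB to roughly 350 IO/sec.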
Figure 13-9 on page 494 and Figure 13-10 on page 495 show the skew level of the IBM i
workload from the Easy Tier data collected during 24 hours. On Figure 13-9 on page 494, you
can see the percentage of reads with small blocksizes and the percentage of transferred MB
by the percentage of occupied disk space. The degree of skew is small due to the efficient
spreading of data across available LUNs by IBM i storage management, which is described in
13.5.1, “Hot data in an IBM i workload” on page 489.
As shown in Figure 13-10 on page 495, the degree of skew for the same workload on all
allocated extents is higher, because only 35% of the available disk space is occupied by IBM i
data.
Perform the following steps to use the STAT for an IBM i workload:
1. Enable the collection of the heat data I/O statistics by changing the Easy Tier monitor
parameter to all or automode. Use the DS8000 command-line interface (DSCLI)
command chsi -etmonitor all or chsi -etmonitor automode. The parameter -etmonitor
all enables monitoring on all LUNs in the DS8000. The parameter -etmonitor automode
monitors the volumes that are managed by Easy Tier automatic mode only.
2. Offload the collected data from the DS8000 clusters to the user workstation. Use either
the DS8000 Storage Manager GUI or the DSCLI command offloadfile -etdata.
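A minimal DSCLI sequence for these two steps, with a hypothetical storage image ID and output directory, might look as follows:
dscli> chsi -etmonitor all IBM.2107-75ABCD1
dscli> offloadfile -etdata /tmp/etdata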
Figure 13-11 shows an example of the STAT heat distribution on IBM i LUNs after running the
IBM i workload described in 13.5.3, “Skew level of an IBM i workload” on page 492. The hot
and warm data is evenly spread across the volumes, which is typical for an IBM i workload
distribution.
An IBM i client can also use the IBM i Media preference or ASP balancing method for hot-spot
management. It is not our goal to compare the performance for the three relocation methods.
However, we do not expect much difference in performance by using one or another method.
Probably, factors such as ease of use, consolidation of the management method, or control
over which data to move, are more important for an IBM i client to decide which method to
use.
Many IBM i clients run multiple IBM i workloads in different POWER partitions that share the
disk space in the System Storage DS8000. The installations run important production
systems and less important workloads for testing and development. The other partitions can be
used as disaster recovery targets of production systems in another location. We assume that
IBM i centers with various workloads that share the DS8000 disk space use I/O Priority
Manager to achieve a more efficient spread of storage resources.
Next, we show a simple example of using the I/O Priority Manager for two IBM i workloads.
The POWER partition ITSO_1 is configured with 4 processor units, 56 GB memory, and 48 LUNs of 70 GB each in a DS8000 extent pool with Enterprise drives.
We set up the I/O Priority Manager Performance Group 1 (PG1) for the volumes of ITSO_1 by
using the DSCLI command:
chfbvol -perfgrp pg1 2000-203f
Performance Group 1 is defined for the 64 LUNs, but only 48 of these 64 LUNs are added to the ASP and used by the system ITSO_1; the other LUNs are in IBM i
non-configured status.
We set up the I/O Priority Manager Performance Group 11 (PG11) for the volumes of ITSO_2
by using the DSCLI command:
chfbvol -perfgrp pg11 2200-221f
After we defined the performance groups for the IBM i LUNs, we ran the IBM i benchmark
workload described in 13.5.3, “Skew level of an IBM i workload” on page 492, with 40 job sets
in each of the ITSO_1 and ITSO_2 partitions.
After the workload finished, we obtained the monitoring reports of each performance group,
PG1 with the LUNs of ITSO_1 and PG11 with the LUNs of ITSO_2, during the 5-hour
workload run with 15-minute monitoring intervals.
Figure 13-12 on page 498 and Figure 13-13 on page 498 show the DSCLI commands that we
used to obtain the reports and the displayed performance values for each performance group.
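The performance group reports themselves are produced with the DSCLI lsperfgrprpt command. As a sketch (check the DSCLI help for the options that select the reporting time range and interval), the two reports can be listed with:
dscli> lsperfgrprpt pg1
dscli> lsperfgrprpt pg11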
The workload in Performance Group PG11 shows different I/O characteristics than the
workload in Performance Group 1. Performance Group PG11 also experiences relatively high
response times compared to Performance Group 1. In our example, the workload
characteristics and response times are influenced by the different priority groups, types of
disk drives used, and Easy Tier management.
For more information about IBM i performance with the DS8000 I/O Priority Manager, see
IBM i Shared Storage Performance using IBM System Storage DS8000 I/O Priority Manager,
WP101935, which is available on the IBM Techdocs library.
The following specific DS8000 performance features relate to application I/O in a z/OS
environment:
Parallel Access Volumes (PAVs)
Multiple allegiance
I/O priority queuing
I/O Priority Manager (IOPM)
Logical volume sizes
Fibre Channel connection (FICON)
In the following sections, we describe those DS8000 features and discuss how to best use
them to boost performance.
Traditionally, access to highly active volumes involved manual tuning, splitting data across
multiple volumes, and more actions to avoid those hot spots. With PAV and the z/OS
Workload Manager, you can now almost forget about manual device level performance tuning
or optimizing. The Workload Manager can automatically tune your PAV configuration and
adjust it to workload changes. The DS8000 in conjunction with z/OS can meet the highest
performance requirements.
PAV is implemented by defining alias addresses to the conventional base address. The alias
address provides the mechanism for z/OS to initiate parallel I/O to a volume. An alias is
another address/unit control block (UCB) that can be used to access the volume defined on
the base address. An alias can be associated with a base address defined in the same logical
control unit (LCU) only. The maximum number of addresses that you can define in an LCU is
256. Theoretically, you can define one base address, plus 255 aliases in an LCU.
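For illustration, base and alias devices of an LCU are distinguished by their unit types in the I/O configuration (3390B for base devices and 3390A for aliases). A simplified IOCP-style sketch with hypothetical device numbers, normally generated through HCD and with the CNTLUNIT statement omitted, might look as follows:
IODEVICE ADDRESS=(9500,64),CUNUMBR=(9500),UNIT=3390B
IODEVICE ADDRESS=(9540,192),CUNUMBR=(9500),UNIT=3390A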
With dynamic PAV, you do not need to assign as many aliases in an LCU as compared to a
static PAV environment, because the aliases are moved around to the base addresses that
need an extra alias to satisfy an I/O request.
The z/OS Workload Manager (WLM) is used to implement dynamic PAVs. This function is
called dynamic alias management. With dynamic alias management, WLM can automatically
perform alias device reassignments from one base device to another base device to help
meet its goals and to minimize I/Os queuing as workloads change.
WLM manages PAVs across all the members of a sysplex. When deciding on alias
reassignment, WLM considers I/O from all systems in the sysplex. By default, the function is
turned off, and must be explicitly activated for the sysplex through an option in the WLM
service definition, and through a device-level option in the hardware configuration definition
(HCD). Dynamic alias management requires your sysplex to run in WLM Goal mode.
HyperPAV allows different hosts to use one alias to access different base addresses. This
capability reduces the number of alias addresses required to support a set of base addresses
in a System z environment.
The test results are shown in a chart that plots the response time (msec) and the I/O rate (IO/sec) over time for the HyperPAV test and the dynamic PAV test.
Figure 14-2 on page 503 shows the number of PAVs assigned to the base address.
HyperPAV: The number of PAVs almost immediately jumped to around 10 and fluctuated between 9 and 11.
Dynamic PAV: The number of PAVs started at one, and WLM gradually increased the PAVs one at a time until reaching a maximum of nine.
In this test, we see that HyperPAV assigns more aliases compared to dynamic PAV. But, we
also see that HyperPAV reaches a higher I/O rate compared to dynamic PAV. This test is an
extreme test that tries to show how HyperPAV reacts to a high concurrent I/O rate to a single
volume, as compared to how dynamic PAV responds to this condition.
The conclusion is that HyperPAV reacts immediately to a condition where there is a high
concurrent demand on a volume. The other advantage of HyperPAV is that there is no
overhead for assigning and releasing an alias for every I/O operation that needs an alias.
HyperPAV reduces the number of aliases required even further. IBM Storage Advanced
Technical Support provides an analysis based on your Resource Measurement Facility (RMF) data. As a result of this study, they provide a recommendation of how many UCB aliases
you need to define per LCU on your DS8000.
With older storage subsystems (before the DS8000 or ESS), a device had an implicit
allegiance, that is, a relationship created in the disk control unit between the device and a
channel path group, when an I/O operation was accepted by the device. The allegiance
caused the control unit to guarantee access (no busy status presented) to the device for the
remainder of the channel program over the set of paths associated with the allegiance.
The DS8000, because of Multiple Allegiance (MA), can accept multiple parallel I/O requests
from different hosts to the same device address, increasing parallelism and reducing channel
overhead. The requests are accepted by the DS8000 and all requests are processed in
parallel, unless there is a conflict when writing data to the same extent of the count key data
(CKD) logical volume. Still, good application access patterns can improve the global
parallelism by avoiding reserves, limiting the extent scope to a minimum, and setting an
appropriate file mask, for example, if no write is intended.
In systems without MA, all I/O requests to a shared volume except the first are rejected, and
the I/Os are queued in the System z channel subsystem, showing up as PEND time in the
RMF reports.
The DS8000 ability to run channel programs to the same device in parallel can dramatically
reduce the IOSQ and the PEND time components in shared environments.
In particular, different workloads, for example, batch and online, running in parallel on different
systems can unfavorably affect each other. In these cases, MA can dramatically improve the
overall throughput.
First, we look at a disk subsystem that does not support both of these functions. If there is an
outstanding I/O operation to a volume, all later I/Os must wait, as illustrated in Figure 14-3 on
page 505. I/Os coming from the same LPAR wait in the LPAR, and this wait time is recorded
in IOSQ Time. I/Os coming from different LPARs wait in the disk control unit and are recorded
in Device Busy Delay Time, which is part of PEND Time.
In the DS8000, all these I/Os are executed concurrently by using PAV and MA, as shown in
Figure 14-4 on page 505. I/O from the same LPAR is executed concurrently using UCB 1FF,
which is an alias of base address 100. I/O from a different LPAR is accepted by the disk
control unit and executed concurrently. All these I/O operations are satisfied from either the
cache or one of the disk drive modules (DDMs) on a rank where the volume resides.
Important: The domain of an I/O covers the specified extents to which the I/O operation
applies. It is identified by the Define extent command in the channel program. The
domain covered by the Define extent used to be much larger than the domain covered by
the I/O operation. When concurrent I/Os to the same volume were not allowed, this was not an issue, because later I/Os waited anyway.
With the availability of PAV and MA, this extent conflict might prevent multiple I/Os from
being executed concurrently. This extent conflict can occur when multiple I/O operations try
to execute against the same domain on the volume. The solution is to update the channel
programs so that they minimize the domain that each channel program is covering. For a
random I/O operation, the domain must be the one track where the data resides.
If a write operation is being executed, any read or write to the same domain must wait. The
same case happens if a read to a domain starts, later I/Os that want to write to the same
domain must wait until the read operation is finished.
To summarize, all reads can be executed concurrently, even if they are going to the same
domain on the same volume. A write operation cannot be executed concurrently with any
other read or write operations that access the same domain on the same volume. The
purpose of serializing a write operation to the same domain is to maintain data integrity.
Channel programs that cannot execute in parallel are processed in the order that they are
queued. A fast system cannot monopolize access to a device also accessed from a slower
system. Each system gets a fair share.
The DS8000 can also queue I/Os from different z/OS system images in a priority order. z/OS
Workload Manager can use this prioritization to prioritize I/Os from one system against the
others. You can activate I/O Priority Queuing in WLM Goal mode with the I/O priority
management option in the WLM Service Definition settings.
When a channel program with a higher priority comes in and is put ahead of the queue of
channel programs with lower priorities, the priorities of the lower priority programs are
increased. This priority increase prevents high priority channel programs from dominating
lower priority ones and gives each system a fair share, based on the priority assigned by
WLM.
The WLM integration for the I/O Priority Manager is available with z/OS V1.11 and higher,
with the following necessary APARs: OA32298, OA34063, and OA34662.
With z/OS and zWLM software support, the user assigns application priorities through the
Workload Manager. z/OS then assigns an importance value to each I/O, based on the zWLM
inputs. Then, based on the prior history of I/O response times for I/Os with the same
importance value, and based on the zWLM expectations for this response time, z/OS assigns
an achievement value to each I/O. Importance and achievement values for each I/O are then
compared, and the I/O becomes associated with a performance policy, independently of the
volume performance group or policy. When a rank is overloaded, I/O is then managed
according to the preassigned zWLM performance policy, that is, I/O Priority Manager begins
throttling the I/O with the lower priority.
For a more detailed explanation of I/O Priority Manager, see DS8000 I/O Priority Manager,
REDP-4760.
In addition to these standard models, the 3390-27 supports up to 32760 cylinders and the
3390-54 supports up to 65520 cylinders. With the availability of the Extended Address
Volume (EAV), we can now support large volumes of up to 1182006 cylinders (1062 times the
capacity of a 3390-1).
Maximum size increase: Originally, the maximum size for EAV volumes was 262668
cylinders. With R6.2 of the DS8000 and the introduction of EAV II volumes, the maximum
size is increased to 1182006 cylinders.
When planning the configuration, also consider future growth. You might want to define more alias addresses than currently required, so that you can add a rank to this LCU in the future, if needed.
Random workload
Our measurements for DB2 and IMS online transaction workloads showed that there was only a slight difference in device response time between a six 3390-27 volume configuration and a 60 3390-3 volume configuration of equal capacity on the ESS-F20 using FICON channels.
The measurements for DB2 are shown in Figure 14-7. Even when the device response time
for a large volume configuration is higher, the online transaction response time can
sometimes be lower due to the reduced system overhead of managing fewer volumes.
Figure 14-7 plots the device response time (msec) for the 3390-3 and the 3390-27 configurations against the total I/O rate (2101 and 3535 IO/sec).
The measurements were carried out so that all volumes were initially assigned with zero or
one alias. WLM dynamic alias management then assigned additional aliases, as needed. The
number of aliases at the end of the test run reflects the number that was adequate to keep
IOSQ down. For this DB2 benchmark, the alias assignment done by WLM resulted in an
approximately 4:1 reduction in the total number of UCBs used.
Sequential workload
Figure 14-8 on page 510 shows elapsed time comparisons between nine 3390-3s compared
to one 3390-27 when a DFSMSdss full volume physical dump and full volume physical
restore are executed. The workloads were run on a 9672-XZ7 processor connected to an
ESS-F20 with eight FICON channels. The volumes are dumped to or restored from a single
3590E tape with an A60 Control Unit with one FICON channel. No PAV aliases were assigned
to any volumes for this test, even though an alias might improve the performance.
Larger volumes
To avoid potential I/O bottlenecks when using large volumes, you might also consider the
following suggestions:
Use PAVs to reduce IOS queuing.
Parallel Access Volume (PAV) is of key importance when using large volumes. PAV
enables one z/OS system image to initiate multiple I/Os to a device concurrently, which
keeps IOSQ times down even with many active datasets on the same volume. PAV is a
practical must with large volumes. In particular, we suggest using HyperPAV.
Multiple Allegiance is a function that the DS8000 automatically provides.
Multiple Allegiance automatically allows multiple I/Os from different z/OS systems to be
executed concurrently, which reduces the Device Busy Delay time, which is part of PEND
time.
Eliminate unnecessary reserves.
As the volume sizes grow larger, more data and datasets reside on a single CKD device
address. Thus, having a few large volumes can reduce performance when there are
significant and frequent activities on the volume that reserve the entire volume or the
VTOC/VVDS.
Certain applications might use poorly designed channel programs that define the whole volume or the whole extent of the dataset that they access as their extent range or domain, which increases the chance of extent conflicts and Device Busy Delay.
14.8 FICON
The DS8800 storage system connects to System z hosts by using FICON channels, with the
addition of Fibre Channel Protocol (FCP) connectivity for Linux on System z hosts.
FICON provides simplified system connectivity and high throughputs. The high data rate
allows short data transfer times, improving overall batch window processing times and
response times especially for data stored using large blocksizes. The pending time
component of the response time, which is caused by director port busy, is eliminated,
because collisions in the director are eliminated with the FICON architecture.
Another performance advantage delivered by FICON is that the DS8000 accepts multiple
channel command words (CCWs) concurrently without waiting for completion of the previous
CCW. Therefore, the setup and execution of multiple CCWs from a single channel happen
concurrently. Contention among multiple I/Os that access the same data is now handled in
the FICON host adapter and queued according to the I/O priority indicated by the Workload
Manager.
FICON Express2 and FICON Express4 channels on the z9® EC and z9 BC systems
introduced the support to the Modified Indirect Data Address Word (MIDAW) facility and a
maximum of 64 open exchanges per channel. The previous maximum of 32 open exchanges
was available on FICON, FICON Express, and IBM System z 990 (z990) and IBM System z
890 (z890) FICON Express2 channels.
Significant performance advantages can be realized by users that access the data remotely.
FICON eliminates the data rate droop effect for distances up to 100 km (62.1 miles) for both read
and write operations by using enhanced data buffering and pacing schemes. FICON thus
extends the DS8000 ability to deliver high bandwidth potential to the logical volumes that
need it, when they need it.
For additional information about FICON, see 8.3.1, “FICON” on page 320.
Standard FICON supports IU pacing of 16 IUs in flight. Extended Distance FICON now
extends the IU pacing for the RRS CCW chain to permit 255 IUs inflight without waiting for an
acknowledgement from the control unit, eliminating engagement between the channel and
control unit. This support allows the channel to remember the last pacing information and use
this information for later operations to avoid performance degradation at the start of a new I/O
operation.
Improved IU pacing with 255 IUs instead of 16 improves the utilization of the link. For a 4 Gbps link, the channel remains fully utilized at a distance of 50 km (31.06 miles), which effectively extends the usable distance between servers and control units.
Extended Distance FICON reduces the need for channel extenders in the DS8000 series
2-site and 3-site z/OS Global Mirror configurations because of the increased number of read
commands simultaneously inflight. This capability provides greater throughput over distance
for IBM z/OS Global Mirror (XRC) using the same distance.
Extended Distance FICON does not extend the achievable physical FICON distances, or offer
any performance enhancements in a non-z/OS GM (XRC) environment.
Extended Distance FICON is, at this time, only supported on z10, zEnterprise 196, and z114 channels. FICON Express8S, FICON Express8, FICON Express4, and FICON Express2 are
supported.
For more information about Extended Distance FICON, see IBM System Storage DS8000
Host Attachment and Interoperability, SG24-8887.
Figure 14-9 shows the major steps in the zHPF evolution. The latest step (format writes, multi-domain I/O, and QSAM/BSAM/BPAM exploitation) requires DS8700/DS8800 with R6.2, z196 FICON Express8S (FEx8S), and z/OS R11 and above. The latest generation of zHPF architecture introduced the following important enhancements:
QSAM/BSAM/BPAM support:
– Allows DSNTYPE=BASIC/LARGE datasets to use zHPF, and to achieve performance equal to or better than Extended Format datasets
– Partitioned datasets (but not “search key” I/Os for the directory)
– VTOC reads that use BSAM
Format Writes support:
– Especially important when using 4K or 8K page sizes
– Important to DB2 utility loads and restores, as well as Copy and Rebuild Index
– Important to batch jobs in general
DB2 list prefetch support:
– Important for DB2 queries when index and data are disorganized
– Enables new caching/disk algorithm, to be named later
We summarize the benefits of the zHPF architecture compared to the standard FICON
architecture:
Improve execution of small block I/O requests.
Improve the efficiency of channel processors, host bus interfaces, and adapters by
reducing the number of Information Units that need to be processed, increasing the
average bytes per frame, and reducing the number of frames per I/O that need to be
transferred.
Improve reliability, availability, and serviceability (RAS) characteristics by enabling the control unit to provide additional information for fault isolation, with an interrogate mechanism to query the control unit for a missing interrupt.
zHPF multi-track data transfers support up to 256 tracks in a single transfer operation, eliminating the 64 KB limit and allowing full exploitation of the FICON Express8S and FICON Express8 available bandwidth.
Increased performance for the zHPF protocol:
– FICON Express8S uses a hardware data router that increases the performance for
zHPF.
– The FICON Express8, FICON Express4, and FICON Express2 channels use
firmware-only zHPF implementation.
Enhancements made to both the z/Architecture and the FICON interface architecture deliver
optimizations for online transaction processing (OLTP) workloads by using channel Fibre
ports to transfer commands and data. zHPF helps to reduce the overhead that relates to the
supported commands and improve performance.
From the z/OS point of view, the existing FICON architecture is called command mode, and
zHPF architecture is called transport mode. Both modes are supported by the FICON
Express4 and FICON Express2 features. A parameter in the Operation Request Block (ORB)
is used to determine whether the FICON channel is running in command or transport mode.
There is a microcode requirement on both the CU side and the channel side to enable zHPF
dynamically at the links level. The links basically initialize themselves to allow zHPF as long
as both sides support zHPF. zHPF is exclusive to the zEnterprise 196, z114, z10 EC, and z10 BC systems.
LMC levels: The zHPF support for QSAM/BSAM/BPAM, Format Writes, and DB2 list
prefetch needs the LMC 7.60.2.xx.xx or higher for DS8800 or LMC 6.60.2.xx.xx or higher
for DS8700. Furthermore, all these enhancements are supported only on zEnterprise 196 or zEnterprise 114 (z114) systems.
zHPF is available as a licensed feature (7092 and 0709) on the DS8800. Currently, the
following releases of z/OS for zHPF are supported:
z/OS V1.11 and higher with PTFs
z/OS V1.8, V1.9, and V1.10 with the IBM Lifecycle Extension with PTFs
zHPF is enabled or disabled with the following statement in the IECIOSxx member of z/OS parmlib:
ZHPF=YES | NO
This setting can be dynamically changed using the SETIOS MVS command. Furthermore, to
use the zHPF support for QSAM/BSAM/BPAM, the following statement is introduced to the
IGDSMSxx member of z/OS parmlib:
SAM_USE_HPF(YES | NO)
Also, this setting can be dynamically changed by using the SETSMS MVS command.
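As a sketch of the dynamic commands that are referred to above (verify the exact syntax for your z/OS level), zHPF and the SAM exploitation can be switched on with:
SETIOS ZHPF=YES
SETSMS SAM_USE_HPF(YES)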
Finally, to verify whether zHPF is enabled on your system, use the following command:
/D IOS,zHPF
zHPF performance
Figure 14-10 on page 516 and Figure 14-11 on page 516 show a performance comparison
among the various generations of FICON channels with and without zHPF.
Figure 14-10 shows that zHPF brought significant throughput improvements since the introduction of the protocol with FICON Express4 channels. The most impressive improvement is with the FICON Express8S channels running zHPF, where the throughput increases 158% and 108% compared to FICON Express8S without zHPF and FICON Express8 with zHPF.
Figure 14-11 shows the improvements to the I/O per second. In this case, the improvement
with the FICON Express8S channels running zHPF is 300% and 77% compared to FICON
Express8S without zHPF and FICON Express8 with zHPF.
With the latest generation of zHPF and DS8000 microcode Version R6.2, the xSAM access
methods (QSAM/BSAM/BPAM) benefit from this enhanced architecture. Figure 14-12 compares the QSAM/BSAM/BPAM throughput with standard FICON and with zHPF for 1 stream and for 8 streams; the measurements show improvements in the range of approximately 19% to 57%.
The MIDAW facility is a modification to a channel programming technique from S/360 days.
MIDAW is a method of gathering and scattering data into and from noncontiguous storage
locations during an I/O operation, thus decreasing channel, fabric, and control unit overhead
by reducing the number of channel command words (CCWs) and frames processed. There is
no tuning needed to use this MIDAW facility. The following requirements are the minimum
requirements to take advantage of this MIDAW facility:
z9 server.
Applications that use Media Manager.
Applications that use long chains of small blocks.
The biggest performance benefit comes with FICON Express4 channels running on
4 Gbps links, especially when processing extended format datasets.
If chains of small records are processed, MIDAW can improve FICON Express4 performance
if the I/Os use Media Manager. Figure 14-14 shows improved FICON Express4 performance
for a 32x4k READ channel program.
With MIDAWs, 100 MBps is achieved at only 30% channel utilization, and 200 MBps, which is
about the limit of a 2 Gigabit/s FICON link, is achieved at about 60% channel utilization. A
FICON Express4 channel operating at 4 Gigabit/s link speeds can achieve over 325 MBps.
With zHPF, the maximum throughput of a FICON port on the DS8000 is 200 MB/s for a 2 Gb port, 400 MB/s for a 4 Gb port, and 800 MB/s for an 8 Gb port. The
maximum throughput of FICON Express channels on the System z servers is shown in
Figure 14-10 on page 516.
Considering, for instance, that the maximum throughput of a DS8000 FICON 8 Gb port is not
that much higher than the maximum throughput of a FICON Express8 channel using zHPF, in
general, we do not advise daisy chaining several FICON channels from multiple CECs onto
the same DS8000 I/O port.
However, if you have multiple DS8000s installed, it might be a good option to balance the
channel load on the System z server. You can double the number of required FICON ports on
the DS8000s and daisy chain these FICON ports to the same channels on the System z
server. This design provides the advantage of being able to balance the load on the FICON
channels, because the load on the DS8000 fluctuates during the day.
Figure 14-15 on page 520 shows configuration A with no daisy chaining. In this configuration,
each DS8000 uses 8 FICON ports and each port is connected to a separate FICON channel
on the host. In this case, we have two sets of 8 FICON ports connected to 16 FICON
channels on the System z host.
In the configuration diagrams, each line from a FICON channel in the CEC and each line from a FICON port in the DS8000 represents a set of eight paths.
If you have a workload that is truly mission-critical, consider isolating it from other workloads,
particularly if those other workloads are unpredictable in their demands.
Important: z/OS (CKD) data and Open Systems (fixed block (FB)) data are always on
different extent pools and therefore can never share the ranks or DDMs.
The device activity report accounts for all activity to a base and all of its associated alias
addresses. Activity on alias addresses is not reported separately; it is accumulated into the base address.
Starting with z/OS Release 1.10, this report also shows the number of cylinders allocated to
the volume. Example 14-1 shows 3390-9 volumes that have either 10017 or 30051 cylinders.
PAV
PAV is the number of addresses assigned to a UCB, which includes the base address plus
the number of aliases assigned to that base address.
RMF reports the number of PAV addresses (or in RMF terms, exposures) that are used by a
device. In a dynamic PAV environment, when the number of exposures changes during the
reporting interval, there is an asterisk next to the PAV number. Example 14-1 shows that
address A106 has a PAV of 8*. The asterisk indicates that the number of PAVs is either lower
or higher than 8 during the previous RMF period.
Important: The number of PAVs includes the base address plus the number of aliases
assigned to it. Thus, a PAV=1 means that the base address has no aliases assigned to it.
Example 14-2 RMF DASD report for HyperPAV volumes (report created on pre-z/OS 1.10)
D I R E C T A C C E S S D E V I C E A C T I V I T Y
9500 3390 HY9500 1.0H 0227 0.900 0.8 0.0 0.3 0.0 0.4 0.0 0.4 0.04 0.04 0.0 0.0 100.0 0.0
9501 3390 HY9501 1.0H 0227 0.000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 100.0 0.0
9502 3390 HY9502 1.0H 0227 0.000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 100.0 0.0
9503 3390 HY9503 1.0H 0227 0.000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 100.0 0.0
9504 3390 HY9504 1.0H 0227 0.000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.0 100.0 0.0
9505 3390 HY9505 9.6H 0227 5747.73 1.6 0.0 0.4 0.0 0.5 0.0 1.0 60.44 60.83 0.0 10.9 100.0 0.0
IOSQ time
IOSQ time is the time measured when an I/O request is being queued in the LPAR by z/OS.
The following situations can cause high IOSQ time:
One of the other response time components is high.
When you see a high IOSQ time, look at the other response time components to
investigate where the problem exists.
Sometimes, the IOSQ time is due to the unavailability of aliases to initiate an I/O request.
There is also a slight possibility that the IOSQ is caused by a long busy condition during
device error recovery.
PEND time
PEND time represents the time that an I/O request waits in the hardware. This PEND time
can be increased by the following conditions:
High FICON Director port or DS8000 FICON port utilization:
– High FICON Director port or DS8000 FICON port utilization can be caused by a high
activity rate on those ports.
– More commonly, high FICON Director port or DS8000 FICON port utilization is due to
daisy chaining multiple FICON channels from different CECs to the same port on the
FICON Director or the DS8000 FICON host adapter.
In this case, the FICON channel utilization as seen from the host might be low, but the
combination or sum of the utilization of these channels that share the port (either on
the Director or the DS8000) can be significant.
– For more information, see “FICON director” on page 527 and “DS8000 FICON/Fibre
port and host adapter” on page 530.
High FICON host adapter utilization. Using too many ports within a DS8000 host adapter
can overload the host adapter. We suggest that only two out of the four ports in a host
adapter are used.
I/O Processor (IOP/SAP) contention at the System z host. More IOP might be needed.
IOP is the processor in the CEC that is assigned to handle I/Os. For more information, see
“IOP/SAP” on page 525.
CMR Delay is a component of PEND time. See Example 14-1 on page 522. It is the initial
selection time for the first I/O command in a chain for a FICON channel. It can be
elongated by contention downstream from the channel, such as a busy control unit.
Device Busy Delay is also a component of PEND time. See Example 14-1 on page 522.
Device Busy Delay is caused by a domain conflict, because of a read or write operation
against a domain that is in use for update. A high Device Busy Delay time can be caused
by the domain of the I/O not being limited to the track that the I/O operation is accessing. If
you use an Independent Software Vendor (ISV) product, ask the vendor for an updated
version, which might help solve this problem.
DISC time
If the major cause of delay is the DISC time, you need to search more to find the cause. The
most probable cause of high DISC time is having to wait while data is being staged from the
DS8000 rank into cache, because of a read miss operation. This time can be elongated by
the following conditions:
Low read hit ratio. See “Cache and NVS” on page 528. The lower the read hit ratio, the
more read operations must wait for the data to be staged from the DDMs to the cache.
Adding cache to the DS8000 can increase the read hit ratio.
CONN time
For each I/O operation, the channel subsystem measures the time that the DS8000, channel,
and CEC are connected during the data transmission. When there is a high level of utilization
of resources, significant time can be spent in contention, rather than transferring data. Several
reasons exist for high CONN time:
FICON channel saturation. If the channel or BUS utilization at the host exceeds 50%, it
elongates the CONN time. See 14.11.4, “FICON host channel” on page 526. In FICON
channels, the data transmitted is divided into frames. When the channel is busy with
multiple I/O requests, the frames from an I/O are multiplexed with the frames from other
I/Os, thus elongating the elapsed time that it takes to transfer all of the frames that belong
to that I/O. The total of this time, including the transmission time of the other multiplexed
frames, is counted as CONN time.
Contention in the FICON Director, FICON port, and FICON host adapter elongate the
PEND time, which also has the same effect on CONN time. See the PEND time
discussion in “PEND time” on page 524.
Rank saturation caused by high DDM utilization increases DISC time, which also
increases CONN time. See the DISC time discussion in “DISC time” on page 524.
14.11.3 IOP/SAP
The IOP/SAP is the CEC processor that handles the I/O operation. We check the I/O
QUEUING ACTIVITY report (Example 14-3) to determine whether the IOP is saturated. An
average queue length greater than 1 indicates that the IOP is saturated, although an average queue length greater than 0.5 is already considered a warning sign. A burst of I/O can also trigger a high average queue length.
If only certain IOPs are saturated, redistributing the channels assigned to the disk
subsystems can help balance the load to the IOP, because an IOP is assigned to handle a
certain set of channel paths. So, assigning all of the channels from one IOP to access a busy
disk subsystem can cause a saturation on that particular IOP. See the appropriate hardware
manual of the CEC that you use.
- INITIATIVE QUEUE - ------- IOP UTILIZATION ------- -- % I/O REQUESTS RETRIED -- -------- RETRIES / SSCH ---------
IOP ACTIVITY AVG Q % IOP I/O START INTERRUPT CP DP CU DV CP DP CU DV
RATE LNGTH BUSY RATE RATE ALL BUSY BUSY BUSY BUSY ALL BUSY BUSY BUSY BUSY
00 259.349 0.12 0.84 259.339 300.523 31.1 31.1 0.0 0.0 0.0 0.45 0.45 0.00 0.00 0.00
01 127.068 0.14 100.0 126.618 130.871 50.1 50.1 0.0 0.0 0.0 1.01 1.01 0.00 0.00 0.00
02 45.967 0.10 98.33 45.967 54.555 52.0 52.0 0.0 0.0 0.0 1.08 1.08 0.00 0.00 0.00
03 262.093 1.72 0.62 262.093 279.294 32.9 32.9 0.0 0.0 0.0 0.49 0.49 0.00 0.00 0.00
SYS 694.477 0.73 49.95 694.017 765.243 37.8 37.8 0.0 0.0 0.0 0.61 0.61 0.00 0.00 0.00
If these numbers exceed the threshold, you observe an elongated CONN time.
For small block transfers, the BUS utilization is less than the FICON channel utilization. For
large block transfers, the BUS utilization is greater than the FICON channel utilization.
The Generation (G) field in the channel report shows the combination of the generation
FICON channel that is being used and the speed of the FICON channel link for this CHPID at
the time of the machine IPL. The G field does not include any information about the link
between the director and the DS8000. See Table 14-1.
1 Link between the channel and the director is operating at 1 Gbps, which is applicable
to a FICON Express channel
2 Link between the channel and the director is operating at 2 Gbps, which is applicable
to a FICON Express channel
3 Link between the channel and the director is auto negotiating to 1 Gbps, which is
applicable to a FICON Express2 or FICON Express4 channel
4 Link between the channel and the director is auto negotiating to 2 Gbps, which is
applicable to a FICON Express2 or FICON Express4 channel
5 Link between the channel and the director is operating at 4 Gbps, which is applicable
to a FICON Express4 channel
7 Link between the channel and the director is auto negotiating to 2 Gbps, which is
applicable to a FICON Express8 channel
8 Link between the channel and the director is auto negotiating to 4 Gbps, which is
applicable to a FICON Express8 channel
9 Link between the channel and the director is operating at 8 Gbps, which is applicable
to a FICON Express8 channel
11 Link between the channel and the director is operating at 2 Gbps, which is applicable
to a FICON Express8S channel
12 Link between the channel and the director is operating at 4 Gbps, which is applicable
to a FICON Express8S channel
13 Link between the channel and the director is operating at 8 Gbps, which is applicable
to a FICON Express8S channel
If the channel is point-to-point connected to the DS8000 FICON port, the G field indicates the
speed that was negotiated between the FICON channel and the DS8000 port.
The measurements provided for a port in this report include the I/O for the system on which
the report is taken and also include all I/Os that are directed through this port, regardless of
which LPAR requests the I/O.
PORT -CONNECTION- AVG FRAME AVG FRAME SIZE PORT BANDWIDTH (MB/SEC) ERROR
ADDR UNIT ID PACING READ WRITE --READ-- --WRITE-- COUNT
05 CHP FA 0 808 285 50.04 10.50 0
07 CHP 4A 0 149 964 20.55 5.01 0
09 CHP FC 0 558 1424 50.07 10.53 0
0B CHP-H F4 0 872 896 50.00 10.56 0
12 CHP D5 0 73 574 20.51 5.07 0
13 CHP C8 0 868 1134 70.52 2.08 1
14 SWITCH ---- 0 962 287 50.03 10.59 0
15 CU C800 0 1188 731 20.54 5.00 0
Sometimes, you see a saturation when you run a benchmark to test a new disk subsystem.
Usually, in a benchmark, you try to run at the highest possible I/O rate on the disk subsystem.
The report shows the I/O requests by read and by write. It shows the rate, the hit rate, and the
hit ratio of the read and the write activities. The read-to-write ratio is also calculated. The total
I/O requests can be higher than the I/O rate shown in the DASD report. In the DASD report,
one channel program is counted as one I/O. However, in the cache report, if there are multiple
Locate Record commands in a channel program, each Locate Record command is counted
as one I/O request.
In this report, we can check to see the value of the read hit ratio. Low read hit ratios contribute
to higher DISC time. For a cache friendly workload, we see a read hit ratio of better than 90%.
The write hit ratio is usually 100%.
High DFW BYPASS is an indication that persistent memory or nonvolatile storage (NVS) is
overcommitted. DFW BYPASS means DASD Fast Write I/Os that are retried, because
persistent memory is full. Calculate the quotient of DFW BYPASS divided by the total I/O rate.
As a rule, if this number is higher than 1%, the write retry operations significantly affect the
DISC time.
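For example, with hypothetical values of 5 DFW BYPASS operations per second against a total I/O rate of 400 per second, the quotient is 5 / 400 = 1.25%, which is above the 1% guideline and indicates that the NVS constraint is already affecting DISC time.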
Check the DISK ACTIVITY part of the report. The Read response time must be less than 35
ms. If it is higher than 35 ms, it is an indication that the DDMs on the rank where this LCU
resides are saturated.
------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM STATUS
------------------------------------------------------------------------------------------------------------------------------------
SUBSYSTEM STORAGE NON-VOLATILE STORAGE STATUS
CONFIGURED 31104M CONFIGURED 1024.0M CACHING - ACTIVE
AVAILABLE 26290M PINNED 0.0 NON-VOLATILE STORAGE - ACTIVE
PINNED 0.0 CACHE FAST WRITE - ACTIVE
OFFLINE 0.0 IML DEVICE AVAILABLE - YES
------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM OVERVIEW
------------------------------------------------------------------------------------------------------------------------------------
TOTAL I/O 19976 CACHE I/O 19976 CACHE OFFLINE 0
TOTAL H/R 0.804 CACHE H/R 0.804
CACHE I/O -------------READ I/O REQUESTS------------- ----------------------WRITE I/O REQUESTS---------------------- %
REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ
NORMAL 14903 252.6 10984 186.2 0.737 5021 85.1 5021 85.1 5021 85.1 1.000 74.8
SEQUENTIAL 0 0.0 0 0.0 N/A 52 0.9 52 0.9 52 0.9 1.000 0.0
CFW DATA 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A
TOTAL 14903 252.6 10984 186.2 0.737 5073 86.0 5073 86.0 5073 86.0 1.000 74.6
-----------------------CACHE MISSES----------------------- ------------MISC------------ ------NON-CACHE I/O-----
REQUESTS READ RATE WRITE RATE TRACKS RATE COUNT RATE COUNT RATE
DFW BYPASS 0 0.0 ICL 0 0.0
NORMAL 3919 66.4 0 0.0 3921 66.5 CFW BYPASS 0 0.0 BYPASS 0 0.0
SEQUENTIAL 0 0.0 0 0.0 0 0.0 DFW INHIBIT 0 0.0 TOTAL 0 0.0
CFW DATA 0 0.0 0 0.0 ASYNC (TRKS) 3947 66.9
TOTAL 3919 RATE 66.4
---CKD STATISTICS--- ---RECORD CACHING--- ----HOST ADAPTER ACTIVITY--- --------DISK ACTIVITY-------
BYTES BYTES RESP BYTES BYTES
WRITE 0 READ MISSES 0 /REQ /SEC TIME /REQ /SEC
WRITE HITS 0 WRITE PROM 3456 READ 6.1K 1.5M READ 6.772 53.8K 3.6M
WRITE 5.7K 491.0K WRITE 12.990 6.8K 455.4K
Following the report in Example 14-5 is the CACHE SUBSYSTEM ACTIVITY report by
volume serial number, as shown in Example 14-6. You can see to which extent pool each
volume belongs. In the case where we have the following setup, it is easier to perform the
analysis if a performance problem happens on the LCU:
One extent pool has one rank.
All volumes on an LCU belong to the same extent pool.
If we look at the rank statistics in the report in Example 14-9 on page 532, we know that all
the I/O activity on that rank comes from the same LCU. So, we can concentrate our analysis
on the volumes on that LCU only.
Important: Depending on the DDM size used and the 3390 model selected, you can put
multiple LCUs on one rank, or you can also have an LCU that spans more than one rank.
------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM DEVICE OVERVIEW
------------------------------------------------------------------------------------------------------------------------------------
VOLUME DEV XTNT % I/O ---CACHE HIT RATE-- ----------DASD I/O RATE---------- ASYNC TOTAL READ WRITE %
SERIAL NUM POOL I/O RATE READ DFW CFW STAGE DFWBP ICL BYP OTHER RATE H/R H/R H/R READ
*ALL 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6
*CACHE-OFF 0.0 0.0
*CACHE 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6
PR7000 7000 0000 22.3 75.5 42.8 19.2 0.0 13.5 0.0 0.0 0.0 0.0 14.4 0.821 0.760 1.000 74.6
PR7001 7001 0000 11.5 38.8 20.9 10.5 0.0 7.5 0.0 0.0 0.0 0.0 7.6 0.807 0.736 1.000 73.1
PR7002 7002 0000 11.1 37.5 20.4 9.5 0.0 7.6 0.0 0.0 0.0 0.0 7.0 0.797 0.729 1.000 74.7
PR7003 7003 0000 11.3 38.3 22.0 8.9 0.0 7.4 0.0 0.0 0.0 0.0 6.8 0.806 0.747 1.000 76.8
PR7004 7004 0000 3.6 12.0 6.8 3.0 0.0 2.3 0.0 0.0 0.0 0.0 2.6 0.810 0.747 1.000 75.2
PR7005 7005 0000 3.7 12.4 6.8 3.2 0.0 2.4 0.0 0.0 0.0 0.0 2.7 0.808 0.741 1.000 74.1
PR7006 7006 0000 3.8 12.8 6.5 3.6 0.0 2.6 0.0 0.0 0.0 0.0 3.1 0.796 0.714 1.000 71.5
PR7007 7007 0000 3.6 12.3 6.9 3.1 0.0 2.4 0.0 0.0 0.0 0.0 2.5 0.806 0.742 1.000 75.2
PR7008 7008 0000 3.6 12.2 6.7 3.4 0.0 2.2 0.0 0.0 0.0 0.0 2.7 0.821 0.753 1.000 72.5
If you specify REPORTS(CACHE(DEVICE)) when running the cache report, you get the
detailed report by volume as in Example 14-7. This report gives you the detailed cache
statistics of each volume. By specifying REPORTS(CACHE(SSID(nnnn))), you can limit this
report to only certain LCUs.
The report shows the same performance statistics as in Example 14-5 on page 529, but at
the level of each volume.
The report shows that the ports are running at 2 Gbps. There are FICON ports, shown under
the heading of LINK TYPE as ECKD READ and ECKD WRITE. There are also PPRC ports,
shown as PPRC SEND and PPRC RECEIVE.
The I/O INTENSITY is the product of the operations per second and the response time per
operation. For FICON ports, it is calculated for both the read and the write operations.
For FICON ports, if the total I/O intensity reaches 4000, the response time is affected, most
probably the PEND and CONN times. When this number approaches 2000, proactive actions
might be needed to prevent a further increase in the total I/O intensity. See the discussion of
PEND and CONN times in “PEND time” on page 524 and “CONN time” on page 525. This rule
does not apply to PPRC ports, especially if the distance between the primary site and the
secondary site is significant.
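To illustrate the calculation, the following Python sketch computes the I/O intensity per port and flags the 2000 and 4000 levels that are discussed here. The port numbers, rates, and response times are made-up values for illustration, not taken from the report examples.

# Sketch: FICON port I/O intensity = I/Os per second x response time (ms),
# with the advisory levels discussed in the text. Port names and values are
# hypothetical; real numbers come from the link statistics report.

PROACTIVE_LEVEL = 2000   # start planning corrective actions
CRITICAL_LEVEL = 4000    # PEND and CONN times are probably affected

ports = {
    # port: (read IO/s, read resp ms, write IO/s, write resp ms)
    "0030": (1500.0, 1.2, 600.0, 0.9),
    "0031": (2800.0, 1.4, 900.0, 1.1),
}

for port, (r_rate, r_resp, w_rate, w_resp) in ports.items():
    intensity = r_rate * r_resp + w_rate * w_resp
    if intensity >= CRITICAL_LEVEL:
        status = "response time likely affected"
    elif intensity >= PROACTIVE_LEVEL:
        status = "plan proactive actions"
    else:
        status = "OK"
    print(f"port {port}: total I/O intensity {intensity:.0f} ({status})")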
If the DS8000 is shared between System z and Open Systems, the report in Example 14-8
also shows the port activity used by the Open Systems. It shows up as SCSI READ and SCSI
WRITE on ports 0200 and 0201.
Example 14-8 DS8000 link statistics
E S S L I N K S T A T I S T I C S
Slot 0          Slot 3          Slot 0          Slot 3
Example 14-9 Rank statistics for multiple ranks on one extent pool that uses storage pool striping
E S S R A N K S T A T I S T I C S
For more information about SAN Volume Controller, see Implementing the IBM System
Storage SAN Volume Controller V6.3, SG24-7933.
The SAN Volume Controller solution is designed to reduce both the complexity and costs of
managing your SAN-based storage. With the SAN Volume Controller, you can perform these
tasks:
Simplify management and increase administrator productivity by consolidating storage
management intelligence from disparate storage controllers into a single view.
Improve application availability by enabling data migration between disparate disk storage
devices non-disruptively.
Improve disaster recovery and business continuance needs by applying and managing
Copy Services across disparate disk storage devices within the SAN.
Provide advanced features and functions to the entire SAN:
– Large scalable cache
– Advanced Copy Services
– Space management
– Mapping based on desired performance characteristics
– Quality of Service (QoS) metering and reporting
For I/O purposes, SAN Volume Controller nodes within the cluster are grouped into pairs
(called I/O Groups). A single pair is responsible for serving I/O on a specific virtual disk
(VDisk) volume. One node within the I/O Group represents the preferred path for I/O to a
specific VDisk volume, and the other node represents the non-preferred path. This preference
alternates between nodes as each volume is created within an I/O Group to balance the
workload evenly between the two nodes.
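As a simple illustration of this alternation, the following sketch assigns preferred nodes round-robin as volumes are created in an I/O Group. The node and volume names are placeholders; the SAN Volume Controller performs this assignment itself.

# Sketch: alternate the preferred node between the two nodes of an I/O Group
# as volumes are created, so that the workload is balanced between them.
# This only models the idea; it is not the product algorithm.

io_group_nodes = ("node1", "node2")

def assign_preferred_nodes(volumes):
    assignment = {}
    for i, vdisk in enumerate(volumes):
        preferred = io_group_nodes[i % 2]
        non_preferred = io_group_nodes[(i + 1) % 2]
        assignment[vdisk] = (preferred, non_preferred)
    return assignment

for vdisk, (pref, alt) in assign_preferred_nodes(
        ["vdisk0", "vdisk1", "vdisk2", "vdisk3"]).items():
    print(f"{vdisk}: preferred {pref}, non-preferred {alt}")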
Beyond automatic configuration and cluster administration, the data transmitted from
attached application servers is also treated in the most reliable manner. When data is written
by the server, the preferred node within the I/O Group stores the write data in its own write
cache and the write cache of its partner (non-preferred) node before sending an I/O complete
status back to the server application. To ensure that data is written in the event of a node
failure, the surviving node empties its write cache and proceeds in write-through mode until
the cluster is returned to a fully operational state.
Write-through mode: Write-through mode is where the data is not cached in the nodes
but is written directly to the disk system instead. While operating in this mode, performance
is slightly degraded.
Furthermore, each node in the I/O Group is protected by its own dedicated uninterruptible
power supply.
The SAN must be zoned in such a way that the application servers cannot see the back-end
storage, preventing the SAN Volume Controller and the application servers from both trying to
manage the back-end storage. In the SAN fabric, distinct zones are defined:
In the server zone, the server systems can identify and address the nodes. You can have
more than one server zone. Generally, you create one server zone per server attachment.
In the disk zone, the nodes can identify the disk storage subsystems. Generally, you
create one zone for each distinct storage subsystem.
In the SAN Volume Controller zone, all SAN Volume Controller node ports are permitted to
communicate for cluster management.
Where remote Copy Services are used, an inter-cluster zone must be created.
The SAN Volume Controller I/O Groups are connected to the SAN in such a way that all
back-end storage and all application servers are visible to all of the I/O Groups. The SAN
Volume Controller I/O Groups see the storage presented to the SAN by the back-end
controllers as a number of disks, known as Managed Disks (MDisks). Because the SAN
Volume Controller does not attempt to provide recovery from physical disk failures within the
back-end controllers, MDisks are usually, but not necessarily, part of a RAID array.
MDisks are collected into one or several groups, known as Managed Disk Groups (MDGs), or
storage pools. When an MDisk is assigned to a storage pool, the MDisk is divided into a
number of extents (extent minimum size is 16 MiB and extent maximum size is 8 GiB). The
extents are numbered sequentially from the start to the end of each MDisk.
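As a quick sanity check, the following sketch shows how many extents an MDisk contributes to a storage pool for a given extent size within the 16 MiB to 8 GiB range. The MDisk capacities are arbitrary example values.

# Sketch: number of extents that an MDisk contributes to a storage pool.
# MDisk capacities are example values; the extent size is a pool-wide property.

MIB = 1
GIB = 1024 * MIB

def extents_in_mdisk(mdisk_capacity_mib, extent_size_mib):
    if not 16 * MIB <= extent_size_mib <= 8 * GIB:
        raise ValueError("extent size must be between 16 MiB and 8 GiB")
    return mdisk_capacity_mib // extent_size_mib   # partial extents are not usable

# A 2 TiB MDisk with a 1 GiB extent size contributes 2048 extents.
print(extents_in_mdisk(2 * 1024 * GIB, 1 * GIB))
# A 600 GiB MDisk with a 256 MiB extent size contributes 2400 extents.
print(extents_in_mdisk(600 * GIB, 256 * MIB))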
Using MDisks only: For performance reasons, we suggest that, for a single-tiered pool, you
create storage pools that use only MDisks that have the same performance and reliability
characteristics. For a multi-tiered storage pool, this consideration applies to the
respective tier.
The storage pool provides the capacity in the form of extents, which are used to create
volumes, also known as Virtual Disks (VDisks).
When creating SAN Volume Controller volumes or VDisks, the default option of striped
allocation is normally the best choice. This option helps to balance I/Os across all the
managed disks in a storage pool, which optimizes overall performance and helps to reduce
hot spots. Conceptually, this method is represented in Figure 15-1.
Figure 15-1 Striped allocation in a storage pool: a VDisk is a collection of extents (each 16 MiB to 8 GiB)
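The following sketch illustrates the idea of striped allocation: volume extents are taken round-robin from the MDisks in the pool. This is a simplified conceptual model, not the actual SAN Volume Controller allocator, and the MDisk names are placeholders.

# Sketch: simplified striped (round-robin) extent allocation for a new volume.
# The real allocator also tracks free extents per MDisk and skips full MDisks.

def allocate_striped(volume_extents, mdisks):
    """Return a list of (mdisk, extent_index_on_that_mdisk) tuples."""
    next_free = {m: 0 for m in mdisks}
    placement = []
    for i in range(volume_extents):
        mdisk = mdisks[i % len(mdisks)]
        placement.append((mdisk, next_free[mdisk]))
        next_free[mdisk] += 1
    return placement

# A 16-extent volume striped across four MDisks: 4 extents land on each MDisk.
for mdisk, ext in allocate_striped(16, ["mdisk0", "mdisk1", "mdisk2", "mdisk3"]):
    print(mdisk, ext)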
The virtualization function in the SAN Volume Controller maps the volumes seen by the
application servers onto the MDisks provided by the back-end controllers. I/O traffic for a
particular volume is, at any one time, handled exclusively by the nodes in a single I/O Group.
Thus, although a cluster can have many nodes within it, the nodes handle I/O in independent
pairs, which means that the I/O capability of the SAN Volume Controller scales well (almost
linearly), because additional throughput can be obtained by adding additional I/O Groups.
Figure 15-2 on page 537 summarizes the various relationships that bridge the physical disks
through to the virtual disks within the SAN Volume Controller architecture.
Figure 15-2 From physical disks to virtual disks: RAID arrays on the back-end RAID controllers are
presented across the fabric as SCSI LUNs, which are directly mapped to the SAN Volume Controller cluster
as Managed Disks; Managed Disks are grouped into Managed Disk Groups (storage pools, for example,
high-performance or low-cost) depending on their characteristics, and the SAN Volume Controller (2145)
virtualization engine manages the relation between Managed Disks and the Virtual Disks seen by the hosts
Because most operating systems cannot resolve multiple paths back to a single physical
device, IBM provides a multipathing device driver. The multipathing driver supported by the
SAN Volume Controller is the IBM Subsystem Device Driver (SDD). SDD groups all available
paths to a virtual disk device and presents them as a single device to the operating system.
SDD performs all the
path handling and selects the active I/O paths.
SDD supports the concurrent attachment of various DS8000 models, DS6800, ESS,
Storwize® V7000, and SAN Volume Controller storage systems to the same host system.
Where one or more alternate storage systems are to be attached, you can identify the
required version of SDD at this website:
http://www.ibm.com/support/docview.wss?uid=ssg1S7001350
You can use SDD with the native Multipath I/O (MPIO) device driver on AIX and on Microsoft
Windows Server 2003 and Windows Server 2008. For AIX MPIO, a Path Control Module
(SDDPCM) is provided to deliver I/O load balancing. The Subsystem Device Driver Device
Specific Module (SDDDSM) provides multipath I/O support based on the MPIO technology of
Microsoft. For newer Linux versions, a Device Mapper Multipath configuration file is available.
15.1.3 SAN Volume Controller Advanced Copy Services
The SAN Volume Controller provides Advanced Copy Services so that you can copy volumes
(VDisks) by using FlashCopy and Remote Copy functions. These Copy Services are available
for all supported servers that connect to the SAN Volume Controller cluster.
FlashCopy makes an instant, point-in-time copy from a source VDisk volume to a target
volume. A FlashCopy can be made only to a volume within the same SAN Volume Controller.
Metro Mirror is a synchronous remote copy, which provides a consistent copy of a source
volume to a target volume. Metro Mirror can copy between volumes (VDisks) on separate
SAN Volume Controller clusters or between volumes within the same I/O Group on the same
SAN Volume Controller.
Global Mirror is an asynchronous remote copy, which provides a remote copy over extended
distances. Global Mirror can copy between volumes (VDisks) on separate SAN Volume
Controller clusters or between volumes within the same I/O Group on the same SAN Volume
Controller.
Important: SAN Volume Controller Copy Services functions are incompatible with the
DS8000 Copy Services.
For details about the configuration and management of SAN Volume Controller Copy
Services, see the Advanced Copy Services chapters of Implementing the IBM System
Storage SAN Volume Controller V6.3, SG24-7933, or SAN Volume Controller V4.3.0
Advanced Copy Services, SG24-7574.
A FlashCopy mapping can be created between any two VDisk volumes in a SAN Volume
Controller cluster. It is not necessary for the volumes to be in the same I/O Group or storage
pool. This functionality can optimize your storage allocation by using an auxiliary storage
system (with, for example, lower performance) as the target of the FlashCopy. In this case,
the resources of your high-performance storage system are dedicated for production. Your
low-cost (lower performance) storage system is used for a secondary application (for
example, backup or development).
An advantage of SAN Volume Controller remote copy is that we can implement these
relationships between two SAN Volume Controller clusters with different back-end disk
subsystems. In this case, you can reduce the overall cost of the disaster-recovery
infrastructure. The production site can use high-performance back-end disk systems, and the
recovery site can use low-cost back-end disk systems, even where the back-end disk
subsystem Copy Services functions are not compatible (for example, different models or
different manufacturers). This relationship is established at the volume level and does not
depend on the back-end disk storage system Copy Services.
In the following section, we present the SAN Volume Controller concepts and discuss the
performance of the SAN Volume Controller. In this section, we assume that there are no
bottlenecks in the SAN or on the disk system.
To determine the number of I/O Groups and to monitor the CPU performance of each node,
you can use Tivoli Storage Productivity Center (TPC). The CPU performance is related to I/O
performance and when the CPUs become consistently 70% busy, you must consider one of
these actions:
Adding more nodes to the cluster and moving part of the workload onto the new nodes
Moving VDisk volumes to another I/O Group, if the other I/O Group is not busy
Important: A VDisk volume can only be moved to another I/O Group if there is no I/O
activity on that volume. Any data in cache on the server must be destaged to disk; it is an
off-line operation from the host perspective. The SAN zoning and port masking might need
to be updated to give access.
To see how busy your CPUs are, you can use the Tivoli Storage Productivity Center
performance report, by selecting the CPU Utilization option.
We suggest that you start a Disk Magic sizing project to consider the correct number of node
pairs.
With the newly added I/O Group, the SAN Volume Controller cluster can potentially double
the I/O rate per second (IOPS) that it can sustain. A SAN Volume Controller cluster can be
scaled up to an eight-node cluster, which quadruples the total I/O rate.
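The following sketch applies the 70% rule that is discussed above to each node and shows how the cluster IOPS capability scales with the number of I/O Groups. The utilization values and the per-I/O-Group IOPS figure are illustrative assumptions, not measured or published numbers.

# Sketch: flag SAN Volume Controller nodes that are consistently over 70% CPU
# busy, then show the linear IOPS scaling with additional I/O Groups.

CPU_THRESHOLD = 70.0  # percent

node_cpu_busy = {"iogrp0-node1": 74.0, "iogrp0-node2": 71.5,
                 "iogrp1-node1": 35.0, "iogrp1-node2": 33.0}

overloaded = [n for n, busy in node_cpu_busy.items() if busy >= CPU_THRESHOLD]
if overloaded:
    print("Consider adding nodes or moving volumes; busy nodes:", overloaded)

def cluster_iops(iops_per_io_group, io_groups):
    # An eight-node cluster has four I/O Groups, quadrupling the two-node figure.
    return iops_per_io_group * io_groups

print(cluster_iops(100_000, 1), cluster_iops(100_000, 4))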
The SAN Volume Controller ports are more heavily loaded than the ports of a “native” storage
system, because the SAN Volume Controller nodes need to handle all of the following I/O
traffic:
All the host I/O
The read-cache miss I/O (the SAN Volume Controller cache-hit rate is less than the rate of
a DS8000)
All write destage I/Os (doubled if VDisk mirroring is used)
All writes for cache mirroring
Traffic for remote mirroring
You must carefully plan the SAN Volume Controller port bandwidth.
For the DS8000, there is no controller affinity for the LUNs. So, a single zone for all SAN
Volume Controller ports and up to eight DS8000 host adapter (HA) ports must be defined on
each fabric. The DS8000 HA ports must be distributed over as many HA cards as available
and dedicated to SAN Volume Controller use if possible. Using two or three ports on each
DS8000 HA card provides the maximum bandwidth.
Configure a minimum of eight controller ports to the SAN Volume Controller per controller
regardless of the number of nodes in the cluster. Configure 16 controller ports for large
controller configurations where more than 48 DS8000 ranks are being presented to the SAN
Volume Controller cluster.
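This guidance reduces to a simple rule, sketched here. The numbers only restate the guideline above; they are not a formula from the product documentation.

# Sketch: DS8000 controller ports to present to the SAN Volume Controller,
# following the guideline above: at least 8 ports, and 16 ports when more
# than 48 ranks are presented to the cluster.

def ds8000_ports_for_svc(ranks_presented):
    return 16 if ranks_presented > 48 else 8

for ranks in (16, 48, 64):
    print(f"{ranks} ranks -> configure {ds8000_ports_for_svc(ranks)} controller ports")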
For the extent size, to maintain maximum flexibility, an extent size of 1 GiB (1024 MiB) is
suggested.
For additional information, see the SAN Volume Controller Best Practices and Performance
Guidelines, SG24-7521.
There are a number of workload attributes that influence the relative performance of RAID 5
compared to RAID 10, including the use of cache, the relative mix of read as opposed to write
operations, and whether data is referenced randomly or sequentially.
SAN Volume Controller does not need to influence your choice of the RAID type used. For
more details about the RAID 5 and RAID 10 differences, see 4.7, “Planning RAID arrays and
ranks” on page 103.
The DS8000 divides each rank into 1 GiB extents (where 1 GiB = 2^30 bytes). A rank must be assigned to
an extent pool to be available for LUN creation.
The DS8000 processor complex (or server group) affinity is determined when the rank is
assigned. Assign the same number of ranks in a DS8000 to each of the processor
complexes. Additionally, if you do not need to use all of the arrays for your SAN Volume
Controller storage pool, select the arrays so that you use arrays from as many device
adapters (DAs) as possible, to best balance the load across the DAs also.
When adding more ranks (MDisks) into the same storage pool, start a manual re-striping by
using a scripting tool, as shown in section 5.6 of SAN Volume Controller Best Practices and
Performance Guidelines, SG24-7521.
Often, clients worked with at least two storage pools: one (or two) containing MDisks of all the
6+P RAID 5 ranks of the DS8000 and the other one (or more) containing the slightly larger
7+P RAID 5 ranks. This approach maintains equal load balancing across all ranks when the
SAN Volume Controller striping occurs, because each MDisk in a storage pool is the same
size then.
The SAN Volume Controller extent size is the stripe size used to stripe across all these
single-rank MDisks.
This approach delivered good performance and has its justifications. However, it also has a
few drawbacks:
There can always be natural skew, for instance, a small file of a few hundred KiB that is
heavily accessed. Even with a smaller SAN Volume Controller extent size, such as
256 MiB, this classical setup led in a few cases to ranks that are more loaded than other
ranks.
1. Starting with SAN Volume Controller 6.1, IBM introduced support for external managed disks (MDisks) larger than
2 TiB for certain types of storage systems. But it was not until SAN Volume Controller v6.2 that the DS8800 and
DS8700 were identified in the SAN Volume Controller 6.2 Interoperability Matrix as having support for MDisks greater
than 2 TiB. The SAN Volume Controller 6.3 restrictions document, which is available at
http://www.ibm.com/support/docview.wss?rs=591&uid=ssg1S1003903#_Extents, provides a table with the maximum
size of an MDisk dependent on the extent size of the storage pool.
An advantage of this classical approach is that it delivers more options for fault isolation and
control over where a certain volume and extent are located.
You need only one MDisk volume size with this approach, because plenty of space is
available in each large DS8000 extent pool. Often, clients choose 2 TiB (2048 GiB) MDisks
for this approach. Create many 2-TiB volumes in each extent pool until the DS8000 extent
pool is full, and provide these MDisks to the SAN Volume Controller to build the storage pools.
Two extent pools are still needed so that each DS8000 processor complex (even/odd) is
loaded.
You might expect that, with large multi-rank extent pools and DS8000 storage pool striping,
issues with overloaded ranks disappear, even when Easy Tier auto-rebalancing is not used.
When enlarging the extent pools, even without auto-rebalancing, re-striping is simple: you
use only one DS8000 command. And with Easy Tier auto-rebalancing, which is now available
on DS8000 R6.2 and higher, you do not need to pay attention to load rebalancing when
enlarging pools. The auto-rebalancing handles this task.
You can also introduce storage tiering with solid-state drives (SSDs), Enterprise drives, and
nearline disks. DS8000 Easy Tier auto-tiering can help when creating a Tier 0 (SSDs) and a
combined Tier 1+ (Enterprise+nearline HDDs). SAN Volume Controller based Easy Tier can
perform the overall tiering between Tier 0 (SSD) and Tier 1+ (HDD). The DS8000 based Easy
Tier can tier the storage within Tier 1+, that is, between nearline and Enterprise drives. If you
use DS8000 Easy Tier, do not use 100% of your extent pools; you must leave space for Easy
Tier so that it can work.
To maintain the highest flexibility and for easier management, large DS8000 extent pools are
beneficial. However, if the SAN Volume Controller DS8000 installation is dedicated to
shared-nothing environments, such as Oracle ASM, DB2 warehouses, or General Parallel
File System (GPFS), use the single-rank extent pools.
With the DS8000 supporting volume sizes up to 16 TiB and SAN Volume Controller 6.2+
levels supporting these MDisk sizes, a classical approach is still possible when using large
disks, such as with an array of the 3 TB nearline disks (RAID 6). The volume size in this case
is determined by the rank capacity.
With the modern approach of using large multi-rank extent pools, more clients use a standard
volume MDisk size, such as 2 TiB for all MDisks, with good results.
We suggest that you assign DS8000 LUNs of the same size to the SAN Volume Controller for
each storage pool. In this configuration, the workload applied to a Virtual Disk is equally
balanced across the Managed Disks within the storage pool.
A DS8000 LUN assigned as an MDisk can be expanded only if the MDisk is removed from the
storage pool first, which automatically redistributes the defined VDisk extents to other MDisks
in the pool, provided there is space available. The LUN can then be expanded, detected as a
new MDisk, and reassigned to a storage pool.
Nearline drives
Nearline (7,200 rpm) drives are in general unsuited for use as SAN Volume Controller MDisks
for high-performance applications. However, if DS8000 Easy Tier is used, they might be
added to pools that already consist of many Enterprise HDDs that take the main part of the
load, or they can be part of a tiered concept.
Volumes can be added dynamically to the SAN Volume Controller. When the volume is added
to the volume group, run the command svctask detectmdisk on the SAN Volume Controller
to add it as a new MDisk.
Before you delete or unmap a volume allocated to the SAN Volume Controller, remove the
MDisk from the SAN Volume Controller storage pool, which automatically migrates any
extents for defined volumes to other MDisks in the storage pool, provided there is space
available. When it is unmapped on the DS8000, run the command svctask detectmdisk and
then run the maintenance procedure on the SAN Volume Controller to confirm its removal.
New DS8000 shipments usually include the IBM System Storage Productivity Center (SSPC),
which is a storage system management console that includes Tivoli Storage Productivity
Center Basic Edition, which provides these functions:
Storage Topology Viewer
The ability to monitor capacity, alert, report, and provision storage
Status dashboard
IBM System Storage DS8000 GUI integration
The Tivoli Storage Productivity Center Basic Edition provided with the SSPC can be
upgraded to the full Tivoli Storage Productivity Center Standard Edition license if required,
which, like its module Tivoli Storage Productivity Center for Disk, enables performance
monitoring.
15.4.1 Monitoring the SAN Volume Controller with TPC for Disk
To configure Tivoli Storage Productivity Center (TPC) for Disk to monitor IBM SAN Volume
Controller, see SAN Volume Controller Best Practices and Performance Guidelines,
SG24-7521.
Tivoli Storage Productivity Center offers many disk performance reporting options that
support the SAN Volume Controller environment and also the storage controller back end for
various storage controller types. The following storage components are the most relevant for
collecting performance metrics when monitoring storage controller performance:
Subsystem
Controller
Array
Managed Disk
Managed Disk Group, or storage pool
Port
With the SAN Volume Controller, you can monitor on the levels of the I/O Group and the SAN
Volume Controller node.
SAN Volume Controller thresholds
Thresholds are used to determine watermarks for warning and error indicators for an
assortment of storage metrics. SAN Volume Controller has the following thresholds within its
default properties:
Volume (VDisk) I/O rate
Total number of virtual disk I/Os for each I/O Group
Volume (VDisk) bytes per second
Virtual disk bytes per second for each I/O Group
MDisk I/O rate
Total number of managed disk I/Os for each Managed Disk Group
MDisk bytes per second
Managed disk bytes per second for each Managed Disk Group
The default status for these properties is Disabled with the Warning and Error options set to
None. Enable a particular threshold only after the minimum values for warning and error
levels are defined.
Tip: In Tivoli Storage Productivity Center for Disk, default threshold warning or error values
of -1.0 are indicators that there is no suggested minimum value for the threshold and are
therefore entirely user-defined. You can choose to provide any reasonable value for these
thresholds based on the workload in your environment.
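As an illustration of how such thresholds are evaluated, consider the following sketch. The metric and threshold values are hypothetical; the actual evaluation is performed by Tivoli Storage Productivity Center.

# Sketch: evaluate a performance metric against user-defined warning and error
# thresholds. A threshold of -1.0 means "no suggested minimum"; here it is
# treated as "not set", as described in the Tip above.

def evaluate(value, warning, error):
    if error != -1.0 and value >= error:
        return "error"
    if warning != -1.0 and value >= warning:
        return "warning"
    return "normal"

# MDisk I/O rate for a Managed Disk Group, with made-up threshold values.
print(evaluate(value=4200.0, warning=3000.0, error=5000.0))   # -> warning
print(evaluate(value=4200.0, warning=-1.0, error=-1.0))       # -> normal (not set)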
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS2618
15.4.3 The TPC Storage Tiering Reports and Storage Performance Optimizer
Starting with Tivoli Storage Productivity Center V4.2.2, Tivoli Storage Productivity Center
provides capabilities for reporting on storage tiering activity to support data placement and to
optimize resource utilization in a virtualized environment. The storage tiering reports use the
estimated capability and actual performance data for the IBM SAN Volume Controller and
offer storage administrators key insights:
Are the back-end systems optimally utilized?
Does moving a certain workload to low-cost storage affect service levels?
How do I level out performance in a certain pool?
Which data groups can be moved to an alternate tier of storage?
So, the Storage Tiering Reports combine storage virtualization and Tivoli Storage Productivity
Center information to help users make smart decisions. The reports provide the capability to
make “proactive” volume placement decisions. They take into account back-end storage
configuration characteristics for analytic modeling. Predefined reports in Cognos® on storage
pools (Managed Disk Groups) or the virtual volumes provide details about the hottest and
coolest resources, as well as detailed performance and historic capacity reports.
The following paper describes the storage tiering reports of the SAN Volume Controller:
http://www.ibm.com/support/docview.wss?uid=swg27023263
15.5 Sharing the DS8000 between a server and the SAN Volume
Controller
The DS8000 can be shared between servers and a SAN Volume Controller. This sharing can
be useful if you want direct attachment for specific Open Systems servers or if you need to
share your DS8000 between the SAN Volume Controller and an unsupported server, such as
System z. Also, this option might be appropriate for IBM i, which is only supported through
VIOS with SAN Volume Controller.
For the latest list of hardware that is supported for attachment to the SAN Volume Controller,
see this website:
http://www.ibm.com/systems/storage/software/virtualization/svc/interop.html
15.5.1 Sharing the DS8000 between Open Systems servers and the IBM SAN
Volume Controller
If you have a mixed environment that includes IBM SAN Volume Controller and Open
Systems servers, we suggest sharing as many of the DS8000 resources as possible between
both environments.
If an extent pool has multiple ranks, it is possible to create all SAN Volume Controller volumes
by using the rotate volumes algorithm, which provides one larger MDisk per rank. Server
volumes can use the rotate extents algorithm if desired. You can also use the rotate extents
method with the SAN Volume Controller volumes for acceptable performance. If DS8000
Easy Tier is enabled and manages the extent pools, it auto-rebalances all DS8000 volumes,
whether they are SAN Volume Controller MDisks or direct server volumes.
Most clients choose a DS8000 extent pool pair (or pairs) for their SAN Volume Controller
volumes only, and other extent pool pairs for their directly attached servers. This approach is
a preferred practice, but you can fully share on the drive level if preferred.
I/O Priority Manager works on the level of full DS8000 volumes. So, when you have large
MDisks, which are the DS8000 volumes, I/O Priority Manager cannot prioritize between
various smaller VDisk volumes that are cut out of these MDisks. I/O Priority Manager enables
SAN Volume Controller volumes as a whole to be assigned different priorities compared to
other direct server volumes. For instance, if IBM i mission-critical applications with directly
attached volumes share extent pools with SAN Volume Controller MDisk volumes
(uncommon), the I/O Priority Manager can throttle the complete SAN Volume Controller
volumes in I/O contention to protect the IBM i application performance.
IBM supports sharing a DS8000 between a SAN Volume Controller and an Open Systems
server. However, if a DS8000 port is in the same zone as a SAN Volume Controller port, that
same DS8000 port must not be in the same zone as another server.
15.5.2 Sharing the DS8000 between System z servers and the IBM SAN
Volume Controller
IBM SAN Volume Controller does not support System z server attachment. If you have a
mixed server environment that includes IBM SAN Volume Controller and System z servers,
you must share your DS8000 to provide a direct access to System z volumes and access to
Open Systems server volumes through the IBM SAN Volume Controller.
In this case, you must split your DS8000 resources between two environments. You must
create part of the ranks by using the count key data (CKD) format (used for System z access)
and the other ranks in Fixed Block (FB) format (used for IBM SAN Volume Controller access).
In this case, both environments get performance that is related to the allocated DS8000
resources.
A DS8000 port does not support a shared attachment between System z and IBM SAN
Volume Controller. System z servers use the Fibre Channel connection (FICON), and IBM
SAN Volume Controller supports Fibre Channel Protocol (FCP) connection only.
Because of the availability of cache-disabled VDisks in the SAN Volume Controller, you can
enable Copy Services in the underlying RAID array controller for these LUNs.
Cache-disabled VDisks are primarily used when virtualizing an existing storage infrastructure,
and you need to retain the existing storage system Copy Services. You might want to use
cache-disabled VDisks where there is significant intellectual capital in existing Copy Services
automation scripts. We suggest that you keep the use of cache-disabled VDisks to a
minimum for normal workloads.
Another case where you might need to use cache-disabled VDisks is with servers, such as
System z or non-VIOS IBM i, that are not supported by the SAN Volume Controller, but you
need to maintain a single Global Mirror session for consistency between all servers. In this
case, the DS8000 Global Mirror must be able to manage the LUNs for all server systems.
Because the SAN Volume Controller does not stripe VDisks, in this case it is an advantage to
use extent pools with multiple ranks to allow volumes to be created by using the rotate extents
algorithm, or managed extent pools. The guidelines for the use of different DA pairs for
FlashCopy source and target LUNs with affinity to the same DS8000 server also apply.
Cache-disabled VDisks can also be used to control the allocation of cache resources. By
disabling the cache for certain VDisks, more cache resources are available to cache I/Os to
other VDisks in the same I/O Group. This technique is effective where an I/O Group serves
VDisks that can benefit from cache and other VDisks where the benefits of caching are small
or non-existent.
The DS8000 Copy Services functions are rarely used when a SAN Volume Controller is in
place. Usually, in SAN Volume Controller attachments, the DS8000 is provided without any
DS8000 Copy Services license features, only the Operating Environment License (OEL) and
Easy Tier. The SAN Volume Controller Copy Services features and algorithms are used for
replication to a secondary SAN Volume Controller site.
Follow the guidelines and procedures outlined in this section to make the most of the
performance available from your DS8000 storage systems and to avoid potential I/O
problems:
Use multiple host adapters on the DS8000. Where possible, use no more than two ports
on each card. Use a larger number of ports on the DS8000, usually 16 (SAN Volume
Controller maximum).
Unless you have special requirements, or if in doubt, build your MDisk volumes from large
extent pools on the DS8000.
If using a 1:1 mapping of ranks to DS8000 extent pools, use one, or a maximum of two
volumes on this rank, and adjust the MDisk volume size for this 1:1 mapping.
Create fewer and larger SAN Volume Controller storage pools and have multiple MDisks in
each pool.
For SAN Volume Controller releases before 6.3, do not create too few MDisk volumes. If
you have too few MDisk volumes, the older SAN Volume Controller releases do not use all
offered ports.
Keep many DS8000 arrays active.
Ensure that you have an equal number of extent pools and, as far as possible, spread the
volumes equally across the device adapters and the two processor complexes of the
DS8000 storage system.
In a storage pool, ensure that for a certain tier, all MDisks have the same capacity and
RAID/rpm characteristics.
Do not mix HDD MDisks from different controllers in the same storage pool.
For Metro Mirror configurations, always use DS8000 MDisks with similar characteristics for
both the master VDisk volume and the auxiliary volume.
Spread the VDisk volumes across all SAN Volume Controller nodes, and check for
balanced preferred node assignments.
In the SAN, use a dual fabric.
Use multipathing software in the servers.
Consider DS8000 Easy Tier auto-rebalancing for DS8000 homogeneous capacities.
When using Easy Tier in the DS8000, consider a SAN Volume Controller extent size of at
least 1 GiB (1024 MiB), so that the skew is not taken away from the DS8000 extents. Consider
smaller SAN Volume Controller extent sizes, such as 256 or 512 MiB, only when internal
SSDs are present in the SAN Volume Controller nodes, to use this SSD space better,
because then the SAN Volume Controller Easy Tier is used.
When using DS8000 Easy Tier, leave some small movement space empty in the extent
pools to help it start working.
Consider the right amount of cache, as explained in 2.2.2, “Determining the right amount
of cache storage” on page 33. Usually, SAN Volume Controller installations have a
DS8000 cache of not less than 128 GB.
Each SSD MDisk goes into one pool, determining how many storage pools can benefit from
SAN Volume Controller Easy Tier. The SSD size determines the granularity of the offered
SSD capacity.
Another argument against this concept is that the ports that handle the traffic to the SSD-only
storage system experience exceptionally high workloads.
When using the internal SSDs in the SAN Volume Controller nodes, only Easy Tier performed
by the SAN Volume Controller is possible for the inter-tier movements between SSD and HDD
tiers. The DS8000 intra-tier auto-rebalancing can still be used: it monitors the usage of all
the HDD ranks and moves load within the tier if some ranks are more loaded than others. If you
have Enterprise and nearline drives in the DS8000, you can also combine the DS8000
managed inter-tier Easy Tier (only for the 2-tier DS8000 HDD extent pool) with the SAN
Volume Controller managed overall Easy Tier between the SAN Volume Controller SSDs and
the 2-tier HDD pool, out of which you cut the HDD MDisks. Two tiers of HDDs justify DS8000
managed Easy Tier, at least for the HDD part.
When you have the SSDs in the DS8000, together with Enterprise HDDs and possibly also
nearline HDDs, on which level do you perform the overall inter-tier Easy Tiering? It can either
be done by the SAN Volume Controller, by setting the generic_ssd attribute for all the
DS8000 SSD MDisks, or you can leave the generic_hdd attribute on all MDisks and allow
DS8000 Easy Tier to manage them, offering 2-tier or 3-tier MDisks to the SAN Volume
Controller that contain some SSDs (which are invisible to the SAN Volume Controller).
There are differences between the Easy Tier algorithms in the DS8000 and in SAN Volume
Controller. The DS8000 is in the third generation of Easy Tier, with additional functions
available, such as Extended-Cold-Demote or Warm-Demote. The warm-demote checking
reacts if certain SSD ranks or SSD device adapters suddenly become overloaded. The SAN
Volume Controller needs to work with different vendors and varieties of SSD space offered,
and use a more generic algorithm, which cannot learn easily whether the SSD rank of a
certain vendor’s disk system is approaching its limits.
As a rule when you use SSDs in the DS8000 and use many or even heterogeneous storage
systems, consider implementing cross-tier Easy Tier on the highest level, that is, managed by
the SAN Volume Controller. SAN Volume Controller Version 6.1 and higher can use larger
blocksizes, such as 60K and over, which do not work well for DS8000 Easy Tier, so we have
another reason to use SAN Volume Controller Easy Tier inter-tiering. However, observe the
system by using the STAT for SAN Volume Controller. If the SSD space gets overloaded,
consider either adding more SSDs as suggested by the STAT, or removing and reserving
some of the SSD capacity so that it is not fully utilized by SAN Volume Controller, by creating
smaller SSD MDisks and leaving empty space there.
If you have one main machine behind the SAN Volume Controller, leave the Easy Tier
inter-tiering to the DS8000 logic. Use the more sophisticated DS8000 Easy Tier algorithms
that consider the sudden overload conditions of solid-state drive (SSD) ranks or adapters.
DS8000 Easy Tier algorithms have more insight into the DS8000 thresholds and what
workload limit each component of the storage system can sustain. Choose a SAN Volume
Controller extent size of 1 GiB to not eliminate the skew for the DS8000 tiering, which is on
the 1 GiB extent level.
We advise that you use the most current level of SAN Volume Controller before you start SAN
Volume Controller managed Easy Tier and that you use recent SAN Volume Controller node
hardware.
Figure 16-1 illustrates the basic components of data deduplication for IBM System Storage
TS7600 ProtecTIER servers.
With data deduplication, data is read by the data deduplication product while it looks for
duplicate data. Different data deduplication products use different methods of breaking up the
data into elements, but each product uses a technique to create a signature or identifier for
each data element. After the duplicate data is identified, one copy of each element is
retained, pointers are created for the duplicate items, and the duplicate items are not stored.
The effectiveness of data deduplication depends on many variables, including the rate of
data change, the number of backups, and the data retention period. For example, if you back
up the same incompressible data once a week for six months, you save the first copy and do
not save the next 24. This example provides a 25:1 data deduplication ratio. If you back up an
incompressible file on week one, back up the same file again on week two, and never back it
up again, you have a 2:1 deduplication ratio.
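The two examples above reduce to a simple ratio of the data presented to the deduplication engine versus the data actually stored. A small sketch with illustrative numbers only:

# Sketch: nominal data deduplication ratio for repeated full backups of
# unchanged, incompressible data. Only the first copy is stored; later
# backups are replaced by pointers.

def dedup_ratio(total_backups, stored_copies=1):
    return total_backups / stored_copies

# Weekly backups for six months (about 25 backups of the same data): 25:1.
print(dedup_ratio(25))   # 25.0
# Two backups of the same file, never backed up again: 2:1.
print(dedup_ratio(2))    # 2.0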
The IBM System Storage TS7650G is a preconfigured virtualization solution of IBM systems
and the IBM ProtecTIER data deduplication software, which is designed to improve backup
and recovery operations. The solution is available in single-node or two-node cluster
configurations designed to meet the disk-based data protection needs of a wide variety of IT
environments and data centers. The TS7650G ProtecTIER Deduplication Gateway can scale
to repositories in the petabyte (PB) range, and all DS8000 models are supported behind it.
Your DS8000 can become a Virtual Tape Library (VTL). The multi-node concepts help
achieve higher throughput and availability for the backup, and replication concepts are
available.
ProtecTIER access patterns, though, can have a high random-read content, with some
random-write ratio for the metadata. Therefore, Enterprise drives, with their higher rotational
speeds, and other RAID types outperform nearline drives also for ProtecTIER attachments.
You can format the 2 TB 7,200 rpm Serial Advanced Technology Attachment (SATA) drives of
a DS8700 in RAID 10. However, because Easy Tier automatically tiers storage between
Enterprise and nearline drives, we suggest that you use a drive mix of Enterprise and
nearline drives and let Easy Tier optimize between both tiers for you. Leave a small
movement space empty so that Easy Tier can start. The two tiers can have different RAID
formats. With tiering in the DS8000, the ProtecTIER data that is more random-write-oriented
automatically is promoted to the most suitable drive tier. For the overall sizing and choice of
drives between Enterprise and nearline HDDs, consult your IBM representative.
You can obtain additional information about IBM DB2 and IMS at these websites:
http://www.ibm.com/software/data/db2/zos/family/
http://www.ibm.com/software/data/db2/linux-unix-windows/
http://www.ibm.com/software/data/ims/
OLTP systems process the day-to-day operation of businesses and, therefore, have strict
user response and availability requirements. They also have high throughput requirements
and are characterized by large numbers of database inserts and updates. They typically
serve hundreds, or even thousands, of concurrent users.
DSS systems typically deal with substantially larger volumes of data than OLTP systems due
to their role in supplying users with large amounts of historical data. While 100 GB of data is
considered large for an OLTP environment, a large DSS system might be 1 TB of data or
more. The increased storage requirements of DSS systems can also be attributed to the fact
that they often contain multiple, aggregated views of the same data.
While OLTP queries are mostly related to one specific business function, DSS queries are
often substantially more complex. The need to process large amounts of data results in many
CPU-intensive database sort and join operations. The complexity and variability of these
types of queries must be given special consideration when estimating the performance of a
DSS system.
Data tablespaces can be divided into two groups: system tablespaces and user tablespaces.
Both of these tablespaces have identical data attributes. The difference is that system
tablespaces are used to control and manage the DB2 subsystem and user data. System
tablespaces require the highest availability and special considerations. User data cannot be
accessed if the system data is not available.
In addition to data tablespaces, DB2 requires a group of traditional datasets not associated
with tablespaces that DB2 uses to provide data availability: the backup and recovery
datasets.
The following sections describe the objects and datasets that DB2 uses.
TABLE
All data managed by DB2 is associated to a table. The table is the main object used by DB2
applications.
TABLESPACE
A tablespace is used to store one or more tables. A tablespace is physically implemented with
one or more datasets. Tablespaces are VSAM linear datasets (LDS). Because tablespaces
can be larger than the largest possible VSAM dataset, a DB2 tablespace can require more
than one VSAM dataset.
INDEX
A table can have one or more indexes (or can have no index). An index contains keys. Each
key points to one or more data rows. The purpose of an index is to get direct and faster
access to the data in a table.
DATABASE
The database is a DB2 representation of a group of related objects. Each of the previously
named objects must belong to a database. DB2 databases are used to organize and manage
these objects.
STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases,
tablespaces, or index spaces when using DB2 managed objects. DB2 uses STOGROUPs for
disk allocation of the table and index spaces.
Application tablespaces and index spaces are VSAM LDSs with the same attributes as DB2
system tablespaces and index spaces. System and application data differ only because they
have different performance and availability requirements.
You can intermix tables and indexes and also system, application, and recovery datasets on
the DS8000 ranks. The overall I/O activity is then more evenly spread, and I/O skews are
avoided.
The results in Figure 17-1 were measured by a DB2 I/O benchmark. They show random 4 KB
read throughput and response times. The SSD response times are low across the curve.
They are lower than the minimum HDD response time for all data points.
Figure 17-1 DB2 on count key data (CKD) random read throughput/response time curve
http://www.ibm.com/systems/z/os/zos/downloads/flashda.html
The FLASHDA user guide is available at this website:
http://publibz.boulder.ibm.com/zoslib/pdf/flashda.pdf
Based on the output report of these tools, you can select which hot volumes might benefit
most when migrated to SSDs. The tool output also provides the hot data at the dataset level.
Based on this data, the migration to the SSD ranks can be done by dataset by using the
appropriate z/OS tools.
VSAM data striping addresses this problem with two modifications to the traditional data
organization:
The records are not placed in key ranges along the volumes; instead, they are organized
in stripes.
Parallel I/O operations are scheduled to sequential stripes in different volumes.
By striping data, the VSAM control intervals (CIs) are spread across multiple devices. This
format allows a single application request for records in multiple tracks and CIs to be satisfied
by concurrent I/O requests to multiple volumes.
The result is improved data transfer to the application. The scheduling of I/O to multiple
volumes to satisfy a single application request is referred to as an I/O path packet.
We can stripe across ranks, device adapters, servers, and the DS8000s.
In a DS8000 with storage pool striping, the implementation of VSAM striping still provides a
performance benefit. Because DB2 uses two engines for the list prefetch operation, VSAM
striping increases the parallelism of DB2 list prefetch I/Os. This parallelism exists with respect
to the channel operations as well as the disk access.
If you plan to enable VSAM I/O striping, see DB2 9 for z/OS Performance Topics,
SG24-7473-00.
Measurements that are oriented to determine how large volumes can affect DB2 performance
show that similar response times can be obtained by using larger volumes compared to using
the smaller 3390-3 standard-size volumes. See 14.7.2, “Larger volume compared to smaller
volume performance” on page 508 for a discussion.
Examples of DB2 applications that benefit from MIDAWs are DB2 prefetch and DB2 utilities.
As more data is prefetched, more disks are employed in parallel. Therefore, high throughput
is achieved by employing parallelism at the disk level. In addition to enabling one sequential
stream to be faster, AMP also reduces disk thrashing when there is disk contention.
The information presented in this section is further discussed in detail in (and liberally
borrowed from) the book, IBM ESS and IBM DB2 UDB Working Together, SG24-6262. Many
of the concepts presented are applicable to the DS8000. We highly suggest this book.
However, based on client solution experiences using SG24-6262, there are two corrections
that we want to point out:
In IBM ESS and IBM DB2 UDB Working Together, SG24-6262, section 3.2.2, “Balance
workload across ESS resources,” suggests that a data layout policy must be established
that allows partitions and containers within partitions to be spread evenly across ESS
resources. It further suggests that you can choose either a horizontal mapping, in which
every partition has containers on every available ESS rank, or a vertical mapping in which
DB2 partitions are isolated to specific arrays, with containers spread evenly across those
ranks. We now suggest the vertical mapping approach. The vertical isolated storage
approach is typically easier to configure, manage, and diagnose if problems arise in
production.
Another data placement consideration suggests that it is important to place frequently
accessed files in the space allocated from the middle of an array. This suggestion was an
error in the original publication. The intent of the section was to discuss how the
placement considerations commonly used with non-RAID older disk technology have less
significance in ESS environments.
Vertical data mapping approach: Based on experience, we now suggest a vertical data
mapping approach (shared nothing between data partitions). We also want to emphasize
that you must not try to micromanage data placement on storage.
The database object that maps the physical storage is the tablespace. Figure 17-2 illustrates
how DB2 UDB is logically structured and how the tablespace maps the physical object.
Figure 17-2 DB2 UDB logical structure: instances contain databases, and databases contain tablespaces
(SMS or DMS) where tables, indexes, and long data are stored; each SMS container is a directory in the file
space of the operating system, and each DMS container is a fixed, pre-allocated file or a physical device
such as a disk
Instances
An instance is a logical database manager environment where databases are cataloged and
configuration parameters are set. An instance is similar to an image of the actual database
manager environment. You can have several instances of the database manager product on
the same database server. You can use these instances to separate the development
environment from the production environment, tune the database manager to a particular
environment, and protect sensitive information from a particular group of users.
For database partitioning features (DPF) of the DB2 Enterprise Server Edition (ESE), all data
partitions reside within a single instance.
Databases
A relational database structures data as a collection of database objects. The primary
database object is the table (a defined number of columns and any number of rows). Each
database includes a set of system catalog tables that describe the logical and physical
structure of the data, configuration files that contain the parameter values allocated for the
database, and recovery logs.
Database partitions
A partition number in DB2 UDB terminology is equivalent to a data partition. Databases with
multiple data partitions that reside on a symmetric multiprocessor (SMP) system are also
called multiple logical node (MLN) databases.
Partitions are identified by the physical system where they reside as well as by a logical port
number with the physical system. The partition number, which can be 0 - 999, uniquely
defines a partition. Partition numbers must be in ascending sequence (gaps in the sequence
are allowed).
The configuration information of the database is stored in the catalog partition. The catalog
partition is the partition from which you create the database.
Partitiongroups
A partitiongroup is a set of one or more database partitions. For non-partitioned
implementations (all editions except for DPF), the partitiongroup is always made up of a
single partition.
Partitioning map
When a partitiongroup is created, a partitioning map is associated to it. The partitioning map,
in conjunction with the partitioning key and hashing algorithm, is used by the database
manager to determine which database partition in the partitiongroup stores a specific row of
data. Partitioning maps do not apply to non-partitioned databases.
Containers
A container is the way of defining where on the storage device the database objects are
stored. Containers can be assigned from filesystems by specifying a directory. These
containers are identified as PATH containers. Containers can also reference files that reside
within a directory. These containers are identified as FILE containers, and a specific size must
be identified. Containers can also reference raw devices. These containers are identified as
DEVICE containers, and the device must exist on the system before the container can be
used.
All containers must be unique across all databases; a container can belong to only one
tablespace.
Tablespaces
A database is logically organized in tablespaces. A tablespace is a place to store tables. To
spread a tablespace over one or more disk devices, you specify multiple containers.
For partitioned databases, the tablespaces reside in partitiongroups. In the create tablespace
command execution, the containers themselves are assigned to a specific partition in the
partitiongroup, thus maintaining the shared nothing character of DB2 UDB DPF.
When creating a table, you can choose to have certain objects, such as indexes and large
object (LOB) data, stored separately from the rest of the table data, but you must define this
table to a DMS tablespace.
Indexes are defined for a specific table and assist in the efficient retrieval of data to satisfy
queries. They can also be used to assist in the clustering of data.
Large objects (LOBs) can be stored in columns of the table. These objects, although logically
referenced as part of the table, can be stored in their own tablespace when the base table is
defined to a DMS tablespace. This approach allows for more efficient access of both the LOB
data and the related table data.
Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These
discrete blocks are called pages, and the memory reserved to buffer a page transfer is called
an I/O buffer. DB2 UDB supports various page sizes, including 4 KB, 8 KB, 16 KB, and 32 KB.
When an application accesses data randomly, the page size determines the amount of data
transferred. This size corresponds to the size of the data transfer request to the DS8000,
which is sometimes referred to as the physical record.
Sequential read patterns can also influence the page size that is selected. Larger page sizes
for workloads with sequential read patterns can enhance performance by reducing the
number of I/Os.
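To make this trade-off concrete, the following sketch compares how many I/O requests a sequential scan needs at the different DB2 page sizes. The scan size is an arbitrary example value.

# Sketch: number of page-sized I/O requests needed to read a sequential scan
# of a given size. Larger pages reduce the I/O count for sequential patterns.

import math

PAGE_SIZES_KB = (4, 8, 16, 32)   # page sizes supported by DB2 UDB

def ios_for_scan(scan_mb, page_kb):
    return math.ceil(scan_mb * 1024 / page_kb)

for page_kb in PAGE_SIZES_KB:
    print(f"{page_kb:2d} KB pages: {ios_for_scan(512, page_kb):,} I/Os for a 512 MB scan")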
Extents
An extent is a unit of space allocation within a container of a tablespace for a single
tablespace object. This allocation consists of multiple pages. The extent size (number of
pages) for an object is set when the tablespace is created:
An extent is a group of consecutive pages defined to the database.
The data in the tablespaces is striped by extent across all the containers in the system.
Buffer pools
A buffer pool is main memory allocated on the host processor to cache table and index data
pages as they are read from disk or modified. The purpose of the buffer pool is to improve
system performance. Data can be accessed much faster from memory than from disk;
therefore, the fewer times that the database manager needs to read from or write to disk (I/O),
the better the performance. Multiple buffer pools can be created.
Sequential prefetch reads consecutive pages into the buffer pool before they are needed by
DB2. List prefetches are more complex. In this case, the DB2 optimizer optimizes the retrieval
of randomly located data.
The amount of data that is prefetched determines the amount of parallel I/O activity.
Ordinarily, the database administrator defines a prefetch value large enough to allow parallel
use of all of the available containers.
Page cleaners
Page cleaners are present to make room in the buffer pool before prefetchers read pages on
disk storage and move them into the buffer pool. For example, if a large amount of data is
updated in a table, many data pages in the buffer pool might be updated but not written into
disk storage (these pages are called dirty pages). Because prefetchers cannot place fetched
data pages onto the dirty pages in the buffer pool, these dirty pages must be flushed to disk
storage and become clean pages so that prefetchers can place fetched data pages from disk
storage.
Logs
Changes to data pages in the buffer pool are logged. Agent processes, which are updating a
data record in the database, update the associated page in the buffer pool and write a log
record to a log buffer. The written log records in the log buffer are flushed into the log files
asynchronously by the logger.
To optimize performance, the updated data pages in the buffer pool and the log records in the
log buffer are not written to disk immediately. The updated data pages in the buffer pool are
written to disk by page cleaners and the log records in the log buffer are written to disk by the
logger.
Parallel operations
DB2 UDB extensively uses parallelism to optimize performance when accessing a database.
DB2 supports several types of parallelism, including query and I/O parallelism.
Query parallelism
There are two dimensions of query parallelism: inter-query parallelism and intra-query
parallelism. Inter-query parallelism refers to the ability of multiple applications to query a
database at the same time. Each query executes independently of the other queries, but they
are all executed at the same time. Intra-query parallelism refers to the simultaneous
processing of parts of a single query, by using intra-partition parallelism, inter-partition
parallelism, or both:
Intra-partition parallelism subdivides what is considered a single database operation, such
as index creation, database loading, or SQL queries, into multiple parts, many or all of
which can be run in parallel within a single database partition.
Inter-partition parallelism subdivides what is considered a single database operation, such
as index creation, database loading, or SQL queries, into multiple parts, many or all of
which can be run in parallel across multiple partitions of a partitioned database on one
machine or on multiple machines. Inter-partition parallelism applies to DPF only.
I/O parallelism
When there are multiple containers for a tablespace, the database manager can use parallel
I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O devices
simultaneously. Parallel I/O can result in significant improvements in throughput.
DB2 implements a form of data striping by spreading the data in a tablespace across multiple
containers. In storage terminology, the part of a stripe that is on a single device is a strip. The
DB2 term for strip is extent. If your tablespace has three containers, DB2 writes one extent to
container 0, the next extent to container 1, the next extent to container 2, and then back to
container 0. The stripe width (a generic term not often used in DB2 literature) is equal to the
number of containers, or three in this case.
Containers for a tablespace are ordinarily placed on separate physical disks, allowing work to
be spread across those disks, and allowing disks to operate in parallel. Because the DS8000
logical disks are striped across the rank, the database administrator can allocate DB2
containers on separate logical disks that reside on separate DS8000 arrays. This approach
takes advantage of the parallelism both in DB2 and in the DS8000. For example, four DB2
containers that reside on four DS8000 logical disks on four different 7+P ranks have data
spread across 32 physical disks.
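The following DB2 CLP sketch shows one possible implementation of this layout. It assumes four JFS2 filesystems (/db2/rank0 through /db2/rank3), each on an AIX logical volume that resides on a DS8000 logical disk from a different rank, and it reuses the hypothetical BP16K buffer pool from the previous example; all names and sizes are illustrative only:

   # One DMS FILE container per DS8000 rank; 640,000 pages of 16 KB is about 10 GB each
   db2 "CREATE TABLESPACE TS_DATA
          PAGESIZE 16 K
          MANAGED BY DATABASE
          USING (FILE '/db2/rank0/ts_data_c0' 640000,
                 FILE '/db2/rank1/ts_data_c1' 640000,
                 FILE '/db2/rank2/ts_data_c2' 640000,
                 FILE '/db2/rank3/ts_data_c3' 640000)
          EXTENTSIZE 16
          PREFETCHSIZE 64
          BUFFERPOOL BP16K"
   # EXTENTSIZE 16 x 16 KB pages = 256 KB, a multiple of the DS8000 strip granularity
   # PREFETCHSIZE 64 = extent size (16 pages) x 4 containers (one per rank)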
If you want optimal performance from the DS8000, do not treat it like a black box. Establish a
storage allocation policy that allocates data by using several DS8000 ranks. Understand how
DB2 tables map to underlying logical disks, and how the logical disks are allocated across the
DS8000 ranks. One way of making this process easier to manage is to maintain a modest
number of DS8000 logical disks.
As a result, you can balance activity across the DS8000 resources by following these rules:
Span the DS8000 storage units.
Span ranks (RAID arrays) within a storage unit.
Engage as many arrays as possible.
Figure 17-3 on page 574 illustrates this technique for a single tablespace that consists of
eight containers.
Figure 17-3 Allocating DB2 containers by using a “spread your data” approach
Look again at Figure 17-3. In this case, we stripe across arrays, disk adapters, clusters, and
DS8000s, which can all be done by using the striping capabilities of the DB2 container and
shared nothing concept. This approach eliminates the need to employ AIX logical volume
striping.
Page size
Page sizes are defined for each tablespace. There are four supported page sizes: 4 K, 8 K, 16
K, and 32 K. The following factors affect the choice of page size:
The maximum number of records per page is 255. To avoid wasting space on a page, do
not make page size greater than 255 times the row size plus the page overhead.
The maximum size of a tablespace is proportional to the page size of its tablespace. In
SMS, the data and index objects of a table have limits, as shown in Table 17-1. In DMS,
these limits apply at the tablespace level.
Table 17-1 Page size relative to tablespace size
Page size Maximum data/index object size
4 KB 64 GB
8 KB 128 GB
16 KB 256 GB
32 KB 512 GB
Select a page size that can accommodate the total expected growth requirements of the
objects in the tablespace.
For OLTP applications that perform random row read and write operations, a smaller page
size is preferable, because it wastes less buffer pool space with unwanted rows. For DSS
applications that access large numbers of consecutive rows at a time, a larger page size is
better, because it reduces the number of I/O requests that are required to read a specific
number of rows.
Tip: Experience indicates that page size can be dictated to a certain degree by the type of
workload. For pure OLTP workloads, we suggest a 4 KB page size. For a pure DSS
workload, we suggest a 32 KB page size. For a mixture of OLTP and DSS workload
characteristics, we suggest either an 8 KB page size or a 16 KB page size.
Extent size
If you want to stripe across multiple arrays in your DS8000, assign a LUN from each rank to
be used as a DB2 container. During writes, DB2 writes one extent to the first container and
the next extent to the second container until all eight containers are addressed before cycling
back to the first container. DB2 stripes across containers at the tablespace level.
Because the DS8000 stripes at a fairly fine granularity (256 KB), selecting multiples of 256 KB
for the extent size ensures that multiple DS8000 disks are used within a rank when a DB2
prefetch occurs. However, keep your extent size below 1 MB.
I/O performance is fairly insensitive to the selection of extent sizes, mostly because the
DS8000 employs sequential detection and prefetch. For example, even if you select an extent
size, such as 128 KB, which is smaller than the full array width (it accesses only four disks in
the array), the DS8000 sequential prefetch keeps the other disks in the array busy.
Prefetch size
The tablespace prefetch size determines the degree to which separate containers can
operate in parallel.
Prefetch size is tunable: it can be altered after the tablespace is defined and the data is loaded. This flexibility does not apply to the extent size and the page size, which are set at tablespace creation time and cannot be altered without redefining the tablespace and reloading the data.
Tip: The prefetch size must be set so that as many arrays as wanted can be working on
behalf of the prefetch request. For other than the DS8000, the general suggestion is to
calculate prefetch size to be equal to a multiple of the extent size times the number of
containers in your tablespace. For the DS8000, you can work with a multiple of the extent
size times the number of arrays underlying your tablespace.
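Because the prefetch size can be changed online, it can be adjusted later without redefining the tablespace. A minimal sketch with the hypothetical tablespace from the earlier example follows:

   # Raise the prefetch size to extent size (16 pages) x 8 underlying arrays = 128 pages
   db2 "ALTER TABLESPACE TS_DATA PREFETCHSIZE 128"
   # Alternatively, let DB2 derive the prefetch size from the container layout
   db2 "ALTER TABLESPACE TS_DATA PREFETCHSIZE AUTOMATIC"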
The DS8000 supports a high degree of parallelism and concurrency on a single logical disk.
As a result, a single logical disk the size of an entire array achieves the same performance as
many smaller logical disks. However, you must consider how logical disk size affects both the
host I/O operations and the complexity of your systems administration.
Smaller logical disks provide more granularity, with their associated benefits. But smaller
logical disks also increase the number of logical disks seen by the operating system. Select a
DS8000 logical disk size that allows for granularity and growth without proliferating the
number of logical disks.
Take into account your container size and how the containers map to AIX logical volumes and
DS8000 logical disks. In the simplest situation, the container, the AIX logical volume, and the
DS8000 logical disk are the same size.
Tip: Try to strike a reasonable balance between flexibility and manageability for your
needs. Our suggestion is that you create no fewer than two logical disks in an array, and
the minimum logical disk size needs to be 16 GB. Unless you have a compelling reason,
standardize a unique logical disk size throughout the DS8000.
Smaller logical disk sizes have the following advantages and disadvantages:
Advantages of smaller size logical disks:
– Easier to allocate storage for different applications and hosts.
– Greater flexibility in performance reporting.
Disadvantages of smaller size logical disks:
Small logical disk sizes can contribute to the proliferation of logical disks, particularly in
SAN environments and large configurations. Administration gets complex and confusing.
Larger logical disk sizes have the following advantages and disadvantages:
Advantages of larger size logical disks:
– Simplifies understanding of how data maps to arrays.
Examples
Assume a 6+P array with 146 GB disk drives. You want to allocate disk space on your
16-array DS8000 as flexibly as possible. You can carve each of the 16 arrays into 32 GB
logical disks or logical unit numbers (LUNs), resulting in 27 logical disks per array (with a little
left over). This design yields a total of 16 x 27 = 432 LUNs. Then, you can implement 4-way
multipathing, which in turn makes 4 x 432 = 1728 hdisks visible to the operating system.
This approach creates an administratively complex situation, and, at every reboot, the
operating system queries each of those 1728 disks. Reboots might take a long time.
Alternatively, you create just 16 large logical disks. With multipathing and attachment of four
Fibre Channel ports, you have 4 x 16 = 128 hdisks visible to the operating system. Although
this number is large, it is more manageable, and reboots are much faster. After overcoming
that problem, you can then use the operating system logical volume manager (LVM) to carve
this space into smaller pieces for use.
There are problems with this large logical disk approach as well, however. If the DS8000 is
connected to multiple hosts or it is on a SAN, disk allocation options are limited when you
have so few logical disks. You must allocate entire arrays to a specific host, and if you want to
add additional space, you must add it in array-size increments.
17.5.6 Multipathing
Use the DS8000 multipathing along with DB2 striping to ensure the balanced use of Fibre
Channel paths.
Multipathing is the hardware and software support that provides multiple avenues of access
to your data from the host computer. You need to provide at least two Fibre Channel paths
from the host computer to the DS8000. Paths are defined by the number of host adapters on
the DS8000 that service the LUNs of a certain host system, the number of Fibre Channel host
bus adapters on the host system, and the SAN zoning configuration. The total number of paths must also take the throughput requirements of the host system into account. If the host system requires more than 400 MBps (2 x 200 MBps) of throughput, two host bus adapters are not adequate.
The DS8000 multipathing requires the installation of multipathing software. For AIX, you have
two choices: Subsystem Device Driver Path Control Module (SDDPCM) or the IBM
Subsystem Device Driver (SDD). For AIX, we suggest SDDPCM. We describe these products
in Chapter 9, “Performance considerations for UNIX servers” on page 327 and Chapter 8,
“Host attachment” on page 311.
There are several benefits you receive from using multipathing: higher availability, higher
bandwidth, and easier management. A high availability implementation is one in which your
application can still access data by using an alternate resource if a component fails. Easier
performance management means that the multipathing software automatically balances the
workload across the paths.
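The following AIX commands sketch how to verify that the expected number of paths is available and that the load is balanced across them; the hdisk name is hypothetical:

   # With SDDPCM: list the MPIO paths and their state for each DS8000 hdisk
   pcmpath query device
   # With SDD: list the vpath devices and the per-path I/O statistics
   datapath query device
   datapath query devstats
   # Base AIX view of the paths of a single hdisk
   lspath -l hdisk4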
IMS Database Manager provides functions for preserving the integrity of databases and
maintaining the databases. It allows multiple tasks to access and update the data, while
ensuring the integrity of the data. It also provides functions for reorganizing and restructuring
the databases.
The IMS databases are organized internally by using a number of IMS internal database
organization access methods. The database data is stored on disk storage by using the
normal operating system access methods.
During IMS execution, all information necessary to restart the system in the event of a failure
is recorded on a system log dataset. The IMS logs are made up of the following information.
The online log datasets (OLDS) are made up of multiple datasets that are used in a wraparound manner. At least three datasets must be allocated for the OLDS to allow IMS to start, while an upper limit of 100 datasets is supported.
Only complete log buffers are written to OLDS to enhance performance. If any incomplete
buffers need to be written out, they are written to the write ahead datasets (WADS).
When IMS processing requires writing a partially filled OLDS buffer, a portion of the buffer is
written to the WADS. If IMS or the system fails, the log data in the WADS is used to terminate
the OLDS, which can be done as part of an emergency restart, or as an option on the IMS
Log Recovery Utility.
The WADS space is continually reused after the appropriate log data is written to the OLDS.
This dataset is required for all IMS systems, and must be pre-allocated and formatted at IMS
startup when first used.
When using a DS8000 with storage pool striping, define the WADS volumes as 3390-Mod.1
and allocate them consecutively so that they are allocated to different ranks.
If you want optimal performance from the DS8000, do not treat it like a “black box.”
Understand how your IMS datasets map to underlying volumes, and how the volumes map to
RAID arrays.
You can intermix IMS databases and log datasets on the DS8000 ranks. The overall I/O
activity is more evenly spread, and I/O skews are avoided.
Measurements to determine how large volumes can affect IMS performance show that similar
response times can be obtained when using larger volumes as when using smaller 3390-3
standard-size volumes.
Figure 17-4 on page 581 illustrates the device response times when using thirty-two 3390-3
volumes compared to four large 3390-27 volumes on an ESS-F20 that uses FICON channels.
Even though we performed the benchmark on an ESS-F20, the results are similar on the
DS8000. The results show that with the larger volumes, the response times are similar to the
standard size 3390-3 volumes.
Figure 17-4 Device response time (ms) for thirty-two 3390-3 volumes compared to four 3390-27 volumes at total I/O rates of 2905 and 4407 IO/sec
This section focuses on Oracle I/O characteristics. Some memory and CPU considerations are also needed, but we assume that these considerations are addressed at the appropriate level according to your system specifications and planning.
Reviewing the following considerations can help you understand the Oracle I/O demand and
your DS8800/DS8700 planning for its use.
Although the instance is an important part of the Oracle components, our focus is on the
datafiles. OLTP workloads can benefit from SSDs combined with Easy Tier automatic mode
management to optimize performance. Furthermore, you also need to discuss segregation
and resource-sharing aspects when performing separate levels of isolation on the storage for
different components. Typically, in an Oracle database, you separate the redo logs and archive logs from the datafiles.
In a database, the disk part is considered the slowest component in the whole infrastructure.
You must plan to avoid reconfigurations and time-consuming performance problem
investigations when future problems, such as bottlenecks, might occur. However, as with all
I/O subsystems, good planning and data layout can make the difference between having
excellent I/O throughput and application performance, and having poor I/O throughput, high
I/O response times, and correspondingly poor application performance.
In many cases, I/O performance problems can be traced directly to “hot” files that cause a
bottleneck on some critical component, for example, a single physical disk. This problem can
occur even when the overall I/O subsystem is fairly lightly loaded. When bottlenecks occur,
storage or database administrators might need to identify and manually relocate the high
activity data files that contributed to the bottleneck condition. This problem solving tends to be
a resource-intensive and often frustrating task. As the workload content changes with the
daily operations of normal business cycles, for example, hour by hour through the business
day or day by day through the accounting period, bottlenecks can mysteriously appear and
disappear or migrate over time from one datafile or device to another.
In 4.7.1, “RAID-level performance considerations” on page 103, we reviewed the RAID levels
and their performance aspects. It is important to discuss the RAID levels, because some
datafiles can benefit from certain RAID levels, depending on their workload profile as shown
in Figure 4-6 on page 109. However, advanced storage architectures, for example, cache and
advanced cache algorithms, or even multi-tier configurations with Easy Tier automatic
management can make RAID level considerations less important.
For instance, with 15K rpm Enterprise disks and a significant amount of cache available on
the storage system, some environments might have similar performance on RAID 10 and
RAID 5, although mostly workloads with a high percentage of random write activity and high
I/O access densities generally benefit from RAID 10. RAID 10 benefits clients in single-tier
pools. RAID 10 takes advantage of Easy Tier automatic intra-tier performance management
(auto-rebalance) and constantly optimizes data placement across ranks based on rank
utilization in the extent pool.
However, by using hybrid pools with SSDs and Easy Tier automode cross-tier performance
management that promotes the hot extents to SSDs on a subvolume level, you can
additionally boost database performance and take advantage of the capacity on the
solid-state drive (SSD) tier and automatically adapt to changing workload conditions.
On previous DS8300/DS8100 systems, you benefited from using storage pool striping (rotate
extents) and striping on the storage level. You can create your redo logs and spread them
across as many extent pools and ranks as possible to avoid contention. On a DS8800/8700
system with Easy Tier, data placement and workload spreading in extent pools is automatic,
even across different storage tiers.
You still can divide your workload across your planned extent pools (hybrid or homogeneous)
and consider segregation on the storage level by using different storage classes or RAID
levels or by separating tablespaces from logs across different extent pools with regard to
failure boundary considerations.
However, if you consider striping on the AIX LVM level or the database level, for example,
Oracle Automatic Storage Management (ASM), you need to consider the best approaches
possible if you use it with Easy Tier and multi-tier configurations. Keep your physical partition
(PP) size or stripe size at a high value to have enough skew with Easy Tier to efficiently
promote hot extents.
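As a rough sketch of this approach, the following AIX commands create a volume group with a large PP size and spread a logical volume across several DS8000 LUNs; the volume group, logical volume, and hdisk names, and the 256 MB PP size, are hypothetical and must be adapted to your environment:

   # Scalable volume group with a large (256 MB) physical partition size
   mkvg -S -s 256 -y oradatavg hdisk4 hdisk5 hdisk6 hdisk7
   # Logical volume spread across all disks with the maximum inter-disk allocation policy
   mklv -y oradatalv -t jfs2 -e x oradatavg 160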
AIX LVM also features different mount options if you consider logical filesystems instead of
raw devices. For the AIX LVM options, see 9.2.4, “IBM Logical Volume Manager” on
page 343. Next, we show you different mount options for logical filesystems as the preferred
practices with Oracle databases. These mount options can be used for filesystems, as
described in “Mount options” on page 340:
Direct I/O (DIO):
– Data is transferred directly from the disk to the application buffer. It bypasses the file
buffer cache and avoids double caching (filesystem cache + Oracle System Global
Area (SGA)).
– Emulates a raw device implementation.
Concurrent I/O (CIO):
– Implicit use of DIO.
– No inode locking: Multiple threads can perform reads and writes on the same file at the
same time.
– Performance achieved by using CIO is comparable to raw devices.
– Avoid double caching: Some data is already cached in the Application layer (SGA).
– Provides faster access to the back-end disk and reduces the CPU utilization.
– Disables the inode-lock to allow several threads to read and write the same file (CIO
only).
– Because data transfer is bypassing the AIX buffer cache, Journaled File System 2
(JFS2) prefetching and write-behind cannot be used. These functions can be handled
by Oracle.
Comparing DIO, CIO, and raw devices, CIO is likely to perform in a similar manner to raw
devices, and raw is likely to show the best results. Additionally, when using JFS2, consider
using the INLINE log for filesystems so that it can have the log striped and not be just placed
in a single AIX physical partition (PP).
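The following AIX commands sketch how a JFS2 filesystem with an INLINE log can be created and mounted with the CIO option; the logical volume name and mount point are hypothetical, and CIO is normally used only for the filesystems that hold datafiles and redo logs, not for Oracle binaries:

   # JFS2 filesystem with an INLINE log, so the log is striped together with the data
   crfs -v jfs2 -d oradatalv -m /oradata -A yes -a logname=INLINE
   # Mount with concurrent I/O for the Oracle datafiles
   mount -o cio /oradata
   # Or make the CIO mount option persistent in /etc/filesystems
   chfs -a options=cio /oradata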
Other options that are supported by Oracle include the Asynchronous I/O (AIO), General
Parallel File System (GPFS), ASM, and raw device formats:
AIO:
– Allows multiple requests to be sent without having to wait until the disk subsystem
completes the physical I/O.
– Use of asynchronous I/O is advised no matter what type of filesystem and mount option
you implement (JFS, JFS2, CIO, or DIO).
GPFS:
– When implementing Oracle Real Application Clusters (RAC) environments, many
clients prefer to use a clustered filesystem. GPFS is the IBM clustered filesystem
offering for Oracle RAC on AIX. Other Oracle files, such as the ORACLE_HOME executable libraries and the archive log directories, do not need to be shared between instances. These files can either be placed on local disk, for example, by using JFS2 filesystems, or a single copy can be shared across the RAC cluster by using GPFS.
GPFS can provide administrative advantages, such as maintaining only one physical
ORACLE_HOME, which ensures that archive logs are always available (even when
nodes are down) when recoveries are required.
– When used with Oracle, GPFS automatically stripes files across all of the available disks within a GPFS filesystem by using a 1 MiB stripe size. Therefore, GPFS provides data and I/O distribution characteristics that are similar to PP spreading, LVM (large granularity) striping, and ASM coarse-grained striping techniques.
ASM:
– ASM is a database filesystem that provides cluster filesystem and volume manager
capabilities. ASM is an alternative to conventional filesystem and LVM functions.
– Integrated into the Oracle database at no additional cost for single or RAC databases.
– With ASM, the management of Oracle data files is the same for the DBA on all
platforms (UNIX, Linux, or Windows).
– Datafiles are striped across all ASM disks, and I/O is spread evenly to prevent hot
spots and maximize performance.
– Online add/drop of disk devices with automatic online redistribution of data.
– An ASM-managed database has approximately the same performance as a database
that is implemented in raw devices.
We review the Copy Services functions and give suggestions about preferred practices for
configuration and performance:
Copy Services introduction
FlashCopy
Metro Mirror
Global Copy
Global Mirror
z/OS Global Mirror
Metro/Global Mirror
There are two primary types of Copy Services functions: Point-in-Time Copy and Remote
Mirror and Copy. Generally, the Point-in-Time Copy functions are used for data duplication,
and the Remote Mirror and Copy functions are used for data migration and disaster recovery.
Table 18-1 is a reference chart for the Copy Services. The following copy operations are
available for each function:
Point-in-Time Copy:
– FlashCopy
– FlashCopy SE
Remote Mirror and Copy:
– Global Mirror
– Metro Mirror
– Global Copy
– Three-site Metro/Global Mirror with Incremental Resync
z/OS Global Mirror, previously known as Extended Remote Copy (XRC)
z/OS Metro/Global Mirror across three sites with Incremental Resync
z/OS Global Mirror: Extended Remote Copy (XRC)
z/OS Metro/Global Mirror: Three-site solution that uses synchronous PPRC and XRC
See the Interoperability Matrixes for the DS8000 and ESS to confirm which products are
supported on a particular disk subsystem.
For detailed information about the DS8000 Copy Services, see the following IBM Redbooks
publications:
DS8000 Copy Services for IBM System z, SG24-6787
IBM System Storage DS8000: Copy Services in Open Environments, SG24-6788
18.2 FlashCopy
FlashCopy can help reduce or eliminate planned outages for critical applications. FlashCopy
is designed to allow read and write access to the source data and the copy almost
immediately following the FlashCopy volume pair establishment.
Standard FlashCopy uses a normal volume as target volume. This target volume must be the
same size (or larger) than the source volume, and the space is allocated in the storage
subsystem.
IBM FlashCopy SE uses track space-efficient (TSE) volumes as FlashCopy target volumes. A
TSE target volume has a virtual size that is equal to or greater than the source volume size.
However, space is not allocated for this volume when the volume is created and the
FlashCopy is initiated. Only when updates are made to the source volume are the original tracks of the source volume that are about to be modified copied to the TSE volume. Space in the repository is
allocated for just these tracks (or for any write to the target itself).
Additionally, thin provisioning support for FlashCopy was introduced with LMC 7.6.2.xx.xx.
FlashCopy is supported to use extent space-efficient (ESE) volumes as source and target
volumes. At the time of writing this book, this enhancement is valid for Open Systems (FB
volumes) only.
An ESE volume has a virtual size, and the size of the ESE target volume can be equal to or
greater than the source volume size. When an ESE logical volume is created, the volume has
no real capacity allocated on the extent pool but only metadata used to manage space
allocation. An ESE volume extent is dynamically allocated when a write operation in the
extent occurs. In a FlashCopy relationship, an ESE target volume extent is then allocated either when its corresponding extent in the source volume is allocated or when a write operation occurs directly to this extent in the target volume.
There are several points to consider when you plan to use FlashCopy that might help you
minimize any impact that the FlashCopy operation can have on host I/O performance.
Read and write access to both the source and the target volume is possible almost immediately, while the optional T0 physical copy progresses in the background.
Optionally, you can suppress this background copy task by using the nocopy option.
FlashCopy SE supports the nocopy option only. From LMC 7.6.2.xx, you can also use
FlashCopy on a thin-provisioned (ESE) volume.
The nocopy option is efficient, for example, if you are making a temporary copy just to take a
backup to tape. With the nocopy option, the full background copy does not take place and the
actual copy of a track on the target volume occurs only following an update of that track on
either the source or target volume. Furthermore, with the nocopy option, the FlashCopy
relationship remains until explicitly withdrawn or until all the tracks are copied to the target
volume.
FlashCopy SE is designed for temporary copies, such as this instance. Copy duration
generally does not last longer than 24 hours unless the source and target volumes have little
write activity. FlashCopy SE is optimized for use cases where a small percentage of the
source volume is updated during the life of the relationship. If much more than 20% of the
source is expected to change, there might be trade-offs in performance as opposed to space
efficiency. In this case, standard FlashCopy might be considered as a good alternative.
FlashCopy on thin-provisioned (ESE) volumes is also space efficient when the ESE volume is used as both source and target volume, because only the data of allocated extents is copied. FlashCopy on thin-provisioned volumes can also be used with the nocopy option.
FlashCopy has several options. Not all options are available to all user interfaces. It is
important from the beginning to know the purpose for the target volume. Knowing this
purpose, the FlashCopy options can be identified and the environment that supports the
selected options can be chosen.
We examine when to use copy as opposed to no copy and where to place the FlashCopy
source and target volumes/LUNs. We also describe when and how to use incremental
FlashCopy, which you definitely need to evaluate for use in most applications.
Important: This chapter is valid for System z volumes and Open Systems LUNs. In the following sections of the present chapter, we use only the terms volume or volumes, but the text is equally valid if the terms LUN and LUNs are used, unless otherwise noted.
It is always best to locate the FlashCopy target volume on the same DS8000 processor
complex as the FlashCopy source volume, so that you can take advantage of code
optimization to reduce overhead when source and target are on the same processor complex.
It is also a preferred practice to locate the FlashCopy target volume on different ranks or even
different device adapter (DA) pairs than the source volume, particularly when background
copy is used.
Another available choice is whether to place the FlashCopy target volumes on the same ranks
as the FlashCopy source volumes. In general, it is best not to place these two volumes on the
same rank for the best performance. However, if source and target volumes need to be in the
same non-managed, homogenous multi-rank extent pool and use rotate extents (storage pool
striping) as an extent allocation method (EAM), consider consecutively created volumes as
source and target volumes.
Tip: To find the relative location of your volumes, you can use the following procedure:
1. Use the lsfbvol command to learn which extent pool contains the relevant volumes.
2. Use the showfbvol -rank command to learn in which rank the relevant volumes are
allocated.
3. Use the lsrank command to display both the device adapter (DA) and the rank for each
extent pool.
4. To determine which processor complex contains your volumes, look at the extent pool
ID. Even-numbered extent pools are always from Server 0, and odd-numbered extent
pools are always from Server 1.
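The following DSCLI sketch walks through this procedure for a hypothetical volume ID 1000 (the lines that start with # are explanatory comments):

   # 1. Identify the extent pool of the volume (extpool column of the lsfbvol output)
   lsfbvol 1000
   # 2. Show the ranks on which the extents of the volume are allocated
   showfbvol -rank 1000
   # 3. List the ranks together with their extent pool and device adapter (DA) pair
   lsrank -l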
With FlashCopy nocopy relationships, the DS8000 performs copy-on-write for each first
change to a source volume track. If the disks of the target volume are slower than the disks of
the source volume, copy-on-write might slow down production I/O. A full copy FlashCopy
produces a high write activity on the disk drives of the target volume.
Therefore, it is always a preferred practice to use target volumes on ranks with the same
characteristics as the source volumes.
Finally, you can achieve a small performance improvement by using identical rank geometries for both the source and target volumes. If the source volumes are on a rank with a 7+P configuration, place the target volumes on a rank that is also configured as 7+P.
The FlashCopy establish phase is the period when the microcode is preparing the bitmaps
that are necessary to create the FlashCopy relationship so that the microcode can correctly
process later reads and writes to the related volumes. It takes only a few seconds to establish
the FlashCopy relationships for tens to hundreds or more volume pairs. The copy is then
immediately available for both read and write access. During this logical FlashCopy period, no
writes are allowed to the source and target volume. However, this period is short. After the
logical relationship is established, normal I/O activity is allowed to both source and target
volumes according to the options selected.
Finally, the placement of the FlashCopy source and target volumes affects the establish
performance. Table 18-3 on page 590 shows a summary of the recommendations.
If many volumes are established, do not expect to see all pairs actively copying data as soon
as their logical FlashCopy relationship is completed. The DS8000 microcode has algorithms
that limit the number of active pairs copying data. This algorithm tries to balance active copy
pairs across the DS8000 DA resources. Microcode gives higher preference to application
activity than copy activity.
Tip: When creating many FlashCopy pairs, we suggest that all commands are submitted
simultaneously, and you allow the DS8000 microcode to manage the internal resources. If
using the DS8000 command-line interface (DSCLI), we suggest that you use single
commands for many devices, rather than many commands with each device.
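For example, a single DSCLI mkflash command can establish a whole set of FlashCopy pairs at the same time; the storage image ID and the volume IDs below are hypothetical:

   # Establish 16 FlashCopy pairs (sources 1000-100F, targets 1100-110F) in one command
   mkflash -dev IBM.2107-75ABCD1 1000-100F:1100-110F
   # The same set of pairs without background copy
   mkflash -dev IBM.2107-75ABCD1 -nocp 1000-100F:1100-110F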
Full box copy: The term full box copy implies that all rank resources are involved in the
copy process. Either all or nearly all ranks have both source and target volumes, or half the
ranks have source volumes and half the ranks have target volumes.
For full box copies, still place the source and target volumes in different ranks. When all ranks
are participating in the FlashCopy, you can still place the source and target volumes in
different ranks by performing a FlashCopy of volumes on rank R0 onto rank R1 and volumes
on rank R1 onto rank R0, for example. Additionally, if there is heavy application activity in the
source rank, performance is less affected if the background copy target was in another rank
that has lighter application activity.
Important: If storage pool striping is used when allocating volumes, all ranks are more or
less equally busy. Therefore, there is less need to be concerned about the data placement.
But, ensure that you still keep the source and the target on the same processor complex.
If the FlashCopy relationship is established with the -nocp (no copy) option, only the first write update to a track on the source volume or the target volume forces a copy of that track from the source to the target. This forced copy is also called a copy-on-write.
Copy-on-write: The term copy-on-write describes a forced copy from the source to the
target, because a write to the source occurred. This forced copy occurs on the first write to
a track. Because the DS8000 writes to nonvolatile cache, there is typically no direct
response time delay on host writes. A write to the source results in a copy of the track.
Consider all business requirements: The suggestions discussed in this chapter only
consider the performance aspects of a FlashCopy implementation. But FlashCopy
performance is only one aspect of an intelligent system design. You must consider all
business requirements when designing a total solution. These additional requirements,
together with the performance considerations, guide you when choosing FlashCopy
options, such as copy or no copy and incremental, as well as when choosing source and
target volume location.
The placement of the source and target volumes significantly affects the application
performance. In addition to the placement of volumes, the selection of copy or no copy is also
an important consideration about the effect on the application performance. Typically, the
choice of copy or no copy depends primarily on how the FlashCopy is to be used and for what
interval of time the FlashCopy relationship exists. From a purely performance point of view,
the choice of whether to use copy or no copy depends on the type of workload. The general considerations for each option are described next.
FlashCopy nocopy
In a FlashCopy nocopy relationship, a copy-on-write is done whenever a write to a source
track occurs for the first time after the FlashCopy is established. This type of FlashCopy is
ideal when the target volumes are needed for a short time only, for example, to run the
backup jobs. FlashCopy nocopy adds only a minimal workload on the back-end adapters and
disk drives. However, it affects most of the writes to the source volumes as long as the
relationship exists. When you plan to keep your target volumes for a long time, this choice
might not be the best solution.
Incremental FlashCopy
Another important performance consideration is whether to use incremental FlashCopy.
Use incremental FlashCopy when you always perform FlashCopies to the same target volumes at regular time intervals. Without the nocopy option, the first FlashCopy is a full
copy, but later FlashCopy operations copy only the tracks of the source volume that are
modified since the last FlashCopy.
Incremental FlashCopy has the least effect on applications. During normal operations, no
copy-on-write is done (as in a nocopy relationship). And during a resync, the load on the back
end is lower compared to a full copy. There is only a small overhead for the maintenance of
out-of-sync bitmaps for the source and target volumes.
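With the DSCLI, an incremental FlashCopy relationship is typically established with change recording and persistence enabled and is later refreshed with the resyncflash command; the storage image ID and volume IDs in this sketch are hypothetical:

   # Initial establish: persistent relationship with change recording for later increments
   mkflash -dev IBM.2107-75ABCD1 -record -persist 1000-100F:1100-110F
   # Each later refresh copies only the tracks that changed since the previous FlashCopy
   resyncflash -dev IBM.2107-75ABCD1 -record -persist 1000-100F:1100-110F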
Restriction: At the time of writing this book (DS8000 LMC R6.2), thin-provisioned (ESE)
volumes have these restrictions:
ESE volumes are only supported for Open Systems FB volumes.
ESE volume support for Copy Services is currently limited to FlashCopy.
Data from source volumes is copied to space-efficient target volumes. The data is written to a
repository, and there is a mapping mechanism to map the physical tracks to the logical tracks.
See Figure 18-2. Each time that a track in the repository is accessed, it must go through this
mapping process. The attributes of the volume that hosts the repository are important when
planning a FlashCopy SE environment.
Figure 18-2 FlashCopy SE repository: NVS destaging through the track table of the repository
Because of space efficiency, data is not physically ordered in the same sequence on the
repository disks as it is on the source. Processes that might access the source data in a
sequential manner might not benefit from sequential processing when accessing the target.
Another possibility is to consider RAID 10 for the repository, although that goes somewhat
against space efficiency (you might be better off using standard FlashCopy with RAID 5 than
FlashCopy SE with RAID 10). However, there might be cases where trading off some of the
space efficiency gains for a performance boost justifies RAID 10. If RAID 10 is used at the
source, consider it also for the repository.
RAID: All destages to the repository volume are random destages. RAID 10 performs
better than RAID 5, and RAID 5 performs better than RAID 6. There is no advantage in
using RAID 6 for the repository other than resilience. Only consider RAID 6 on systems
where RAID 6 is used as standard throughout the whole DS8000 and the write activity is
low.
The repository always uses storage pool striping when in a multi-rank extent pool. With
storage pool striping, the repository space is striped across multiple RAID arrays in an extent
pool, which helps to balance the volume skew that might appear on the sources. It is
generally best to use at least four RAID arrays in the multi-rank extent pool intended to hold
the repository.
Finally, try to use at least the same number of disk spindles on the repository as the source
volumes. Avoid severe “fan in” configurations, such as 32 ranks of source disk being mapped
to an eight rank repository. This type of configuration likely has performance problems unless
the update rate to the source is modest.
It is possible to share the repository with production volumes on the same extent pool, but use
caution, because contention between the repository and the production volumes can affect
performance. In this case, the repository for one extent pool can be placed in a different
extent pool so that source and target volumes are on different ranks but on the same
processor complex.
Expect a high random write workload for the repository. To prevent the repository from
becoming overloaded, take the following precautions:
Avoid placing standard source and repository target volumes in the same extent pool.
Have the repository in an extent pool with several ranks (a repository is always striped) on
the same rank group or storage server as the source volumes.
Use fast 15K rpm disk drives for the repository ranks.
Consider using RAID 10 instead of RAID 5, because RAID 10 can sustain a higher
random write workload.
Do not use RAID 6 for the repository unless the write activity for the source volumes is low.
Because FlashCopy SE does not need much capacity if your update rate is not too high, you
might want to make several FlashCopies from the same source volume. For example, you
might want to make a FlashCopy several times a day to set checkpoints to protect your data
against viruses or for other reasons.
There are no restrictions on the amount of virtual space or the number of SE volumes that
can be defined for either z/OS or Open Systems storage.
Metro Mirror is typically used for applications that cannot suffer any data loss in the event of a failure.
As data is transferred synchronously, the distance between primary and secondary disk
subsystems determines the effect on application response time. Figure 18-3 illustrates the
sequence of a write update with Metro Mirror.
Figure 18-3 Metro Mirror write sequence
When the application performs a write update operation to a primary volume, this process
happens:
1. Write to primary volume (DS8000 cache)
2. Write to secondary (DS8000 cache)
3. Signal write complete on the secondary DS8000
4. Post I/O complete to host server
The Fibre Channel connection between primary and secondary subsystems can be direct, through a Fibre Channel SAN switch, or via a SAN router that uses Fibre Channel over Internet Protocol (FCIP).
The logical Metro Mirror paths are transported over physical links between the disk
subsystems. The physical link includes the HA in the primary DS8000, the cabling, switches
or directors, any wide band or long-distance transport devices (DWDM, channel extenders, or
WAN), and the HAs in the secondary disk subsystem. Physical links can carry multiple logical
Metro Mirror paths as shown in Figure 18-4 on page 598.
Although one Fibre Channel (FC) link has sufficient bandwidth for most Metro Mirror
environments, for redundancy reasons, we suggest that you configure at least two FC links
between each primary and secondary disk subsystem. For better performance, use as many
as the supported maximum of eight links. These links must take diverse routes between the
DS8000 locations.
Dedicating Fibre Channel ports for Metro Mirror use ensures no interference from host I/O
activity, which we suggest with Metro Mirror, because it is time-critical and must not be
affected by host I/O activity. The Metro Mirror ports that are used provide connectivity for all
LSSs within the DS8000 and can carry multiple logical Metro Mirror paths.
Distance
The distance between your primary and secondary DS8000 subsystems affects the response
time overhead of the Metro Mirror implementation. With the requirement of diverse
connections for availability, it is common to have certain paths that are longer distance than
others. Contact your IBM Field Technical Sales Specialist (FTSS) to assist you in assessing
your configuration and the distance implications if necessary.
The maximum supported distance for Metro Mirror is 300 km (186.4 miles). There is
approximately a 1 ms overhead per 100 km (62 miles) for write I/Os (this relationship between
latency and physical distance might differ when you use a wide area network (WAN)).
Distances of over 300 km (186.4 miles) are possible and supported by RPQ. The DS8000
Interoperability Matrix provides the details of SAN, network, and DWDM supported devices.
Due to network configuration variability, the client must work with the channel extender vendor
to determine the appropriate configuration to meet its requirements.
Figure 18-4 shows an example where we have a 1:1 mapping of source to target LSSs, and
where the three logical paths are accommodated in one Metro Mirror link:
LSS1 in DS8000 1 to LSS1 in DS8000 2
LSS2 in DS8000 1 to LSS2 in DS8000 2
LSS3 in DS8000 1 to LSS3 in DS8000 2
Alternatively, if the volumes in each of the LSSs of DS8000 1 map to volumes in all three
secondary LSSs in DS8000 2, there are nine logical paths over the Metro Mirror link (not fully
illustrated in Figure 18-4). We suggest a 1:1 LSS mapping.
Figure 18-4 Logical Metro Mirror paths between LSS pairs carried over a physical Metro Mirror link
For Metro Mirror, consistency requirements are managed through use of the consistency
group or Critical Mode option when you define Metro Mirror paths between pairs of LSSs.
Volumes or LUNs, which are paired between two LSSs whose paths are defined with the
consistency group option, can be considered part of a consistency group.
Consistency is provided with the extended long busy (for z/OS) condition or queue full (for
Open Systems) condition. These conditions are triggered when the DS8000 detects a
condition where it cannot update the Metro Mirror secondary volume. The volume pair that
first detects the error goes into the extended long busy or queue full condition, so that it does
not perform any I/O. For z/OS, a system message is issued (IEA494I state change message);
for Open Systems, a Simple Network Management Protocol (SNMP) trap message is issued.
These messages can be used as triggers for automation purposes to provide data consistency across the mirrored volumes.
Bandwidth
Before establishing your Metro Mirror solution, you must determine your peak bandwidth
requirement. Determining your peak bandwidth requirement helps to ensure that you have
enough Metro Mirror links in place to support that requirement.
To avoid any response time issues, establish the peak write rate for your systems and ensure
that you have adequate bandwidth to cope with this load and to allow for growth. Remember
that only writes are mirrored across to the target volumes after synchronization.
There are tools to assist you, such as Tivoli Storage Productivity Center (TPC) or the
operating system-dependent tools, such as iostat. Another method, but not so exact, is to
monitor the traffic over the FC switches by using FC switch tools and other management
tools, and remember that only writes are mirrored by Metro Mirror. You can also understand
the proportion of reads to writes by issuing the datapath query devstats command on
Subsystem Device Driver (SDD)-attached servers.
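For example, on an AIX host, the write portion of the workload can be estimated with standard tools before the links are sized; the interval and count values in this sketch are arbitrary:

   # Extended per-disk statistics, including write throughput, in ten 60-second intervals
   iostat -D 60 10
   # Cumulative read and write counters per device on SDD-attached servers
   datapath query devstats
   # Equivalent statistics when SDDPCM is used
   pcmpath query devstats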
A single 8 Gb Fibre Channel link of DS8800 can provide approximately 800 MBps throughput
for the Metro Mirror establish. A single 4 Gb Fibre Channel link of DS8700 can provide
approximately 400 MBps throughput for the Metro Mirror establish. This capability scales up
linearly with additional links up to seven links. The maximum of eight links for an LSS pair
provides a throughput of approximately 3,200 MBps with 4 Gbps Fibre Channel links.
Two-link minimum: A minimum of two links is suggested between each DS8000 pair for
resilience. The remaining capacity with a failed link can maintain synchronization.
LSS design
Because the DS8000 makes the LSS a topological construct, which is not tied to a physical
array as in the ESS, the design of your LSS layout can be simplified. It is now possible to
assign LSSs to applications, for example, without concern about the under-allocation or the
over-allocation of physical disk subsystem resources. Assigning LSSs to applications can
also simplify the Metro Mirror environment, because it is possible to reduce the number of
commands that are required for data consistency.
Volume allocation
As an aid to planning and management of your Metro Mirror environment, we suggest that
you maintain a symmetrical configuration in both physical and logical elements. As well as
making the maintenance of the Metro Mirror configuration easier, maintaining a symmetrical
configuration in both physical and logical elements helps you to balance the workload across
the DS8000. Figure 18-5 on page 600 shows a logical configuration. This idea applies equally
to the physical aspects of the DS8000. You need to attempt to balance the workload and
apply symmetrical concepts to all aspects of your DS8000, which has the following benefits:
Ensure even performance: The secondary site volumes must be created on ranks with
DDMs of the same capacity and speed as the primary site.
Simplify management: It is easy to see where volumes are mirrored and processes can be
automated.
Reduce administrator overhead: There is less administrator overhead due to automation
and the simpler nature of the solution.
Figure 18-5 shows this idea in a graphical form. DS8000 #1 has Metro Mirror paths defined to
DS8000 #2, which is in a remote location. On DS8000 #1, volumes defined in LSS 00 are
mirrored to volumes in LSS 00 on DS8000 #2 (volume P1 is paired with volume S1, P2 with
S2, and P3 with S3). Volumes in LSS 01 on DS8000 #1 are mirrored to volumes in LSS 01 on
DS8000 #2. Requirements for additional capacity can be added in a symmetrical way also by
the addition of volumes into existing LSSs, and by the addition of new LSSs when needed (for
example, the addition of two volumes in LSS 03 and LSS 05 and one volume to LSS 04 makes these LSSs have the same number of volumes as the other LSSs). Additional volumes can then be distributed evenly across all LSSs, or LSSs can be added.
Consider an asymmetrical configuration where the primary site has volumes defined on ranks
comprised of 146 GB DDMs. The secondary site has ranks comprised of 300 GB DDMs.
Because the capacity of the destination ranks is double that of the source ranks, it seems
feasible to define twice as many LSSs per rank on the destination side. However, this
situation, where four primary LSSs on four ranks feed into four secondary LSSs on two ranks,
creates a performance bottleneck on the secondary rank and slows down the entire Metro
Mirror process.
We also suggest that you maintain a symmetrical configuration in both physical and logical
elements between primary and secondary storage systems in a Metro Mirror relationship
when using Easy Tier automatic mode. This approach ensures that the same level of
optimization and performance can be achieved on the secondary system after the production
workload is switched to the secondary site and after Easy Tier successfully completes
learning about the production workload and finishes data relocation on the secondary
system, which requires additional time after the failover.
For more information, see 18.8, “Considerations for Easy Tier and remote replication” on
page 626.
Bandwidth: Consider the bandwidth that you need to mirror all volumes. The amount of
bandwidth might not be an issue if there are many volumes with a low write I/O rate.
Review data from Tivoli Storage Productivity Center for Disk if it is available.
You can choose not to mirror all volumes (for example, swap devices for Open Systems or
temporary work volumes for z/OS can be omitted). In this case, you must carefully control
what data is placed on the mirrored volumes (to avoid any capacity issues) and what data is
placed on the non-mirrored volumes (to avoid missing any required data). You can place all
mirrored volumes in a particular set of LSSs, in which all volumes are Metro Mirror enabled,
and direct all data that requires mirroring to these volumes.
For testing purposes, additional volumes can be configured at the remote site. These
volumes can be used to take a FlashCopy of a consistent Metro Mirror image on the
secondary volume and then allow the synchronous copy to restart while testing is performed.
To create a consistent copy for testing, the host I/O needs to be quiesced, or automation
code, such as Geographically Dispersed Parallel Sysplex™ (GDPS) and Tivoli Storage
Productivity Center for Replication, needs to be used to create a consistency group on the
primary disks so that all dependent writes are copied to the secondary disks. You can also
use consistent FlashCopy on the Metro Mirror secondary devices without suspending the
pairs.
18.3.3 Scalability
The DS8000 Metro Mirror environment can be scaled up or down as required. If new volumes
are added to the DS8000 that require mirroring, they can be dynamically added. If additional
Metro Mirror paths are required, they also can be dynamically added.
Important: The mkpprcpath command is used to add Metro Mirror paths. If paths are
already established for the LSS pair, they must be included in the mkpprcpath command
together with any additional path or the existing paths are removed.
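The following DSCLI sketch adds a second port pair to an existing Metro Mirror path between LSS 00 on both systems. The storage image IDs, the WWNN, and the I/O port IDs are hypothetical, and the -consistgrp option is shown only for configurations that use consistency groups:

   # The path currently uses I0010:I0110; to add I0140:I0240, both pairs must be specified
   mkpprcpath -dev IBM.2107-75ABCD1 -remotedev IBM.2107-75WXYZ1 -remotewwnn 5005076303FFC123 -srclss 00 -tgtlss 00 -consistgrp I0010:I0110 I0140:I0240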
This function is appropriate for remote data migration, off-site backups, and the transmission
of inactive database logs at virtually unlimited distances. See Figure 18-7.
Figure 18-7 Global Copy write sequence
The primary volume remains in the Copy Pending state while the Global Copy session is
active. This status only changes if a command is issued or the links between the storage
subsystems are lost.
A path must be established between the source LSS and target LSS over a Fibre Channel link. The major difference is the distance over which Global Copy can operate. Because it is a non-synchronous copy technique, Global Copy can operate at virtually unlimited distances.
Global copy: The consistency group is not specified on the establish path command.
Data on Global Copy secondaries is not consistent so there is no need to maintain the
order of dependent writes.
The decision about when to use Global Copy depends on a number of factors:
The recovery of the system does not need to be current with the primary application
system.
There is a minor impact to application write I/O operations at the primary location.
The recovery uses copies of data created by the user on tertiary volumes.
Distances beyond FCP limits are required: 300 km (186 miles) for FCP links (RPQ for
greater distances).
You can use Global Copy as a tool to migrate data between data centers.
Distance
The maximum (supported) distance for a direct Fibre Channel connection is 10 km (6.2
miles). If you want to use Global Copy over longer distances, you can use the following
connectivity technologies to extend this distance:
Fibre Channel routers using Fibre Channel over Internet Protocol (FCIP)1
Dense Wavelength Division Multiplexers (DWDM) on fiber
A simple way to envision DWDM is to consider that at the primary end, multiple fiber optic
input channels, such as Fibre Channel, FICON, or Gbit Ethernet, are combined by the DWDM
into a single fiber optic cable. Each channel is encoded as light of a different wavelength. You
might think of each channel as an individual color; the DWDM system is transmitting a
rainbow. At the receiving end, the DWDM fans out the different optical channels. DWDM, by
the nature of its operation, provides the full bandwidth capability of the individual channel.
Because the wavelength of light is from a practical perspective infinitely divisible, DWDM
technology is only limited by the sensitivity of its receptors for the total possible aggregate
bandwidth. You must contact the multiplexer vendor regarding hardware and software
prerequisites when using the vendor’s products in a DS8000 Global Copy configuration.
1 Fibre Channel routers that use FCIP over wide area network (WAN) lines are also referred to as channel extenders.
See “Creating a consistent point-in-time copy” in the “Global Copy options and configuration”
chapter in DS8000 Copy Services for IBM System z, SG24-6787, and IBM System Storage
DS8000: Copy Services in Open Environments, SG24-6788.
You can estimate the Global Copy application impact to be similar to the impact of the
application when working with Metro Mirror suspended volumes. For the DS8000, there is
additional work to do with the Global Copy volumes compared to the suspended volumes,
because with Global Copy, the changes must be sent to the remote DS8000. But this impact
is negligible overhead for the application compared with the typical synchronous overhead.
There are no host system resources consumed by Global Copy volume pairs, excluding any
management solution, because the Global Copy is managed by the DS8000 subsystem.
18.4.3 Scalability
The DS8000 Global Copy environment can be scaled up or down as required. If new volumes
that require mirroring are added to the DS8000, they can be dynamically added. If additional
Global Copy paths are required, they also can be dynamically added.
Addition of capacity
The logical nature of the LSS makes a Global Copy implementation on the DS8000 easier to
plan, implement, and manage. However, if you need to add more LSSs to your Global Copy
environment, your management and automation solutions must be set up to add this capacity.
With Global Mirror, the host write to the A volume is acknowledged immediately, Global Copy sends the data asynchronously to the secondary B volume, and the B volume is flashed automatically to the C volume.
Automatic cycle in active session
The DS8000 manages the sequence to create a consistent copy at the remote site
(Figure 18-10 on page 608):
Asynchronous long-distance copy (Global Copy) with little to no impact to application
writes.
Momentarily pause for application writes (fraction of a millisecond to a few milliseconds).
Create point-in-time consistency group across all primary subsystems in out-of-sync
(OOS) bitmap. New updates are saved in the Change Recording bitmap.
Restart application writes and complete the write (drain) of point-in-time consistent data to
the remote site.
Stop the drain of data from the primary after all consistent data is copied to the secondary.
Logically FlashCopy all data to C volumes to preserve consistent data.
Restart Global Copy writes from the primary.
Automatic repeat of sequence from once per second to hours (this choice is selectable).
Figure 18-10 Global Mirror volumes: Global Copy from the A volumes to the B volumes and FlashCopy from the B volumes to the C volumes
The data at the remote site is current within 3 - 5 seconds, but this RPO depends on the
workload and bandwidth available to the remote site.
Using this copy for recovery: The copy created with the consistency group is a power-fail
consistent copy, not necessarily an application-based consistent copy. When you use this
copy for recovery, you might need to perform additional recovery operations, such as the
fsck command in an AIX filesystem.
This section explains performance aspects of planning and configuring Global Mirror, together with the potential impact to application write I/Os caused by the process that is used to form a consistency group.
We also consider distributing the target Global Copy and target FlashCopy volumes across
various ranks to balance the load over the entire target storage server and minimize the I/O
load for selected busy volumes.
If the primary DS8000 is already configured, you can measure the current performance by
using tools, such as Tivoli Storage Productivity Center for Disk, to fix any performance
bottlenecks caused by the configuration before the remote copy is established.
The PPRC links must use dedicated DS8000 HA ports to avoid any conflict with host I/O. If
any subordinate storage subsystems are included in the Global Mirror session, the FC links to
those subsystems must also use dedicated HA ports.
The cost of providing these links can be high if they are sized to provide this RPO under all
circumstances. If it is acceptable to allow the RPO to increase slightly during the highest
workload times, you might be able to reduce the bandwidth and the cost of the link. In many
instances, the highest write rate occurs overnight during backup processing, and increased
RPO can be tolerated.
The difference in bandwidth and costs for maintaining an RPO of a few seconds might be
double that of maintaining an RPO of a few minutes at peak times. Recovery to the latest
consistency group is immediate after the peak passes; there is no catch-up time.
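To make this trade-off concrete, the following rough Python sketch compares sizing the links for the absolute peak write rate (needed to hold an RPO of a few seconds at all times) against sizing for the peak outside the overnight backup window and tolerating a temporarily increased RPO. The hourly write profile is an invented example, not measured data.

```python
# Rough illustration of the RPO versus link-bandwidth trade-off.
# The hourly write rates (MBps) are invented example values, not measurements.
hourly_write_mbps = [40, 35, 30, 30, 45, 60, 90, 120, 150, 160, 155, 150,
                     145, 150, 160, 170, 165, 150, 120, 100, 300, 320, 310, 80]

backup_window = {20, 21, 22}            # assumed overnight backup hours (peak writes)

peak = max(hourly_write_mbps)
average = sum(hourly_write_mbps) / len(hourly_write_mbps)

# To keep the RPO at a few seconds at all times, the links must absorb the peak.
bandwidth_for_seconds_rpo = peak

# If a larger RPO is acceptable during the backup window, the links can be sized
# nearer the rest-of-day peak; Global Mirror catches up once the peak passes.
peak_outside_backup = max(rate for hour, rate in enumerate(hourly_write_mbps)
                          if hour not in backup_window)

print(f"Peak write rate:              {peak} MBps")
print(f"Average write rate:           {average:.0f} MBps")
print(f"Size for a seconds-level RPO: {bandwidth_for_seconds_rpo} MBps")
print(f"Size if the RPO may grow:     {peak_outside_backup} MBps")
```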
The FlashCopy used as a part of this Global Copy operation is running in nocopy mode and
causes additional internally triggered I/Os within the target storage server for each write I/O to
the FlashCopy source volume, that is, the Global Copy target volume. This I/O is to preserve
the last consistency group.
Each Global Copy write to its secondary volume during the time period between the formation
of successive consistency groups causes an actual FlashCopy write I/O operation on the
target DS8000 server. Figure 18-11 summarizes approximately what happens between two
consistency group creation points when the application writes are received.
Figure 18-11 Global Copy with write hit at the remote site
The following steps show the FlashCopy write I/O operation on the target DS8000 server
(follow the numbers in Figure 18-11):
1. The application write I/O completes immediately to volume A1 at the local site.
2. Global Copy nonsynchronously replicates the application I/O and reads the data at the
local site to send to the remote site.
3. The modified track is written across the link to the remote B1 volume.
4. FlashCopy nocopy sees that the track is about to change.
5. The track is written to the C1 volume before the write to the B1 volume.
This process is an approximation of the sequence of internal I/O events. There are
optimization and consolidation effects that make the entire process efficient.
Figure 18-11 shows the normal sequence of I/Os within a Global Mirror configuration. The critical path is between points (2) and (3). Usually, (3) is simply a write hit in NVS on B1, and some time after (3) completes, the original FlashCopy source track is copied from B1 to C1.
If NVS is overcommitted in the secondary storage server, there is a potential impact on the
Global Copy data replication operation performance. See Figure 18-12 on page 611.
Figure 18-12 Application write I/O within two consistency group points
Figure 18-12 summarizes roughly what happens when NVS in the remote storage server is
overcommitted. A read (3) and a write (4) to preserve the source track and write it to the C
volume are required before the write (5) can complete. Eventually, the track gets updated on
the B1 volume to complete the write (5). But usually, all writes are quick writes to cache and
persistent memory and happen in the order as outlined in Figure 18-10 on page 608.
You can obtain a more detailed explanation of this processing in DS8000 Copy Services for
IBM System z, SG24-6787, and IBM System Storage DS8000: Copy Services in Open
Environments, SG24-6788.
Default values: In most environments, use the default values. These default values are
maximum intervals, and in practice, the actual interval is usually shorter.
The default for the maximum coordination time is 50 ms, which is a small value compared to
other I/O timeout values, such as the missing-interrupt handler (MIH) (30 seconds) or Small
Computer System Interface (SCSI) I/O timeouts. Even in error situations that might trigger this timeout, Global Mirror protects production performance rather than affecting production by attempting to form consistency groups at a time when error recovery or other problems might be occurring.
(Figure content: (1) Global Copy with serialization of all primary volumes during the coordination time; (2) drain of data from the local site to the remote site during the drain time; (3) FlashCopy from the secondary (B2) to the tertiary (C2) volumes.)
Figure 18-13 Coordination time and how it impacts application write I/Os
The coordination time, which you can limit by specifying a number of milliseconds, is the
maximum impact to the application write I/Os that you allow when forming a consistency
group. The intention is to keep the coordination time value as small as possible. The default of
50 ms might be high in a transaction processing environment. A valid number might also be in
the single digit range. The required communication between the Master storage server and
potential Subordinate storage servers is in-band over PPRC paths between the Master and
Subordinates. This communication is highly optimized and you can minimize the potential
application write I/O impact to 3 ms, for example. There must be at least one PPRC FC link
between a Master storage server and each Subordinate storage server, although for
redundancy, we suggest that you use two PPRC FC links.
One of the key design objectives for Global Mirror is to not affect the production applications.
The consistency group formation process involves holding production write activity to create
dependent write consistency across multiple devices and multiple disk subsystems. This
process must be fast enough that the impact is small. With Global Mirror, the process of
forming a consistency group is designed to take 1 - 3 ms. If we form consistency groups every
3 - 5 seconds, the percentage of production writes affected and the degree of impact is small.
The following example shows the type of impact that might be seen from consistency group
formation in a Global Mirror environment.
We assume 24000 I/Os per second with a 3:1 R/W ratio, so we perform 6000 write I/Os per second. Each write I/O takes 0.5 ms, and it takes 3 ms to create a consistent set of data. Approximately 0.0035 x 6000 = 21 write I/Os are affected by the creation of each consistency group (the 0.0035-second window is the 3 ms coordination time plus one 0.5 ms write).
If each of these 21 I/Os experiences a 3 ms delay, and this delay happens every 3 seconds, we have an average response time (RT) delay of (21 x 0.003)/18000 = 0.0035 ms, because 6000 writes per second over 3 seconds is 18000 writes.
A 0.0035 ms average impact to a 0.5 ms write is a 0.7% increase in response time, and normal performance reporting tools do not detect this level of impact.
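The same arithmetic, expressed as a short Python calculation using the assumed workload figures from the example above:

```python
# Consistency group formation impact, using the example figures above.
total_iops = 24000
read_write_ratio = 3                       # 3:1 reads to writes
write_iops = total_iops / (read_write_ratio + 1)          # 6000 writes per second

write_service_time_s = 0.0005              # 0.5 ms per write
coordination_time_s = 0.003                # 3 ms to create a consistent set of data
cg_every_s = 3                             # one consistency group every 3 seconds

# Writes arriving during the ~3.5 ms window while consistency is being formed
affected_writes = (coordination_time_s + write_service_time_s) * write_iops   # ~21

# Average delay spread over all writes between two consistency groups
writes_per_interval = write_iops * cg_every_s                                  # 18000
avg_delay_s = (affected_writes * coordination_time_s) / writes_per_interval

print(f"Affected writes per formation: {affected_writes:.0f}")
print(f"Average added delay: {avg_delay_s * 1000:.4f} ms")        # 0.0035 ms
print(f"Relative impact: {avg_delay_s / write_service_time_s:.1%}")  # 0.7%
```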
If a consistency group cannot be completed, the previous consistency group is still available on the C devices, so the effect of this situation is only that the RPO increases for a short period. The primary disk subsystem evaluates when it is possible to continue to form consistency groups and restarts consistency group formation at that time.
The default for the maximum drain time is 30 seconds, which allows a reasonable time to send a consistency group while ensuring that, if there is a non-fatal network or communications issue, we do not wait too long before evaluating the situation and potentially dropping into Global Copy mode until the situation is resolved. In this way, we again protect production performance rather than attempting (and possibly failing) to form consistency groups at a time when forming consistency groups might be inappropriate.
If we are unable to form consistency groups for 8 hours, by default, Global Mirror forms a
consistency group without regard to the maximum drain time. It is possible to change this time
if this behavior is undesirable in a particular environment.
The actual replication process usually does not affect the application write I/O. There is a
slight chance that the same track within a consistency group is updated before this track is
replicated to the secondary site within the specified drain period. When this unlikely event
happens, the affected track is immediately (synchronously) replicated to the secondary
storage server before the application write I/O modifies the original track. In this exceptional
case, the application write I/O is affected, because it must wait for the write to complete at the
remote site as in a Metro Mirror synchronous configuration.
Later writes to this same track do not experience any delay, because the tracks are already
replicated to the remote site.
However, because it also increases the time between successive FlashCopies, increasing
this value is not necessary and might be counterproductive in high-bandwidth environments,
because frequent consistency group formation reduces the overhead of Copy on Write
processing.
The default for the consistency group interval is 0 seconds, so Global Mirror continuously forms consistency groups as fast as the environment allows. In most situations, we suggest leaving this parameter at the default and allowing Global Mirror to form consistency groups as frequently as possible.
There are only production volumes on the production site. Configure the DS8000 for the best
performance as we discussed in Chapter 4, “Logical configuration performance
considerations” on page 87. At the same time, the storage server needs to be able to handle
both production and replication workloads. In general, create a balanced configuration that
uses all device adapters, ranks, and processor complexes.
At the recovery site, you have to consider Global Mirror volumes only, but there are two types: the Global Copy targets (B volumes) and the FlashCopy targets (C volumes). Where the B volume is used for production in a failover situation, the DDM size can be double the production site DDM size: Global Mirror still gives an identical number of spindles and capacity, because the FlashCopy volume is not in use in this situation. Where a fourth volume is used to facilitate disaster recovery testing without removing the Global Mirror copy facility for the duration of the test, only 50% more drives are required if you use double-capacity DDMs.
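The drive-count effect is easy to verify with a rough calculation, as in the following sketch (the 100-drive production figure is an arbitrary example):

```python
# Rough drive-count comparison for the recovery site, assuming the production
# site uses 100 DDMs of normalized capacity 1 for the A volumes (example figures).
production_drives = 100
ddm_capacity = 1.0

# The recovery site must hold B + C volumes (2x capacity). With double-capacity
# DDMs, the same number of spindles is enough, and in a failover only B is active.
recovery_capacity = 2 * production_drives * ddm_capacity
recovery_drives_double_ddm = recovery_capacity / (2 * ddm_capacity)      # 100

# Adding a D volume for disaster recovery testing raises the need to 3x capacity,
# which with double-capacity DDMs is only 50% more drives than production.
recovery_capacity_with_d = 3 * production_drives * ddm_capacity
recovery_drives_with_d = recovery_capacity_with_d / (2 * ddm_capacity)   # 150

print(recovery_drives_double_ddm, recovery_drives_with_d)
```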
With Easy Tier automatic mode controlling the extent distribution in managed extent pools,
consider allocating the B and C volumes on the recovery site in the same multi-rank extent
pool. Easy Tier spreads busy extents across ranks and optimizes overall extent distribution
across ranks and tiers in the extent pool based on the workload profile. The workload is
balanced across the resources in the extent pool and performance bottlenecks are avoided.
This approach provides optimal performance.
You can still separate workloads into different extent pools by following the principles of workload isolation as described in 4.8, “Planning extent pools” on page 115, or use a manual approach as described in “Remote DS8000 configuration” without Easy Tier automatic mode.
Through a one-to-one mapping from local to remote storage server, you achieve the same
configuration at the remote site for the B volumes and the C volumes. Figure 18-14 on
page 615 proposes to spread the B and C volumes across different ranks at the remote
storage server so that the FlashCopy target is on a different rank than the FlashCopy source.
Figure 18-14 Remote storage server configuration: All ranks contain equal numbers of volumes
The goal is to put the same number of each volume type into each rank. The volume types
that we describe refer to B volumes and C volumes within a Global Mirror configuration. To
avoid performance bottlenecks, spread busy volumes over multiple ranks. Otherwise, hot
spots can be concentrated on single ranks when you put the B and C volumes on the same
rank. We suggest that you spread B and C volumes as Figure 18-14 suggests.
With mixed DDM capacities and different speeds at the remote storage server, consider
spreading B volumes over the fast DDMs and over all ranks. Basically, follow a similar
approach as Figure 18-14 suggests. You might keep busy B volumes and C volumes on the
faster DDMs.
If the DDMs used at the remote site are double the capacity but the same speed as those
DDMs used at the production site, an equal number of ranks can be formed. In a failover
situation when the B volume is used for production, it provides the same performance as the
production site, because the C volume is not then in use.
Important: Keep the FlashCopy target C volume on the same processor complex as the
FlashCopy source B volume.
Figure 18-15 Remote storage server configuration with additional D volumes on a separate rank for testing
Figure 18-15 shows the three Global Mirror volumes and the addition of D volumes that you can create for test purposes. As an alternative, we suggest placing the D volumes on a rank with larger and slower DDMs. The D volumes can be read from another host, and any other I/O to the D volumes does not affect the Global Mirror volumes in the other ranks. With a nocopy relationship between the B and D volumes, reads that come through the D volume are serviced from the B volume. So, you might consider a physical copy when you create D volumes on a different rank, which separates additional I/O to the D volumes from I/O to the ranks with the B volumes.
If you plan to use the D volumes as the production volumes at the remote site in a failover
situation, the D volume ranks must be configured in the same way as the A volume ranks and
use identical DDMs. You must make a full copy to the D volume for both testing and failover.
When using Tivoli Storage Productivity Center for Replication, the Copy Sets for Global Mirror
Failover/Failback w/ Practice are defined in this way. The Tivoli Storage Productivity Center for
Replication volume definitions are listed:
A volume defined as H1 volume (Host site 1)
B volume defined as I2 volume (Intermediate site 2)
C volume defined as J2 volume (Journal site 2)
D volume defined as H2 volume (Host site 2)
Figure 18-16 Remote disk subsystem with space-efficient FlashCopy target volumes
FlashCopy SE is optimized for use cases where less than 20% of the source volume is
updated during the life of the relationship. In most cases, Global Mirror is configured to
schedule consistency group creation at an interval of a few seconds, which means that a
small amount of data is copied to the FlashCopy targets. From this point of view, Global Mirror
is a suggested area of application for FlashCopy SE.
In contrast, Standard FlashCopy generally has superior performance to FlashCopy SE. The performance of the FlashCopy SE repository is critical. When provisioning a repository, storage pool striping is automatically used with a multi-rank extent pool to balance the load across the available disks. In general, we suggest a minimum of four RAID arrays in the
extent pool. Depending on the logical configuration of the DS8000, you might also consider
the use of multiple space-efficient repositories for the FlashCopy target volume in a Global
Mirror environment, at least one on each processor complex. The repository extent pool can
also contain additional non-repository volumes.
Contention can arise if the extent pool is shared. After the repository is defined, you cannot expand it, so it is important to plan carefully to ensure that it is large enough. If the repository fills, the FlashCopy SE relationship fails, and Global Mirror is not able to successfully create consistency groups.
When many volumes are used with Global Mirror, it is important that you configure sufficient cache memory to provide for the best possible overall function and performance. The following table lists the suggested maximum number of Global Mirror volume relationships for each installed cache size:

Cache size    Suggested maximum Global Mirror volume relationships
16 GB         4500
32 GB         4500
64 GB         9000
128 GB        18000
256 GB        32640
384 GB        32640
Important: The suggestions are solely based on the number of Global Mirror volume
relationships; the capacity of the volumes is irrelevant. One way to avoid exceeding these
suggestions is to use fewer, larger volumes with Global Mirror.
With a maximum number of 65280 volumes on a single DS8000 storage system, it is not
possible to have more than 32640 Global Mirror secondary devices (B volumes) on the
secondary system due to the requirement for the additional FlashCopy journal devices
(C volumes). If you also keep an extra set of FlashCopy volumes (D volumes) for testing purposes, this limit is 21760.
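The following small Python sketch combines the cache-based suggestions from the table with the volume-count limits described above into a simple planning check. The helper function and the example values are illustrative only, not an official sizing tool.

```python
# Suggested maximum Global Mirror volume relationships per installed cache size,
# taken from the table above, plus the volume limits quoted above.
SUGGESTED_MAX_BY_CACHE_GB = {16: 4500, 32: 4500, 64: 9000,
                             128: 18000, 256: 32640, 384: 32640}

MAX_VOLUMES_PER_DS8000 = 65280
MAX_B_VOLUMES = MAX_VOLUMES_PER_DS8000 // 2          # B + C volumes -> 32640
MAX_B_VOLUMES_WITH_D = MAX_VOLUMES_PER_DS8000 // 3   # B + C + D volumes -> 21760

def check_global_mirror_plan(cache_gb, planned_relationships, test_d_volumes=False):
    """Illustrative check of a planned Global Mirror configuration; not a sizing tool."""
    suggested = SUGGESTED_MAX_BY_CACHE_GB[cache_gb]
    hard_limit = MAX_B_VOLUMES_WITH_D if test_d_volumes else MAX_B_VOLUMES
    if planned_relationships > hard_limit:
        return "exceeds the secondary volume limit"
    if planned_relationships > suggested:
        return "exceeds the cache-based suggestion; consider fewer, larger volumes"
    return "within the suggestions"

print(check_global_mirror_plan(128, 20000))         # exceeds the cache-based suggestion
print(check_global_mirror_plan(256, 20000, True))   # within the suggestions
```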
Volumes can be added to a session in any state, for example, simplex or pending. Volumes
that have not completed their initial copy phase stay in a join pending state until the first initial
copy is complete. If a volume in a session is suspended, it causes consistency group
formation to fail.
We suggest that you add only Global Copy source volumes that completed their initial copy or
first pass, although the microcode stops volumes from joining the Global Mirror session until
the first pass is complete. Also, we suggest that you wait until the initial copy is complete
before you create the FlashCopy relationship between the B and the C volumes.
Important: You cannot add a Metro Mirror source volume to a Global Mirror session.
Global Mirror supports only Global Copy pairs. When Global Mirror detects a volume that,
for example, is converted from Global Copy to Metro Mirror, the following formation of a
consistency group fails.
When you add a large number of volumes at one time to an existing Global Mirror session, the initial copy pass can consume the available Global Copy resources within the affected ranks. To minimize the impact to the production servers when you add many volumes, consider adding the volumes to the session in stages.
Suspending a Global Copy pair that belongs to an active Global Mirror session affects the
formation of consistency groups. When you intend to remove Global Copy volumes from an
active Global Mirror session, follow these steps:
1. Remove the desired volumes from the Global Mirror session.
2. Withdraw the FlashCopy relationship between the B and C volumes.
3. Terminate the Global Copy pair to bring volume A and volume B into simplex mode.
Important: When you remove A volumes without pausing Global Mirror, you might see this
situation reflected as an error condition with the showgmir -metrics command, indicating
that the consistency group formation failed. However, this error condition does not mean
that you lost a consistent copy at the remote site, because Global Mirror does not take the
FlashCopy (B to C) for the failed consistency group data. This message indicates that just
one consistency group formation failed, and Global Mirror retries the sequence.
When you add an LSS to an active session and this LSS belongs to a storage disk subsystem
that already has another LSS that belongs to this Global Mirror session, you can add the LSS
to the session without stopping and starting the session again. This situation is true for either
the master or for a subordinate storage disk subsystem.
If you use Tivoli Storage Productivity Center (TPC) for Replication, the new subsystem can be
added by using the GUI or Copy Services Manager CLI. The paths must then be added for
the new LSS pairs. Tivoli Storage Productivity Center for Replication adds only one path if the
paths are not already defined. The copy sets can then be added to the session after the new
subsystem is recognized by Tivoli Storage Productivity Center for Replication.
Important: When using Tivoli Storage Productivity Center for Replication to manage Copy
Services, do not use the DSCLI to make any configuration changes. Make changes only
with the Copy Services Manager CLI (CSM CLI) or the Tivoli Storage Productivity Center
for Replication GUI.
For a schematic overview of z/OS Global Mirror processing, see Figure 18-17, which
illustrates a simplified view of the z/OS Global Mirror components and the data flow logic.
When a z/OS Global Mirror pair is established, the host system DFSMSdfp software starts to
time stamp all later write I/Os to the primary volumes, which provides the basis for managing
data consistency across multiple logical control units (LCUs). If these primary volumes are
shared by systems running on different CECs, an IBM Sysplex Timer® is required to provide
a common time reference for these timestamps. If all the primary systems are running in
different logical partitions (LPARs) within the same CEC, the system time-of-day clock can be
used.
z/OS Global Mirror is implemented in a cooperative way between the DS8000s on the primary
site and the DFSMSdfp host system software component System Data Mover (SDM).
For a complete and detailed description of all z/OS Global Mirror performance and tuning
options, see “z/OS Global Mirror performance options” in DS8000 Copy Services for IBM
System z, SG24-6787. In the following sections, we present considerations for z/OS Global
Mirror.
You have two options, depending on your available space and your configuration:
Dedicate ranks or even disk storage systems to the Journals to avoid any interference
from other workloads with the Journal I/O.
Share resources between both the Journal volumes and the secondary target volumes.
Spread the updates to secondary target volumes and the updates to the Journal volumes
across the maximum available resources in an ordered and balanced configuration to
balance out the workload and avoid any potential hot spots.
If the control datasets are allocated on the secondary (or target) disk subsystem, you need to
consider the impact of the I/O activity from the mirrored volumes on the same disk
subsystem. Experience shows that placing the Journal datasets over many ranks and sharing
those ranks with secondary targets works well. Also, placing the Control and State datasets
with the secondary targets works well. Sharing the resources is the most common approach among IBM z/OS Global Mirror clients today. We strongly suggest that you monitor the
performance of all volumes to ensure that the environment is healthy.
z/OS Global Mirror provides the flexibility of allowing System Data Mover (SDM) operations to
be tailored to installation requirements and also supports the modification of key parameters,
either from the PARMLIB dataset or through the XSET command.
For a detailed description of z/OS Global Mirror tuning parameters, see “z/OS Global Mirror
tuning parameters” in DS8000 Copy Services for IBM System z, SG24-6787.
With the zGM single reader implementation, you must carefully plan to balance the primary
volumes’ update rates for all zGM volumes in an LSS against the SDM update drain rate,
because all updates for a physical session on an LSS are read by the SDM through a single
SDM reader. If the updates occur at a faster rate than the rate at which the SDM can offload
those updates, record sets accumulate in the cache. When the cache fills up, the storage
subsystem coupled with the SDM begins to execute the algorithms to start pacing or
device-level blocking. Pacing or device-level blocking affects the performance of the host
application. If the effect of pacing and device-level blocking is insufficient, zGM eventually
suspends.
When you set up a zGM configuration during the bandwidth study, the MBps update rate for
each volume is determined and the volumes are placed in an LSS based on their update
rates and the associated SDM reader offload rate. Sometimes, more than one physical zGM
session is required to be able to drain the updates for all the volumes that reside in an LSS. In
this case, SDM must manage multiple physical sessions for the LSS.
With zGM multiple reader support, the SDM can now drain the record set updates off using
multiple fixed utility base addresses or a single base address with aliases assigned to it.
Through the single physical zGM session on an LSS, multiple reader paths can be used by
the SDM to drain all the updates, which can reduce the number of base addresses required
per LSS for zGM fixed utility addresses and can be even more dynamic in nature when
HyperPAV is used. This support enables the SDM to offload the record updates on an LSS
through multiple paths against the same sidefile while maintaining the record set Time
Sequenced Order. zGM multiple reader support permits the SDM to balance the updates
across multiple readers, enabling simpler planning for zGM.
SDM can manage a combination of physical sessions with single or multiple readers
depending whether the multiple reader support is installed and active or not active for each
subsystem involved in the zGM logical session. The DS8000 function is called z/OS Global
Mirror Multiple Reader. Multiple reader support can also help to simplify the move to larger
devices. And, it can reduce the sensitivity of zGM in draining updates as workload
characteristics change or capacity growth occurs. Less manual effort is required to manage
the SDM offload process.
For more information about multiple reader, see “Multiple Reader (enhanced readers)” in
DS8000 Copy Services for IBM System z, SG24-6787.
Figure 18-18 on page 623 shows the performance improvement when running a workload
that is performing 4 KB sequential writes to a single volume. At a 3200 km (1988.3 miles) distance, the multiple readers provide a dramatic throughput improvement over a single reader.
Figure 18-19 shows the comparison when running a 27 KB sequential write workload to a
single volume. We compare the MB per second throughput. Even though the improvement is
not as dramatic as on the 4 KB sequential write workload, we still see that the multiple reader
provides better performance compared to the single reader.
(Figure content: Metro Mirror combined with Global Mirror. Normal application I/Os continue against the A volumes; Metro Mirror runs synchronously over a short distance from A to B; Global Mirror runs asynchronously over a long distance from B to C, with an incremental NOCOPY FlashCopy to the D journal volumes.)
IBM offers services and solutions for the automation and management of the Metro Mirror
environment, which include GDPS for System z and Tivoli Storage Productivity Center for
Replication. You can obtain more details about GDPS at the following website:
http://www.ibm.com/systems/z/advantages/gdps
In a normal configuration, the synchronous copy is from A to B with the Global Mirror to C. If
site B is lost, the links must already be in place from A to C to maintain the Global Mirror
function. This link must provide the same bandwidth as the B to C link.
(Figure 18-22: Metro Mirror combined with z/OS Global Mirror across three DS8000s. The primary volumes (P) are mirrored synchronously with Metro Mirror to the secondary (P') and asynchronously with z/OS Global Mirror to the remote secondary (X); FlashCopy copies (X', X'') are taken when required.)
In the example that is shown in Figure 18-22, the System z environment in the Local Site is
normally accessing the DS8000 disk in the Local Site. These disks are mirrored back to the
Intermediate Site with Metro Mirror to another DS8000. At the same time, the Local Site disk
has z/OS Global Mirror pairs established to the Remote Site to another DS8000, which can
be at continental distances from the Local Site.
In a remote replication setup, such as Metro Mirror, the workload differs considerably between
the primary and secondary system during normal replication. Because Easy Tier monitors the
reads and writes of the production workload on the primary storage system but only the write
activity on the secondary storage system (because no reads occur there), it is likely that the
extent distribution and performance optimization achieved by Easy Tier differ between the
primary and secondary storage system.
At the time of writing this book, the learning and optimization done on the primary DS8000
system is not sent to the secondary DS8000 system to provide the same extent distribution
and optimization across the ranks on both systems. In a disaster situation with a failover of the production workload to the secondary system, Easy Tier therefore needs time to learn the workload pattern and optimize the extent distribution on that system.
However, we suggest that you maintain a symmetrical configuration in both physical and
logical elements between primary and secondary storage systems in a Metro Mirror
relationship when you use Easy Tier automatic mode. This approach ensures that the same
level of optimization and performance can be achieved on the secondary system after the
production workload is switched to the secondary site.
Important: In a three-tier extent pool configuration, the cross-tier extent migration occurs
only between two adjacent tiers. After a failover from the primary to a secondary, some
extents that are considered hot and allocated in the Nearline tier might need more than two
days to be migrated to a solid-state drive (SSD) tier.
With Global Mirror and Easy Tier automatic mode controlling the extent distribution in
managed extent pools, consider allocating the B and C volumes on the recovery site in the
same multi-rank extent pool. Easy Tier spreads busy extents across ranks and optimizes
overall extent distribution across ranks and tiers in the extent pool based on the workload
profile. The workload is balanced across the resources in the extent pool and performance
bottlenecks can be avoided, which provides optimal performance. However, you can still separate workloads into different extent pools by following the principles of workload isolation as described in 4.8, “Planning extent pools” on page 115, or by using a manual approach as described in “Remote DS8000 configuration” on page 614 without Easy Tier automatic mode.
Part 5 Appendixes
This part includes the following topics:
Performance management process
Benchmarking
Planning and documenting your logical configuration
Microsoft Windows server performance log collection
This power is the potential of the DS8000 but careful planning and management are essential
to realize that potential in a complex IT environment. Even a well-configured system is subject
to the following changes over time that affect performance:
Additional host systems
Increasing workload
Additional users
Additional DS8000 capacity
A typical case
To demonstrate the performance management process, we look at a typical situation where
DS8000 performance is an issue.
Users begin to open incident tickets to the IT Help Desk claiming that the system is slow and
therefore is delaying the processing of orders from their clients and the submission of
invoices. IT Support investigates and detects that there is contention in I/O to the host
systems. The Performance and Capacity team is involved and analyzes performance reports
together with the IT Support teams. Each IT Support team (operating system, storage,
database, and application) issues its report defining the actions necessary to resolve the
problem. Certain actions might have a marginal effect but are faster to implement; other
actions might be more effective but need more time and resources to put in place. Among the
actions, the Storage Team and Performance and Capacity Team report that additional storage
capacity is required to support the I/O workload of the application and ultimately to resolve
the problem. IT Support presents its findings and recommendations to the company’s
Business Unit, requesting application downtime to implement the changes that can be made
immediately. The Business Unit accepts the report but says that it has no money for the
purchase of new storage. They ask the IT department how they can ensure that the additional
storage can resolve the performance issue. Additionally, the Business Unit asks the IT
department why the need for additional storage capacity was not submitted as a draft
proposal three months ago when the budget was finalized for next year, knowing that the
system is one of the most critical systems of the company.
Incidents, such as this one, make us realize the distance that can exist between the IT
department and the company’s business strategy. In many cases, the IT department plays a
key role in determining the company’s strategy. Therefore, consider these questions:
How can we avoid situations like those just described?
How can we make performance management become more proactive and less reactive?
What are best practices for performance management?
What are the key performance indicators of the IT infrastructure and what do they mean
from the business perspective?
Are the defined performance thresholds adequate?
How can we identify the risks in managing the performance of assets (servers, storage
systems, and applications) and mitigate them?
To better align the understanding between the business and the technology, we use as a
guide the Information Technology Infrastructure Library (ITIL) to develop a process for
performance management as applied to DS8000 performance and tuning.
Purpose
The purpose of performance management is to ensure that the performance of the IT
infrastructure matches the demands of the business. The following activities are involved:
Define and review performance baselines and thresholds
Collect performance data from the DS8000
Check whether the performance of the resources is within the defined thresholds
Analyze performance using collected DS8000 performance data and tuning suggestions
Define and review standards and IT architecture related to performance
Analyze performance trends
Size new storage capacity requirements
Certain activities relate to the operational activities, such as the analysis of performance of
DS8000 components, and other activities relate to tactical activities, such as the performance
analysis and tuning. Other activities relate to strategic activities, such as storage capacity
sizing. We can split the process into three subprocesses:
Operational performance subprocess
Analyze the performance of DS8000 components (processor complexes, device adapters
(DAs), host adapters (HAs), and ranks) and ensure that they are within the defined
thresholds and service-level objectives (SLOs) and service-level agreements (SLAs).
Tactical performance subprocess
Analyze performance data and generate reports for tuning recommendations and the
review of baselines and performance trends.
Strategic performance subprocess
Analyze performance data and generate reports for storage sizing and the review of
standards and architectures that relate to performance.
When assigning the tasks, you can use a Responsible, Accountable, Consulted, and
Informed (RACI) matrix to list the actors and the roles that are necessary to define a process
or subprocess. A RACI diagram, or RACI matrix, is used to describe the roles and
responsibilities of various teams or people to deliver a project or perform an operation. It is
useful in clarifying roles and responsibilities in cross-functional and cross-departmental
projects and processes.
With Tivoli Storage Productivity Center, you can set performance thresholds for two major
categories:
Status change alerts
Configuration change alerts
You might also need to compare the DS8000 performance with the users’ performance
requirements. Often, these requirements are explicitly defined in formal agreements between
IT management and user management. These agreements are referred to as service-level
agreements (SLA) or service-level objectives (SLO). These agreements provide a framework
for measuring IT resource performance requirements against IT resource fulfillment.
Performance SLA
A performance SLA is a formal agreement between IT Management and User
representatives concerning the performance of the IT resources. Often, these SLAs provide
goals for end-to-end transaction response times. For storage, these types of goals typically
relate to average disk response times for different types of storage. Missing the technical
goals described in the SLA results in financial penalties to the IT service management
providers.
Performance SLO
Performance SLOs are similar to SLAs, with the exception that misses do not carry financial penalties. Even so, SLO misses are a breach of contract in many cases and can lead to serious consequences if not remedied.
Having reports that show you how many alerts and how many misses in SLOs/SLAs occurred
over time is important. The reports tell how effective your storage strategy is (standards,
architectures, and policy allocation) in the steady state. In fact, the numbers in those reports
are inversely proportional to the effectiveness of your storage strategy. The more effective
your storage strategy, the fewer performance threshold alerts are registered and the fewer
SLO/SLA targets are missed.
It is not necessary to implement SLOs or SLAs for you to discover the effectiveness of your
current storage strategy. The definition of SLO/SLA requires a deep and clear understanding
of your storage strategy and how well your DS8000 is running. That is why, before
implementing this process, we suggest that you start with the tactical performance process:
Generate the performance reports
Define tuning suggestions
Review the baseline after implementing tuning recommendations
Generate performance trends reports
Then, redefine the thresholds with fresh performance numbers. If you do not redefine the thresholds with fresh performance numbers, you spend your time dealing with false-positive performance incident tickets instead of analyzing the performance and suggesting tuning for your DS8000. Let us look at the characteristics of this process.
Inputs
The following inputs are necessary to make this process effective:
Performance trends reports of DS8000 components: Many people ask for the IBM
recommended thresholds. In our opinion, the best recommended thresholds are those
thresholds that fit your environment. The best thresholds depend on the configuration of your DS8000 and the characteristics of your workload.
Important: When defining a DS8000 related SLA or SLO, ensure that the goals are based
on empirical evidence of performance within the environment. Application architects with
applications that are highly sensitive to changes in I/O throughput or response time need to
consider the measurement of percentiles or standard deviations as opposed to average
values over an extended period. IT management must ensure that the technical
requirements are appropriate for the technology.
In cases where contractual penalties are associated with production performance SLA or
SLO misses, be careful in the management and implementation of the DS8000. Even in the
cases where no SLA or SLO exists, users have performance expectations that are not
formally communicated. In these cases, they let IT management know when the performance
of the IT resources is not meeting their expectations. Unfortunately, by the time they
communicate their missed expectations, they are often frustrated, and their ability to manage
their business is severely affected by performance issues.
Although there might not be any immediate financial penalties associated with missed user
expectations, prolonged negative experiences with underperforming IT resources result in low
user satisfaction.
Figure A-1 is an example of a RACI matrix for the operational performance subprocess, with
all the tasks, actors, and roles identified and defined:
Provide performance trends report: This report is an important input for the operational
performance subprocess. With this data, you can identify and define the thresholds that
best fit your DS8000. Consider how the workload is distributed among the internal components of the DS8000: HAs, processor complexes, DAs, and ranks. This analysis avoids the definition of thresholds that generate false-positive performance alerts and ensures that you monitor only what is relevant to your environment.
Define the thresholds to be monitored and their respective values, severity, queue to open the ticket, and additional instructions: In this task, using the baseline performance report, you can identify and set the relevant threshold values. You can use Tivoli Storage Productivity Center to create alerts when these thresholds are exceeded. For example, you can configure Tivoli Storage Productivity Center to send the alerts through Simple Network Management Protocol (SNMP) traps to Tivoli Enterprise Console (TEC) or through email. However, the opening of an incident ticket needs to be performed by the Monitoring team, which needs to know the severity to set, on which queue to open the ticket, and any additional instructions to follow.
Implement performance monitoring and alerting: After you define the DS8000 components to monitor, set their corresponding threshold values; a simple sketch of such a threshold check follows this list. For detailed information about how to configure Tivoli Storage Productivity Center, see the IBM Tivoli Storage Productivity Center documentation:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/index.jsp
Publish the documentation to the IT Management team: After you implement the
monitoring, send the respective documentation to those people who need to know.
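As a simple illustration of the monitoring step referenced in the list above, the following sketch checks collected component metrics against locally defined thresholds and emits alert records for the Monitoring team. The metric names and threshold values are placeholders, not IBM-recommended values; as noted earlier, the best thresholds are the ones that fit your environment.

```python
# Illustrative threshold check for DS8000 component metrics. The metric names and
# threshold values are placeholders; derive real thresholds from your own baseline
# and trend reports, not from this example.
thresholds = {
    "rank_utilization_pct": 70,
    "da_utilization_pct": 60,
    "ha_utilization_pct": 60,
}

def check_thresholds(samples, thresholds):
    """Return alert records for every sample that exceeds its threshold."""
    alerts = []
    for component, metric, value in samples:
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append({"component": component, "metric": metric,
                           "value": value, "threshold": limit,
                           "severity": "warning"})
    return alerts

samples = [("R23", "rank_utilization_pct", 82), ("DA2", "da_utilization_pct", 41)]
for alert in check_thresholds(samples, thresholds):
    print(alert)    # hand these records to the Monitoring team / ticketing queue
```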
Performance troubleshooting
If an incident ticket is open for performance issues, you might be asked to investigate. The
following tips can help during your problem determination.
Tip: We suggest the tactical performance subprocess as the starting point for the
implementation of a performance management process.
Outputs
Performance reports with tuning recommendations and performance trends reports are the
outputs that are generated by this process.
Just keeping the IT systems up and running is not enough. The IT Manager and Chief
Information Officer (CIO) need to show business benefit for the company. Usually, this benefit
means providing the service at the lowest cost but also showing a financial advantage that the
services provide. This benefit is how the IT industry grew over the years while it increased
productivity, reduced costs, and enabled new opportunities.
You need to check with your IT Manager or Architect to learn when the budget is set and start
three to four months before this date. You can then define the priorities for the IT infrastructure
for the coming year to meet the business requirements.
Inputs
The following inputs are required to make this process effective:
Performance reports with tuning recommendations
Performance trends reports
Outputs
The following outputs are generated by this process:
Standards and architectures: Documents that specify:
– Naming convention for the DS8000 components: ranks, extent pools, volume groups,
host connections, and LUNs.
– Rules to format and configure the DS8000: Arrays, RAID, ranks, extent pools, volume
groups, host connections, logical subsystems (LSSs), and LUNs.
– Policy allocation: When to pool the applications or host systems on the same set of
ranks. When to segment the applications or host systems in different ranks. Which
type of workload must use RAID 5, RAID 6, or RAID 10? Which type of workload must
use solid-state drives (SSDs) or DDMs of 146 GB/15K rpm, 450 GB/10K rpm, or
900 GB/10K rpm?
Sizing of new or existing DS8000: According to the business demands, what are the
recommended capacity, cache, and host ports for a new or existing DS8000?
Plan configuration of new DS8000: What is the planned configuration of the new DS8000
based on your standards and architecture and according to the workload of the systems
that will be deployed?
Figure A-4 is an example of the RACI matrix for the strategic performance subprocess with all
the tasks, actors, and roles identified and defined:
Define priorities of new investments: In defining the priorities of where to invest, you must
consider these four objectives:
– Reduce cost: The simplest example is storage consolidation. There might be several
storage systems in your data center that are nearing the ends of their useful lives. The
costs of maintenance are increasing, and the storage subsystems use more energy
than new models. The IT Architect can create a case for storage consolidation but
needs your help to specify and size the new storage.
– Increase availability: There are production systems that need to be available 24x7. The
IT Architect needs to submit a new solution for this case to provide data mirroring. The
IT Architect requires your help to specify the new storage for the secondary site and to
provide figures for the necessary performance.
– Mitigate risks: Consider a case where a system is running on an old storage model
without a support contract from the vendor. That system started as a pilot with no
importance. Over time, that system presented great performance and is now a key
application for the company. The IT Architect needs to submit a proposal to migrate to
a new storage system. Again, the IT Architect needs your help to specify the new
storage requirements.
– Business units’ demands: Depending on the target results that each business unit must
meet, the business units might require additional IT resources. The IT Architect
requires information about the additional capacity that is required.
Define and review standards and architectures: After you define the priorities, you might
need to review the standards and architecture. New technologies appear so you might
need to specify new standards for new storage models. Or maybe, after a period analyzing
the performance of your DS8000, you discover that for a certain workload, you might need
to change a standard.
Size new or existing DS8000: Modeling tools, such as Disk Magic, which is described in
6.1.6, “Disk Magic modeling” on page 179, can gather multiple workload profiles based on
host performance data into one model and provide a method to assess the impact of one
or more changes to the I/O workload or DS8000 configuration.
Plan configuration of new DS8000: Configuring the DS8000 to meet the specific I/O
performance requirements of an application reduces the probability of production
performance issues. To produce a design to meet these requirements, Storage
Management needs to know:
– I/Os per second
– Read to write ratios
– I/O transfer size
– Access type: Sequential or random
For help in translating application profiles to I/O workload, see Chapter 5, “Understanding
your workload” on page 157.
After the I/O requirements are identified, documented, and agreed upon, the DS8000
layout and logical planning can begin. For additional detail and considerations for planning
for performance, see Chapter 4, “Logical configuration performance considerations” on
page 87.
For existing applications, you can use Disk Magic to analyze an application I/O profile. Details about Disk Magic are in Chapter 6, “Performance planning tools” on page 175.
Appendix B. Benchmarking
Benchmarking storage systems is complex due to all of the hardware and software that are
used for storage systems. In this appendix, we discuss the goals and the ways to conduct an
effective storage benchmark.
To conduct a benchmark, you need a solid understanding of all of the parts of your
environment. This understanding includes the storage system requirements and also the
storage area network (SAN) infrastructure, the server environments, and the applications.
Emulating the actual environment, including actual applications and data, along with user
simulation, provides efficient and accurate analysis of the performance of the storage system
tested. The characteristic of a performance benchmark test is that results must be
reproducible to validate the integrity of the test.
Performance is not the only component to consider in benchmark results. Reliability and
cost-effectiveness must be considered. Balancing benchmark performance results with
reliability, functionalities, and TCO of the storage system provides a global view of the storage
product value.
The popularity of these benchmarks depends on how meaningful the workload is compared
to the main and new workloads that companies deploy today. If the generic benchmark
workloads are representative of your production, you can use the different benchmark results
to identify the product you implement in your production environment. But, if the generic
benchmark definition is not representative or does not include your requirements or
restrictions, running a dedicated benchmark designed to be representative of your workload
provides information to help you choose the right storage system.
The OLTP category typically has many users, who all access the same disk storage
subsystem and a common set of files. The requests are typically spread across many files;
therefore, the file sizes are typically small and randomly accessed. Typical applications
consist of a network file server or disk subsystem that is accessed by a sales department that
enters order information.
To identify the specificity of your production workload, you can use monitoring tools that are
available at the operating system level.
The first way to generate the workload, which is the most complex, is to set up the production
environment, including the applications software and the application data. In this case, you
must ensure that the application is well-configured and optimized on the server operating
system. The data volume also must be representative of the production environment.
Depending on your application, workload can be generated by using application scripts or an
external transaction simulation tool. These tools provide a simulation of users accessing your
application. You use workload tools to provide application stress from end-to-end. To
configure an external simulation tool, you first record a standard request from a single user
and then generate this request several times. This process can provide an emulation of
hundreds or thousands of concurrent users to put the application through the rigors of real-life
user loads and measure the response times of key business processes. Examples of
available software include IBM Rational® software and Mercury LoadRunner™.
Important: Each workload test must be defined with a minimum time duration in order to
eliminate any side effects or warm-up period, such as populating cache, which can
generate incorrect results.
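A minimal sketch of how a test harness might honor this rule by discarding the warm-up portion of each run and repeating each scenario several times follows. The workload callable and the durations are placeholders.

```python
import time

def run_scenario(run_workload_once, duration_s=600, warmup_s=120):
    """Run a workload for a fixed duration and keep only the response times
    recorded after the warm-up period (for example, while cache is populated).
    run_workload_once is a placeholder callable that issues one request and
    returns its response time in seconds."""
    samples = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        elapsed = time.monotonic() - start
        response_time = run_workload_once()
        if elapsed >= warmup_s:          # discard warm-up samples
            samples.append(response_time)
    return samples

# Repeat each scenario several times so that results are reproducible and
# comparable, for example:
# averages = [sum(s) / len(s) for s in (run_scenario(my_workload) for _ in range(3))]
```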
Monitoring can have an impact on component performance. In that case, implement the monitoring tools in the first sequence of tests to understand your workload, and then disable them to eliminate any impact that can distort performance results.
During a benchmark, each scenario must be run several times to understand how the
different components perform by using monitoring tools, to identify bottlenecks, and then, to
test different ways to get an overall performance improvement by tuning each component.
We start with an example from the Capacity Magic tool to describe how we plan the extent pools and divide the machine between mainframe and Open Systems pools. We provide examples of how to structure the multi-tier Open Systems pools, either as large 3-tier pools or with more isolation into a higher-performance tier and a lower-performance tier that are separate from each other.
We also discuss the choice of the performance tools to track the configuration, and we
present in more detail the DS8000 Query and Reporting tool (DSQTOOL).
A general idea is to evenly distribute the data across as many hardware resources as
possible without oversaturating those hardware resources through an imbalanced
configuration. Easy Tier performs a major part of this balancing automatically for you. Still,
you might need to review your system and your choices after the sizing is complete and the
machine is available.
You must size for both performance and capacity needs. Most often, both the System z load and the Open Systems load are placed on the new box, because clients often consolidate several older storage systems into one new storage system. You must plan the extent pools separately for System z and Open Systems, so for both environments, you need to consider SSD, Enterprise, and nearline ranks.
Figure C-1 shows a Capacity Magic view of a 2-frame 4-way DS8800 that contains a mix of SSD, Enterprise, and nearline drives, for both count key data (CKD) and Open Systems.
Capacity Magic is always a good start for any planning, because it shows in detail the number
of spares used on the various RAID arrays, the net capacities that we can expect, and which
DA pair is used for each rank.
This view can be transferred and brought into a Lotus® Symphony™ or MS Excel
spreadsheet, as shown in Figure C-2.
This machine has a total net capacity of 188 TB, as shown by Capacity Magic. The detailed
report folder shows us that this capacity is split into 35 TB net for mainframe and 153 TB for
Open Systems.
Of the four SSD ranks, we use two ranks for the mainframe, balancing between the two
processor complexes, and we make two 2-tier pools for CKD. Each pool has one 300 GB SSD
rank, and six 450 GB/10K ranks. The SSD capacity ratio is about 10%. Our earlier sizing (an assumption in this case) showed that this SSD ratio can be combined with this 450 GB/10K drive type and total capacity. Of the 14 mainframe ranks, each extent pool gets seven ranks.
For Open Systems (shown with the red “O” in the corner of each array in Figure C-1 on page 652), of the 153 TB net, we see in the Capacity Magic detail report that 83 TB is nearline storage (7,200 rpm, RAID 6) and 70 TB consists of a mix of about 5% SSDs and about 95% Enterprise drives, which in this context are 300 GB/15K drives, in RAID 5. Again, this scenario assumes that our sizing showed that this ratio is an adequate ratio of SSDs to combine with the 300 GB/15K drives.
In total, there are 44 ranks for Open Systems. We have several options to create extent pools
from these ranks.
One option is to create two large Open Systems extent pools. Each extent pool contains 22
ranks (one SSD rank, 18 Enterprise ranks, and three nearline ranks). This design provides
two 3-tier extent pools of about 77 TBs each. Easy Tier has the maximum flexibility to
promote and demote extents between the three tiers in each pool.
Using a spreadsheet, this configuration looks like Figure C-3 on page 654.
Another option is to leave the nearline space isolated. With 83 TB net, the nearline space is
larger than the combined SSD+ENT space of 70 TB for Open Systems, so we use the
nearline space as backup space. For this setup, we create four Open Systems extent pools (two pairs): one extent pool pair with 19 ranks in each pool (1 SSD rank and 18 Enterprise ranks), and one extent pool pair with three nearline ranks in each pool. The cross-tier Easy Tier is
only active in the SSD+ENT “production” pool pair, but at the same time, the intra-tier Easy
Tier auto-rebalancing is active in the nearline pools.
The third option is to build two 2-tier pool pairs. One tier is the “Gold” tier, and one tier is the
“Silver” tier. Only the Gold tier contains SSDs. All applications that we place in the Silver tier
can only be on either Enterprise or nearline ranks. Assume that from our 66 TB net in
300 GB/15K drives, we split them into 22 TB together with the SSD space, which creates a
Gold tier of around 25 TB net. The remaining 44 TB of these drives are with the nearline
drives. Then, the result is four open extent pools/two pairs:
One Gold pair: Each with 1 SSD rank plus 6 Enterprise ranks
One Silver pair: Each with 12 Enterprise ranks and 3 nearline ranks
Figure C-4 Gold and Silver configuration when using the spreadsheet
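To summarize the three Open Systems layout options numerically, the following sketch reuses the capacities quoted above. Treat it as illustrative arithmetic only, not a replacement for Capacity Magic or a proper sizing study; the Silver capacity is derived here, not taken from the text.

```python
# Open Systems rank inventory from the Capacity Magic example above:
# 44 ranks in total (2 SSD, 36 Enterprise, 6 nearline), with about 70 TB net of
# SSD+Enterprise capacity and 83 TB net of nearline capacity.
ssd_ent_tb, nearline_tb, ent_tb = 70, 83, 66

# Option 1: two 3-tier pools, each with 1 SSD + 18 Enterprise + 3 nearline ranks.
tb_per_3tier_pool = (ssd_ent_tb + nearline_tb) / 2            # ~77 TB per pool

# Option 2: a production pool pair (1 SSD + 18 Enterprise ranks per pool) and an
# isolated nearline pool pair (3 nearline ranks per pool) used as backup space.

# Option 3: Gold/Silver split. 22 TB of Enterprise joins the SSD space (Gold);
# the remaining 44 TB of Enterprise joins the nearline drives (Silver).
gold_tb = 22 + (ssd_ent_tb - ent_tb)                          # ~25 TB net
silver_tb = (ent_tb - 22) + nearline_tb                       # ~127 TB net (derived)

print(f"Option 1: two 3-tier pools of ~{tb_per_3tier_pool:.0f} TB each")
print(f"Option 3: Gold ~{gold_tb:.0f} TB, Silver ~{silver_tb:.0f} TB")
```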
With the manual Easy Tier functions, we can move volumes between Silver and Gold. If a
Silver volume does not perform well, we can move it to Gold online. We can also change the
allocated capacities later during operation if the Gold capacity turns out to be too small: if
there are too many ranks in the Silver pool pair, we can unconfigure (depopulate) ranks
online, move them out of the Silver pool pair, and put them into the Gold pool pair so that
more applications can participate in the SSD tiering.
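As a sketch of what such a manual action might look like in the DSCLI (the volume, pool, and rank IDs are placeholders; check the exact options for your DSCLI release):

dscli> managefbvol -action migstart -extpool P4 1100
dscli> chrank -unassign R35
dscli> chrank -extpool P4 R35

The first command starts an online migration of volume 1100 into the Gold pool P4. The chrank -unassign command starts the depopulation of rank R35 in the Silver pool; after the rank is unassigned, it can be added to the Gold pool with chrank -extpool. For CKD volumes, manageckdvol provides the equivalent migration action.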
Also, in FlashCopy layouts, it might be advantageous to separate source ranks and target
ranks into different extent pool pairs. This approach can also lead to creating more than one
extent pool pair.
In Figure C-4, you can see that we achieved load separation by DA pair to a large extent. The
mainframe load uses its own DA pairs. On the Open Systems side, one DA pair (4) is exclusive to
the Gold tier, and three DA pairs (5, 6, and 7) are exclusive to the Silver tier. Only DA pair 2
serves disks from both the Gold and Silver open tiers. Because the Silver tier disks of this pair
are nearline with little load, we can see that DA pair 2 mainly serves the solid-state drives
(SSDs) with their many IOPS. Therefore, we consider that it belongs to the Gold tier.
Performance tools
We described the use of the various performance tools, such as Tivoli Storage Productivity
Center (TPC) for Disk, for performance monitoring and problem analysis in earlier chapters.
Tivoli Storage Productivity Center can also store performance data over a longer time in a
database, so that historical trending is possible.
You can use the TPC Reporter to take regular snapshots of your DS8000, for example,
monthly snapshots. The PDF reports contain the performance data of that month, the
configuration of the various arrays and the drive types for these arrays, their respective
utilizations, and the number of HBAs. The PDF files are easy to store. Use them later as
references to see how loads grow over time and how certain pools slowly become
overloaded, which facilitates upgrade planning.
The Storage Tier Advisor Tool (STAT) provides information about the amount of hot data and
which rank types are best to add next and with what expected results. You can collect the raw
data for the STAT at regular intervals for each machine and store it for reference.
You can use the DS8000 command-line interface (DSCLI) commands to see an overview of
the configuration and performance. Use the lsperfgrprpt command of the I/O Priority
Manager for simple performance tracking. Other DSCLI commands either show the
configuration (the ls commands, including the rank-to-DA mapping when you use lsarray -l)
or provide additional performance counters (the -metrics option of showfbvol and
showrank). The results of DSCLI outputs can be saved at any time for later reference.
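For example, the following DSCLI queries give a quick configuration and performance snapshot that you can save with a time stamp (the rank, volume, and performance group IDs are placeholders for your own configuration):

dscli> lsarray -l
dscli> lsperfgrprpt PG1
dscli> showrank -metrics R0
dscli> showfbvol -metrics 1000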
The DS8000 products provide both DS8000 command-line interface (DSCLI) and GUI (DS
Storage Manager) tools for configuration management. A combination of the tools allows
both novice and highly skilled storage administrators to accomplish the routine tasks involved
in day-to-day storage management and configuration reporting.
To help automate the process, the DSQTOOL was written: a combination of VBScript and
Excel macro programs that interface with the DS8000 Command Line Interface (DSCLI)
utility. The DSCLI output is passed to an Excel macro that generates a summary workbook
with detailed configuration information. The tool is launched from a desktop icon on a
Microsoft Windows system (typically a storage administrator workstation); a VBScript
program is included to create this icon. The DSQTOOL uses the DSCLI list and show
commands to query and report on system configurations. The design point of the programs is
to automate a repeatable process of creating configuration documentation for a specific
DS8000.
You can obtain this tool, along with other tools, such as DS8CAPGEN, from this IBM link:
http://congsa.ibm.com/~dlutz/
You can see two DSQTOOL samples of our machine in Figure C-5 and Figure C-6 on
page 658. Look at the tabs at the bottom. This tool collects a significant amount of DS8000
configuration information to help you track the changes in your machine over time.
You can open the Performance console by clicking Start → Programs → Administrative
Tools → Performance, or by typing perfmon on the command line.
Also, you can start the Performance console from the Microsoft Management Console: click
File → Open, go to the Windows\System32 folder, and select perfmon.msc. See Figure D-1.
On the Performance Logs and Alerts window, which is shown in Figure D-2, you can collect
performance data manually or automatically from local or remote systems. Saved data can be
displayed in the System Monitor or data can be exported to a spreadsheet or database.
Figure D-3 Windows Server 2003 Disk Performance log General tab
6. In the Run As field, enter the account with sufficient rights to collect the information about
the server to be monitored, and then click Set Password to enter the relevant password.
7. In the Log Files tab, shown in Figure D-4 on page 663, you set the type of the saved file,
the suffix that is appended to the file name, and an optional comment. You can use two
types of suffixes in a file name: numbers or dates. The log file types are listed in Table D-1
on page 663. If you click Configure, you can also set the location, file name, and file size
for a log file. The Binary File format takes the least amount of space and is suggested for
most logging.
Text file - TSV: Tab-delimited log file (.tsv extension). Use this format to export the data to a
spreadsheet program.
8. In the Schedule tab that is shown in Figure D-5 on page 664, you specify when this log is
started and stopped. You can select the option box in the start log and stop log sections to
manage this log manually by using the Performance console shortcut menu. You can also
configure the log to start a new log file or to run a command when the current log file closes.
This capability is useful when collecting data for a long period and with many values: you
can set the schedule to each hour and get one file per hour.
9. After clicking OK, the logging starts automatically. If, for any reason, it does not start,
right-click the log settings file in the perfmon window and click Start. See Figure D-6 on
page 665.
10.To stop the counter log, click the Stop the selected log icon on the toolbar, or right-click
the log and click Stop.
This log settings file in HTML format can then be opened with any web browser. You can also
use the pop-up menu to start, stop, and save the logs, as shown in Figure D-6 on page 665.
3. At the System Monitor Properties dialog, select the Data tab. You now see any counters
that you specified when setting up the Counter Log as shown in Figure D-8. If you only
selected counter objects, the Counters section is empty. To add counters from an object,
click Add, and then select the appropriate counters.
Figure D-9 Performance monitor window with the entire period of collected data
Figure D-9 shows the data for the entire period of collection. In this example, the data is
for five days. So, you can see the trend, but not the details.
5. To examine data for a smaller period (a few minutes), you can use the Zoom function of
the Performance Monitor console. See Figure D-10.
Select the interval that you want to examine (see Figure D-10) and click Zoom or press
Ctrl+Z. The result is shown in Figure D-11 on page 668.
To zoom out, move the slider ends to widen the selection, then click Zoom or press Ctrl+Z
again, and the graph zooms out.
6. Highlight the counter that you want to see more clearly than the other counters. Click
Highlight or press Ctrl+H (Figure D-12).
8. Right-click the value and then click Properties to see the graph options for this value.
You can select scaling options. For response times, set the scaling option to 1000 so that
they appear in the graph area. For other values, such as IOPS numbers, you might select
0.0001 or 0.001 to make them fit in the graph area, because IOPS are measured in
thousands and otherwise exceed the graph scale. You can also select the color, width,
and style of the line for each value.
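As an illustration with assumed values: the PhysicalDisk response time counters report seconds, while the IOPS counters report operations per second, so the scale factors bring both into the visible 0 - 100 range of the default graph:

   0.005 s (5 ms read response time) x 1000  = 5
   12,000 IOPS                       x 0.001 = 12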
4. In the Monitor View, click Add, select all the key PhysicalDisk counters for all instances,
and click Add.
5. You see a window similar to Figure D-15.
7. As shown in Figure D-20, you are asked to select the performance counters that you want
to log. Click Add to see a window similar to Figure D-21 on page 674.
8. Select all instances of the disks except the Total, and manually select all the individual
counters identified in Table 10-2 on page 409. Select the computer that you want, as well.
Click OK.
9. The Create new Data Collector Set wizard window opens. In the Sample interval list box,
set how frequently you capture the data. When configuring the interval, specify an interval
that can provide enough granularity so that you can identify the issue. The rules are the
same as for Windows Server 2003 data collection step 5 on page 662. Click Next.
Important: Different spreadsheet software has different capabilities to work with large
amounts of data. When configuring the interval, remember both the problem profile and
the planned collection duration in order not to capture more data than can be
reasonably analyzed. For data collection during several hours, you might want several
files, one file for each hour, rather than one large file for several hours. Data collected
for one hour at a 1-second interval gives you 3600 entries for each value; multiply that
by the number of values collected to get the total number of entries.
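As an illustration with assumed numbers, logging seven counters for 24 disk instances at a 1-second interval produces:

   3600 samples/hour x 7 counters x 24 instances = 604,800 values per hour

which illustrates why hourly files or a coarser interval are preferable for long collections.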
10.You are prompted to enter the location of the log file and the file name as shown in
Figure D-22 on page 675. In this example, the default file name is shown. After entering
the file name and directory, click Finish.
Figure D-23 Windows Server 2008 Performance Monitor with new Data Collector Set
For more information about Windows Server 2008 Performance Monitor, see this link:
http://technet.microsoft.com/en-us/library/cc771692%28WS.10%29.aspx
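If you prefer to script the collection rather than use the wizard, the Windows logman command can create an equivalent counter-based data collector set. The following is only a sketch; the collector name, counter path, sample interval, output format, and output location are example values to adapt:

logman create counter DS8000_Disk -c "\PhysicalDisk(*)\*" -si 00:00:30 -f csv -o C:\PerfLogs\DS8000_Disk
logman start DS8000_Disk
logman stop DS8000_Disk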
By default, all the scripts print to standard output. If you want to place the output in a file,
redirect standard output to a file.
Scripts: The purpose of these scripts is not to provide a toolkit that addresses every
possible configuration scenario; rather, it is to demonstrate several of the available
possibilities.
Dependencies
To execute the scripts described in this section, you need to prepare your system.
Software
You need to install ActivePerl 5.6.x or later. At the time of writing this book, you can download
ActivePerl from the following website:
http://www.activestate.com/activeperl/downloads
Script location
We place the scripts in the same directory as the performance and capacity configuration data.
Script creation
To run these scripts, copy the entire contents of the script into a file and name the file with a
“.pl” suffix.
Sample scripts
We describe the three sample scripts in the sections that follow.
perfmon-essmap.pl
The purpose of this script (Example E-1 on page 679) is to correlate the disk performance
data of a Windows server with Subsystem Device Driver (SDD) datapath query essmap data.
The rank column is invalid for DS8000s with multi-rank extent pools.
Several of the columns are hidden in order to fit the example output into the provided space.
To run the scripts, you need to collect performance and configuration data. This section
describes the logical steps that are required for post-processing the perfmon and essmap
data. The other scripts have a similar logical flow:
1. Collect Windows server performance statistics as described in Appendix D, “Microsoft
Windows server performance log collection” on page 659.
2. Collect Windows server SDD or Subsystem Device Driver Device Specific Module
(SDDDSM) output as described in 10.8.4, “Collecting configuration data” on page 413.
3. Place the output of the performance data and the configuration data in the same directory.
4. Open a shell.
5. Run the script. The script takes two input parameters. The first parameter is the file name
for the Windows perfmon data and the second parameter is the file for the datapath query
essmap output. Use the following syntax:
perfmon-essmap.pl Disk_perf.csv essmap.txt > DiskPerf_Essmap.csv
6. Open the output file (DiskPerf_Essmap.csv) in Excel.
Important: If your file is blank or contains only headers, you need to determine whether
the issue is with the input files or whether you damaged the script when you created it.
For the interpretation and analysis of the Windows data, see 10.8.6, “Analyzing performance
data” on page 419.
########################################################################
# Read in essmap and create hash with hdisk as key and LUN SN as value #
########################################################################
while (<DATAPATH>) {
if (/^$/) { next; }# Skip empty lines
if (/^Disk /) { next; }# Skip header line
if (/^--/) { next; }# Skip separator lines
@line = split(/\s+|\t/,$_);# Build temp array of current line
$lun = $line[4];# Set lun ID
$path = $line[1];# Set path
$disk = $line[0];# Set disk#
$hba = $line[2];# Set hba port - use sdd gethba.exe to get wwpn
$size = $line[7];# Set size in gb
$lss = $line[8];# Set ds lss
$vol = $line[9];# Set DS8K volume
$rank = $line[10];# Set rank - DOES NOT WORK FOR ROTATE VOLUME OR ROTATE extent
$c_a = $line[11]; # Set the Cluster and adapter accessing rank
$dshba = $line[13];# Set shark hba - this is unusable with perfmon which isn't aware of paths
$dsport = $line[14];# Set shark port physical location - this is unusable with perfmon which isn't aware of paths
$lun{$disk} = $lun;# Set the LUN in hash with disk as key for later lookup
$disk{$lun} = $disk;# Set vpath in hash with lun as key
$lss{$lun} = $lss;# Set lss in hash with lun as key
$rank{$lun} = $rank;# Set rank in hash with lun as key
$dshba{$lun} = $dshba;# Set dshba in hash with lun as key - this is unusable with perfmon which isn't aware of paths
$dsport{$lun} = $dsport;# Set dsport in hash with lun as key - this is unusable with perfmon which isn't aware of paths
if (length($lun) > 8) {
$ds = substr($lun,0,7);# Set the DS8K serial
} else {
$ds = substr($lun,3,5);# Set the ESS serial
}
$ds{$lun} = $ds; # set ds8k in hash with LUN as key
}
################
# Print Header #
################
print "DATE,TIME,Subsystem Serial,Rank,LUN,Disk,Disk Reads/sec, Avg Read RT(ms),Disk Writes/sec,Avg Write
RT(ms),Avg Total Time,Avg Read Queue Length,Avg Write Queue Length,Read KB/sec,Write KB/sec\n";
##################################################################################################
# Read in perfmon and create record for each hdisk and split the first column into date and time #
##################################################################################################
while (<PERFMON>) {
if (/^$/) { next; } # Skip empty lines
if (/^--/) { next; } # Skip separator lines
if (/PDH-CSV/) {
@header = split(/,/,$_); # Build header array
shift(@header); # Remove the date element
unshift(@header,"Date","Time"); # Add in date, time
next; # Go to next line
}
@line = split(/\t|,/,$_); # Build temp array for current line
@temp = split(/\s|\s+|\"/,$line[0]);# Split the first element into array
### Print out the data here - key is date, time, disk
while (($date,$times) = each(%diskrrt)) {# Loop through each date-time hash
while (($time,$disks) = each(%$times)) {# Nested Loop through each time-disk hash
while (($disk,$value) = each(%$disks)) {# Nest loop through disk-value hash
$diskrrt = $diskrrt{$date}{$time}{$disk};# Set shortnames for easier print
$diskwrt = $diskwrt{$date}{$time}{$disk};
$diskreads = $diskreads{$date}{$time}{$disk};
$diskwrites = $diskwrites{$date}{$time}{$disk};
$total_time = ($diskrrt*$diskreads)+($diskwrt*$diskwrites);
$diskrql = $diskrql{$date}{$time}{$disk};
$diskwql = $diskwql{$date}{$time}{$disk};
$diskrkbs = $diskrkbs{$date}{$time}{$disk};
$diskwkbs = $diskwkbs{$date}{$time}{$disk};
$lun = $lun{$disk}; # Lookup lun for current disk
iostat_aix53_essmap.pl
The purpose of this script (Example E-2) is to correlate AIX 5.3 iostat -D data with SDD
datapath query essmap output for analysis. Beginning in AIX 5.3, the iostat command
provides the ability to continuously collect read and write response times. Prior to AIX 5.3,
filemon was required to collect disk read and write response times.
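A possible collection and invocation sequence is shown next (a sketch; the interval, count, and file names are examples). The leading date command supplies the time stamp line that the script expects, and the 60-second interval matches the interval assumed in the script:

date > iostat.out
iostat -D 60 60 >> iostat.out
datapath query essmap > essmap.txt
perl iostat_aix53_essmap.pl iostat.out essmap.txt > iostat_essmap.csv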
TIME STORAGE DS HBA DS PORT LSS RANK LUN VPATH HDISK #BUSY KBPS KB READ KB WRITE
0:00:00 75P2831 R1-B1-H3-ZA 30 29 ffe3 75P28311D0A vpath13 hdisk51 - 0.61 0.14 0.47
0:00:00 75P2831 R1-B1-H3-ZA 30 23 ffe9 75P2831170A vpath12 hdisk50 0.80 8.10 8.10 -
0:00:00 75P2831 R1-B5-H3-ZB 431 23 ffe9 75P2831170A vpath12 hdisk52 0.60 8.60 8.60 -
0:00:00 75P2831 R1-B5-H3-ZB 431 29 ffe3 75P28311D0A vpath13 hdisk53 - 0.95 0.22 0.73
0:00:00 75P2831 R1-B6-H1-ZA 500 29 ffe3 75P28311D0A vpath13 hdisk55 - 0.54 0.21 0.33
0:00:00 75P2831 R1-B6-H1-ZA 500 23 ffe9 75P2831170A vpath12 hdisk54 0.80 8.90 8.90 -
0:00:00 75P2831 R1-B8-H1-ZA 700 29 ffe3 75P28311D0A vpath13 hdisk57 0.10 0.52 0.12 0.40
0:00:00 75P2831 R1-B8-H1-ZA 700 23 ffe9 75P2831170A vpath12 hdisk56 0.60 7.60 7.60 -
1:01:00 75P2831 R1-B1-H3-ZA 30 29 ffe3 75P28311D0A vpath13 hdisk51 - 4.80 3.80 1.00
Several of the columns are hidden to fit the example output into the provided space.
### Read in pcmpath and create hash with hdisk as key and LUN SN as value #############
$file = $ARGV[0]; # Set iostat -D for 1st arg
$essmap = $ARGV[1]; # Set 'datapath query essmap' output to 2nd arg
### Read in essmap and create hash with hdisk as key and LUN SN as value
sub readessmap($essmap) {
open(ESSMAP,$essmap) or die "cannot open $essmap\n";
while (<ESSMAP>) {
if (/^$/) { next; } # Skip empty lines
if (/^--/) { next; } # Skip empty lines
if (/^Disk/) { next; } # Skip header
@line = split(/\s+|\t/,$_); # Build temp array
$lun = $line[4]; # Set lun
$hdisk = $line[1]; # set hdisk
$vpath = $line[0]; # set vpath
$hba = $line[3]; # set hba
$lss = $line[8]; # set lss
$rank = $line[10]; # set rank
$dshba = $line[13]; # set shark hba
$dsport = $line[14]; # set shark port
$vpath{$lun} = $vpath; # Set vpath in hash
$lss{$lun} = $lss; # Set lss in hash
$rank{$lun} = $rank; # Set rank in hash
$dshba{$hdisk} = $dshba; # Set dshba in hash
$dsport{$hdisk} = $dsport; # Set dsport in hash
$lun{$hdisk} = $lun; # Hash with hdisk as key and lun as value
if (length($lun) > 8) {
$ds = substr($lun,0,7); # Set ds serial to first 7 chars
} else {
$ds = substr($lun,3,5); # or this is ESS and only 5 chars
}
$ds{$lun} = $ds; # set the ds serial in a hash
}
}
### Read in iostat and create record for each hdisk
sub readiostat($iostat) {
### Print Header
print "TIME,STORAGE SN,DS HBA,DS PORT,LSS,RANK,LUN,VPATH,HDISK,#BUSY,KBPS,KB READ PS,KB WRITE
PS,TPS,RPS,READ_AVG_SVC,READ_MIN_SVC,READ_MAX_SVC,READ_TO,WPS,WRITE_AVG_SVC,WRITE_MIN_SVC,WRITE_MAX_SVC,WRI
TE_TO,AVG QUE,MIN QUE, MAX QUE,QUE SIZE\n";
$time = 0; # Set time variable to 0
$cnt = 0; # Set count to zero
$newtime = 'Time_1'; # Set a relative time stamp
open(IOSTAT,$iostat) or die "cannot open $iostat\n";# Open iostat file
while (<IOSTAT>) { # Read in iostat file
if ($time == 0) {
if (/^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun/) {# This only works if a time stamp was in file
$date_found = 1; # Set flag to indicate that the date line was found
@line = split(/\s+|\s|\t|,|\//,$_);# build temp array
$date = $line[1] . " " . $line[2] . " " . $line[5];# Create date
$time = $line[3]; # Set time
$newtime = $time; # Set newtime
$interval = 60; # Set interval to 60 seconds
iostat_sun-mpio.pl
The purpose of this script (Example E-3) is to reformat Solaris iostat -xn data so that it can
be analyzed in a spreadsheet. The logical unit number (LUN) identification works properly
with Solaris systems running MPxIO only. In the iostat -xn output with MPxIO, there is only
one disk shown per LUN in the output. There are no flags required.
sub gettime() { # Compute a new time stamp from base time, interval, and count
my $time = $_[0]; # Base time stamp (hh:mm:ss)
my $interval = $_[1]; # Sample interval in seconds
my $cnt = $_[2]; # Number of intervals elapsed
my $hr = substr($time,0,2); # Hours of the base time
my $min = substr($time,3,2); # Minutes of the base time
my $sec = substr($time,6,2); # Seconds of the base time
$hrsecs = $hr * 3600; # Convert hours to seconds
$minsecs = $min * 60; # Convert minutes to seconds
my $addsecs = $interval * $cnt; # Seconds elapsed since the base time
my $totsecs = $hrsecs + $minsecs + $sec + $addsecs; # Total seconds
$newhr = int($totsecs/3600); # New hour value
$newsecs = $totsecs%3600; # Seconds remaining after full hours
$newmin = int($newsecs/60); # New minute value
$justsecs = $newsecs%60; # New second value
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this book.
Other publications
These publications are also relevant as further information sources:
IBM System Storage DS8800 and DS8700 Introduction and Planning Guide, GC27-2297
IBM System Storage DS8000 Host Systems Attachment Guide, GC27-2298
IBM System Storage DS Command-Line Interface User's Guide for the DS6000 series
and DS8000 series, GC53-1127
IBM System Storage DS Open Application Programming Interface Reference, GC35-0516
IBM System Storage Multipath Subsystem Device Driver User’s Guide, GC52-1309
IBM System Storage DS8800 Performance Whitepaper, WP102025
IBM System Storage DS8700 Performance Whitepaper, WP101614
IBM System Storage DS8700 Performance with Easy Tier, WP101675
IBM System Storage DS8700 and DS8800 Performance with Easy Tier 2nd Generation,
WP101961
IBM System Storage DS8800 and DS8700 Performance with Easy Tier 3rd Generation,
WP102024
IBM DS8000 Storage Virtualization Overview Including Storage Pool Striping, Thin
Provisioning, Easy Tier, WP101550
Online resources
These websites and URLs are also relevant as further information sources:
IBM Disk Storage Feature Activation (DSFA)
http://www.ibm.com/storage/dsfa
Documentation for the DS8000
http://www.ibm.com/systems/storage/disk/ds8000/index.html
IBM System Storage Interoperation Center (SSIC)
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp
IBM Announcement letters (for example, search for R6.2)
http://www.ibm.com/common/ssi/index.wss
IBM Techdocs Library - The IBM Technical Sales Library
http://www.ibm.com/support/techdocs/atsmastr.nsf/Web/Techdocs
DS8800 Performance Monitoring and Tuning

This IBM Redbooks publication provides guidance about how to configure, monitor, and
manage your IBM System Storage DS8800 and DS8700 storage systems to achieve optimum
performance. It describes the DS8800 and DS8700 performance features and characteristics,
including IBM System Storage Easy Tier and DS8000 I/O Priority Manager. It also describes
how they can be used with the various server platforms that attach to the storage system.
Then, in separate chapters, we detail specific performance recommendations and discussions
that apply for each server environment, as well as for database and DS8000 Copy Services
environments.

We also outline the various tools available for monitoring and measuring I/O performance for
different server environments, as well as describe how to monitor the performance of the
entire DS8000 storage system.

This book is intended for individuals who want to maximize the performance of their DS8800
and DS8700 storage systems and investigate the planning and monitoring tools that are
available.

The IBM System Storage DS8800 and DS8700 storage system features, as described in this
book, are available for the DS8700 with Licensed Machine Code (LMC) level 6.6.2x.xxx or
higher and the DS8800 with Licensed Machine Code (LMC) level 7.6.2x.xxx or higher.

For information about optimizing performance with the previous DS8000 models, DS8100
and DS8300, see the following IBM Redbooks publication: DS8000 Performance Monitoring
and Tuning, SG24-7146.

For more information:
ibm.com/redbooks