VMWARE VSPHERE 4
ADVANCED TROUBLESHOOTING

ERIC SLOOF - NTPRO.NL
INTRODUCTION




• VMware Certified Instructor
• Blogger @ NTPRO.NL
AGENDA

1.   Introduction by Scott Drummonds
2.   CPU troubleshooting
3.   Memory troubleshooting
4.   Storage troubleshooting
5.   Network troubleshooting
6.   Troubleshooting tools
INTRODUCTION

Scott Drummonds
Technical Director, vSpecialists, APJ at EMC

The performance space is massive. It’s nearly
impossible to keep up with everything that is
happening in this space. With the benefit of close
contact with VMware's performance engineering team
I was barely able to hold the reins on that massive
beast. The secret is not to try and learn every little
thing out there, but to develop a strong handle on
troubleshooting using esxtop, vCenter and vscsiStats.
Everything comes from there.
CPU TROUBLESHOOTING – CPU READY TIME




The vSphere Client graph refreshes every 20 seconds, so a
ready time in milliseconds is a fraction of 20,000 ms:
1,000 ms / 20,000 ms = 5%
34 ms / 20,000 ms = 0.17% (no worries)
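
The same conversion as a minimal Python sketch, assuming the 20-second refresh interval mentioned on the slide. The function name `ready_percent` is made up for this example and is not part of any VMware tool.

```python
def ready_percent(ready_ms: float, interval_ms: float = 20_000) -> float:
    """Convert a CPU ready time in milliseconds into a percentage of the sample interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return ready_ms * 100.0 / interval_ms


print(ready_percent(1000))  # 5.0
print(ready_percent(34))    # 0.17
```

Passing a different `interval_ms` adapts the sketch to charts that use another sample interval.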
CPU TROUBLESHOOTING – CPU READY TIME




A %RDY figure of 17.97% means that the virtual machine spent
17.97% of its last sample period waiting for available CPU
resources. Esxtop’s default refresh interval is 5 seconds.
The PCPU AVG value in this example is 100%.
CPU TROUBLESHOOTING - FLOWCHART
VM CPU ready time: is %RDY > 10% (2000 mSec)?
  NO:  Other problem (memory, storage or network)
  YES: is %PCPU USED > 90%?
    NO:  Hmmm (see MAX LIMITED)
    YES: is used time ~ ready time, with spikes?
      YES: No problem
      NO:  Host CPU saturation
CPU TROUBLESHOOTING - MAX LIMITED

%MLMTD - The max limited time is the percentage of time the VM
world was ready to run but deliberately wasn't scheduled
because that would violate the VM’s "CPU limit" settings.

%RDY includes %MLMTD.

For CPU contention, use "%RDY - %MLMTD": 99.75 - 99.73 = 0.02.
So there’s no contention despite the high ready time.
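
The subtraction can be written as a small Python helper. This is only a sketch; the function name `cpu_contention` is hypothetical, and clamping the result at zero is an assumption added to absorb rounding noise in the sampled values.

```python
def cpu_contention(rdy_pct: float, mlmtd_pct: float) -> float:
    """Ready time that is not explained by a CPU limit: %RDY minus %MLMTD."""
    # Clamp at zero (assumption): the two counters are sampled values and
    # may differ by a rounding error.
    return max(rdy_pct - mlmtd_pct, 0.0)


contention = cpu_contention(99.75, 99.73)
print(f"{contention:.2f}")  # 0.02
```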
CPU TROUBLESHOOTING - MAX LIMITED

Hmmm: is %MLMTD > 0%?
  YES: VMkernel deliberately didn't run the VM
  NO:  SMP virtual machine? Check %CSTP - Co-Scheduling
CPU TROUBLESHOOTING – CO SCHEDULING

At any particular point in time, each virtual CPU may be
scheduled, descheduled, preempted, or blocked waiting for
some event.

Without co-scheduling, the vCPUs associated with an SMP VM
would be scheduled independently, breaking the guest's
assumptions regarding uniform progress.

VMware uses the term "skew" to refer to the difference in
execution rates between two or more vCPUs associated with
an SMP VM.
CPU TROUBLESHOOTING – CO SCHEDULING




Type “e” to show all the worlds associated with a single
virtual machine. The %CSTP metric shows how much time a
world spent co-descheduled, waiting to be co-scheduled.
CPU TROUBLESHOOTING - RECAP

o If ready time is <= 5%, there’s no problem.

o If ready time is between 5% and 10%, there might be an issue.

o If ready time is >= 10%, there’s a performance issue.

o Check whether the virtual machine’s CPU is limited.

o Check whether CPU is overcommitted all the time;
  occasional spikes are no problem.

o If it’s an SMP virtual machine, check whether the application
  is multithreaded and actually uses the resources.

o If the ESX host is saturated, reduce the number of virtual
  machines.
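
The three ready-time bands from this recap can be sketched in Python. The function name and the returned labels are made up for this illustration; only the 5% and 10% thresholds come from the slide.

```python
def classify_ready(rdy_pct: float) -> str:
    """Map a %RDY value onto the three bands from the recap."""
    if rdy_pct < 0:
        raise ValueError("rdy_pct cannot be negative")
    if rdy_pct <= 5:
        return "no problem"
    if rdy_pct < 10:
        return "might be an issue"
    return "performance issue"


print(classify_ready(0.17))   # no problem
print(classify_ready(17.97))  # performance issue
```

A limited VM should be checked first, because %RDY includes %MLMTD and a CPU limit can push a VM into the highest band without any real contention.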
MEMORY TROUBLESHOOTING




For each running virtual machine, the ESX host reserves
physical memory for the virtual machine’s reservation
(if any) and for its virtualization overhead. Because of the
memory management techniques the ESX host uses,
your VMs can use more memory than is physically
available.
MEMORY TROUBLESHOOTING – PAGE SHARING




                Transparent Page Sharing




Transparent page sharing (TPS) reclaims memory by
consolidating redundant pages with identical content.
This frees memory that virtual machines would otherwise
occupy with duplicate pages. On modern Intel/AMD
processors, page sharing shows up in esxtop only when host
memory is overcommitted.
MEMORY TROUBLESHOOTING – PAGE SHARING




When a guest frees memory, the guest physical memory is not
really released; it is only moved to the guest’s “free” list.
The ESX host has no access to the guest’s “free” list, so it
cannot reclaim the memory freed up by the guest.
Sharing happens with other virtual machines on the same
host, but also within a virtual machine.
MEMORY TROUBLESHOOTING - BALLOONING




Ballooning reclaims memory by artificially increasing the
memory pressure inside the guest. It becomes a performance
issue when the guest OS starts paging active memory to its
own page file. Ballooning performs better than ESX swapping
or ESX memory compression.
MEMORY TROUBLESHOOTING - BALLOONING




The VMkernel sets MCTLTGT, the target size of the VM’s
memory balloon, and compares it with MCTLSZ, the current
balloon size, to decide whether to inflate or deflate the
balloon for a virtual machine.
If MCTLTGT > MCTLSZ the VMkernel inflates the balloon.
If MCTLTGT < MCTLSZ the VMkernel deflates the balloon.
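
The two rules above can be sketched as a comparison in Python. The function name is hypothetical, and the "steady" result for equal values is an assumption; the slide only describes the two unequal cases.

```python
def balloon_action(mctltgt_mb: float, mctlsz_mb: float) -> str:
    """Compare the balloon target (MCTLTGT) with the current size (MCTLSZ)."""
    if mctltgt_mb > mctlsz_mb:
        return "inflate"
    if mctltgt_mb < mctlsz_mb:
        return "deflate"
    return "steady"  # assumption: target reached, nothing to do


print(balloon_action(512, 128))  # inflate
print(balloon_action(0, 128))    # deflate
```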
MEMORY TROUBLESHOOTING - LIMIT




Don’t configure VM memory limits; set an appropriate VM
memory size instead! Virtual machines deployed from a
template with a configured memory limit will become
ballooning ghosts when more configured memory is added.
Even though there’s enough memory available at host level,
you will see ballooning, up to a maximum of 65% of the VM’s
configured memory.
MEMORY TROUBLESHOOTING - LIMIT
This is an example of a virtual machine configured with
1024 MB of memory. Before 20:15 there’s no memory limit
configured; after 20:15 the limit is set to 512 MB. As soon
as the VM tries to access memory above 512 MB, ballooning
kicks in.
MEMORY TROUBLESHOOTING - RESERVATION




                        Ballooning
           RES         Compression   RES

                          SWAP


Be careful with configuring a high VM reservation. As soon
as a virtual machine has used or touched its reserved
memory, the other virtual machines can’t use it anymore.
The VM reservation is also used for calculating the slot size
in an HA cluster with the “number of host failures allowed”
admission control policy. Only reserve what is really used
and needs to be guaranteed.
MEMORY TROUBLESHOOTING – COMPRESSION




                        Compression


Memory compression reclaims memory by compressing the pages
that need to be swapped out. If the swapped-out pages can be
compressed and stored in a compression cache located in main
memory, the next access to the page only causes a page
decompression, which can be an order of magnitude faster than
disk access. This reduces the number of future synchronous
swap-in operations. A page is only compressed if the
compression ratio is at least 50%.
MEMORY TROUBLESHOOTING – COMPRESSION




o The CACHESZ value (10% of the VM memory) is the
  compression cache size.
o The CACHEUSD value is the amount of compression cache
  currently in use.
o ZIP/s and UNZIP/s are the number of compression and
  decompression operations per second.
MEMORY TROUBLESHOOTING – SWAP




o SWCUR is the current amount of guest physical memory
  swapped out to the virtual machine's swap file by the
  VMkernel. Swapped memory stays on disk until the
  virtual machine needs it.

o If SWTGT > SWCUR, the VMkernel can start swapping
  when necessary.
o If SWTGT < SWCUR, the VMkernel stops swapping
  memory.
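
A minimal Python sketch of the SWTGT/SWCUR comparison described above. The function name and the returned labels are invented for this example, and the "at target" result for equal values is an assumption that the slide does not cover.

```python
def swap_state(swtgt_mb: float, swcur_mb: float) -> str:
    """Compare the swap target (SWTGT) with the amount currently swapped (SWCUR)."""
    if swtgt_mb > swcur_mb:
        return "may swap out"     # VMkernel can start swapping when necessary
    if swtgt_mb < swcur_mb:
        return "stops swapping"   # swapped pages stay on disk until the VM needs them
    return "at target"            # assumption: nothing more to swap out


print(swap_state(200, 50))  # may swap out
print(swap_state(0, 50))    # stops swapping
```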
MEMORY TROUBLESHOOTING - SWAP




                        Ballooning

                       Compression
                          SWAP


High swap-in latency, which can be tens of milliseconds, can
severely degrade guest performance. If available, configure
local SSD storage as your virtual machine swap file location.
The degradation is 12% with local SSD, versus 69% for
Fibre Channel and 83% for local SATA storage.
MEMORY TROUBLESHOOTING – SWAP




o   SWPWT is the percentage of time that the virtual machine
    is waiting for memory to be swapped in.
    This value shouldn’t be above 5%.

o   SWR/s is the rate at which memory is swapped from
    (SSD) disk into active memory.
o   SWW/s is the rate at which memory is being swapped from
    active memory and written to (SSD) disk.
MEMORY TROUBLESHOOTING - RECAP

o Be careful with setting virtual machine memory
  reservations. Once reserved memory is touched by the VM, the
  other virtual machines can’t use that memory anymore. Only
  reserve what the virtual machine really needs.
o Don’t set memory limits; set an appropriate virtual machine
  memory size instead.
o Do not disable page sharing or the balloon driver. Ballooning
  is OK as long as the guest OS isn’t using its own page file
  for active memory swapping.
o The use of large pages reduces memory management
  overhead and can therefore increase hypervisor
  performance. But take into consideration that with large
  pages (2 MB), TPS might not occur until memory
  overcommitment is high enough to require the large pages
  to be broken into small pages.
STORAGE TROUBLESHOOTING – THE STACK

                    Application
        Guest       File System
                    I/O Drivers    HD Tune Pro

                   VMM

                       VSCSI       GAVG/cmd
                       VMFS
        VMKernel
                    Core Storage   KAVG/cmd
                       Driver
                                   DAVG/cmd
STORAGE TROUBLESHOOTING – THE METRICS
DAVG - This is the latency seen at the device driver level.
It includes the round-trip time between the HBA and the
storage.

KAVG - This counter tracks the latency added by the ESX
kernel's command processing.

GAVG - This is the round-trip latency that the guest sees
for all I/O requests sent to the virtual storage device.
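
The three counters are related: the guest-visible latency is roughly the device latency plus the kernel latency (GAVG ≈ DAVG + KAVG). That relationship is not stated on the slide and is added here as background. A minimal Python sketch, with a made-up function name, derives the kernel share from the other two values:

```python
def kernel_latency_ms(gavg_ms: float, davg_ms: float) -> float:
    """Estimate KAVG as the part of GAVG that is not device latency (DAVG)."""
    # Clamp at zero (assumption): averaged counters can differ by rounding.
    return max(gavg_ms - davg_ms, 0.0)


print(kernel_latency_ms(12.5, 10.0))  # 2.5
```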
STORAGE TROUBLESHOOTING – IBM DS3400

IBM-DS3400 with 2
arrays and 18 logical
drives – RAID 5

ISP2432-based 4Gb
Fibre Channel to PCI
Express HBA
STORAGE TROUBLESHOOTING – IOMEGA IX2
Iomega StorCenter ix2
with 500 GB - RAID 1

1 Gigabit Ethernet
Jumbo frame support
iSCSI target or CIFS/NFS
STORAGE TROUBLESHOOTING - (CONS/S)




The SCSI reservation conflict counter, CONS/s, becomes non-zero
when a host tries to place a SCSI reservation on a LUN that already
has a SCSI reservation in progress. This happens only when two hosts
try to perform a metadata operation on the same LUN at exactly the
same time.
STORAGE TROUBLESHOOTING - (CONS/S)




A SCSI reservation is held for a very short period (a few hundred
microseconds), so the chance of a conflict is very small on a
small cluster. However, as the number of hosts that share the LUN
increases, conflicts could arise more frequently.
STORAGE TROUBLESHOOTING - VSCSISTAT
vscsiStats collects and reports counters on storage
activity. Its data is collected at the virtual SCSI device
level in the kernel. This means that results are reported
per VMDK (or RDM) irrespective of the underlying
storage protocol. The following data are reported in
histogram form:


o IO size
o Seek distance
o Outstanding IOs
o Latency (in microseconds)
STORAGE TROUBLESHOOTING - ALIGNMENT

VMDK file (NTFS)         Cluster      Cluster    Cluster   Cluster      Cluster    Cluster
 VMFS volume                       Block                             Block
   SAN LUN                   Chunk                             Chunk


VMDK file (NTFS)   Cluster   Cluster       Cluster   Cluster   Cluster       Cluster
 VMFS volume                  Block                             Block
   SAN LUN                   Chunk                             Chunk


  Like other disk-based file systems, VMFS suffers a
  penalty when the partition is unaligned. Use the vSphere
  Client to create VMFS partitions, since the vSphere Client
  automatically aligns the partitions on a 64 KB boundary.
STORAGE TROUBLESHOOTING – ALIGNMENT

• Guest OS alignment is important for Microsoft Windows
  Server 2003, XP and 2000. When a partition is created on
  Windows 2008 or Windows 7, the newly created partition
  is automatically aligned.
• Windows uses a factor of 512 bytes to create volume
  clusters. This behavior causes a misaligned partition.
• To resolve this issue, use the Diskpart.exe tool to create
  the disk partition and to specify a starting offset of 128
  sectors (64 kilobytes):
  create partition primary align=64
• The partition is aligned when the following calculation
  yields a whole number:
  ((Partition offset) * (Disk sector size)) / (Stripe unit size)
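
The alignment formula can be checked with a few lines of Python. This is a sketch under stated assumptions: the offset is given in sectors, the sector size is 512 bytes, and the 64 KB stripe unit is only an example value that should be replaced with the array's real stripe unit size. The function name is made up.

```python
def is_aligned(partition_offset_sectors: int,
               sector_size_bytes: int = 512,
               stripe_unit_bytes: int = 64 * 1024) -> bool:
    """True when (offset * sector size) / stripe unit size is a whole number."""
    offset_bytes = partition_offset_sectors * sector_size_bytes
    return offset_bytes % stripe_unit_bytes == 0


print(is_aligned(128))  # True: 128 sectors * 512 bytes = 64 KB
print(is_aligned(63))   # False: 63 sectors * 512 bytes = 32,256 bytes
```

The 63-sector case is the default starting offset commonly seen on older Windows versions, which is why those guests need manual alignment.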
STORAGE TROUBLESHOOTING - RECAP

o If KAVG/cmd > 3 mSec or DAVG/cmd > 20 mSec
  there might be a storage performance problem.
o Check alignment on the array, VMFS and in the
  guest OS.
o Monitor the number of reservation conflicts per
  second and be careful with snapshots.
o Pay attention to drive types; the more drives
  you use, the more IOPS you will get.
o When creating a VMFS, give it the right size
  and keep in mind how many virtual machines
  you want to host on that datastore.
o When choosing a block size, stick to it.
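
The latency thresholds from the first recap item can be turned into a simple check. A minimal Python sketch; the function and constant names are invented for this example, and only the 3 ms and 20 ms limits come from the slide.

```python
from typing import List

KAVG_LIMIT_MS = 3.0   # kernel latency threshold from the recap
DAVG_LIMIT_MS = 20.0  # device latency threshold from the recap


def storage_warnings(kavg_ms: float, davg_ms: float) -> List[str]:
    """Return a warning for each latency counter that exceeds its threshold."""
    warnings = []
    if kavg_ms > KAVG_LIMIT_MS:
        warnings.append(f"KAVG/cmd {kavg_ms} ms exceeds {KAVG_LIMIT_MS} ms")
    if davg_ms > DAVG_LIMIT_MS:
        warnings.append(f"DAVG/cmd {davg_ms} ms exceeds {DAVG_LIMIT_MS} ms")
    return warnings


print(storage_warnings(1.0, 5.0))        # []
print(len(storage_warnings(4.0, 25.0)))  # 2
```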
NETWORK TROUBLESHOOTING – THE NETWORK STACK
NETWORK TROUBLESHOOTING – DROPPED PKT

Receive packets might be dropped at the virtual switch if the
virtual machine’s network driver runs out of receive (Rx)
buffers, which is a buffer overflow.

The dropped receive packets (%DRPRX) may be reduced by
increasing the Rx buffers for the virtual network driver.
NETWORK TROUBLESHOOTING – NIC SETTINGS

In ESX 4.1, you can configure the advanced VMXNET3
parameters from the Device Manager in the Windows guest OS.

It’s possible to increase the Rx buffers for the virtual
network driver here.

This also works on an Intel E1000 with the native driver
installed in the guest OS.
NETWORK TROUBLESHOOTING – VLAN ID




VMXNET3




For VLAN troubleshooting, you have to create a new
dvPortgroup with a VLAN trunk. This way the network traffic
is delivered to the guest OS with its VLAN tag intact.

Now you can configure the VLAN advanced parameters for
an Intel E1000 or a VMXNET3 adapter in the guest OS and
specify a VLAN ID. This allows you to hop between VLANs.
NETWORK TROUBLESHOOTING – LOAD BASED TEAMING
                                             pSwitch




LBT reshuffles port bindings dynamically, based on load
and dvUplink usage, to make efficient use of the
available bandwidth.

When Load Based Teaming reassigns ports, the MAC
address moves to a different pSwitch port. The pSwitch
must allow for this.
NETWORK TROUBLESHOOTING – LOAD BASED TEAMING




LBT will only move a flow when the mean send or receive
utilization on an uplink exceeds 75 percent of capacity over
a 30-second period. LBT will not move flows more often
than every 30 seconds. Enable PortFast mode for the
physical switch ports facing the ESX Server.
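
The trigger condition can be sketched in Python. This is an illustration of the rule as stated on the slide, not of the actual LBT implementation: the function name is made up, and the inputs are assumed to be utilization samples (in percent of uplink capacity) collected over one 30-second window.

```python
from statistics import mean
from typing import Sequence


def uplink_overloaded(tx_util_pct: Sequence[float],
                      rx_util_pct: Sequence[float],
                      threshold_pct: float = 75.0) -> bool:
    """True when mean send or receive utilization over the window exceeds the threshold."""
    if not tx_util_pct or not rx_util_pct:
        raise ValueError("need at least one sample per direction")
    return mean(tx_util_pct) > threshold_pct or mean(rx_util_pct) > threshold_pct


print(uplink_overloaded([80, 90, 70], [10, 10, 10]))  # True: mean send utilization is 80%
print(uplink_overloaded([60, 70, 80], [20, 30, 40]))  # False: both means are below 75%
```

Because the rule uses the mean over the window, a short burst above 75 percent does not cause a move on its own.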
NETWORK TROUBLESHOOTING - RECAP

o Enable PortFast mode for the physical switch ports
  facing the ESXi Server.

o Disable STP for the physical switch ports facing the
  ESX Server.

o Use the VMXNET3 virtual network card wherever
  possible.
TROUBLESHOOTING TOOLS

•   Veeam Monitor
•   VMTurbo Watchdog
•   Quest vFoglight
•   VKernel Capacity Analyzer
•   VESI VMware Community PowerPack
•   VMware Health Check Analyzer
•   Bouke Groenescheij -> Graph-VM
•   Esxplot and perfmon
•   Rob de Veij - RVTools
•   Xangati for ESX
TROUBLESHOOTING TOOLS – GRAPH-VM




                                   http://www.jume.nl




Bouke Groenescheij has created a framework of scripts
that produces some really nice graphs. Graph-VM uses
PowerShell to gather the information and creates reports
with RRDtool.
TROUBLESHOOTING TOOLS – ESXPLOT




                                  http://labs.vmware.com




The following command runs esxtop in batch mode,
writing all statistics to the file perfstats.csv every 10
seconds for 360 iterations (a total of 60 minutes) before
exiting:

esxtop -a -b -d 10 -n 360 > perfstats.csv
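
Once the capture has finished, the CSV file can be inspected with a short script. The Python sketch below checks the run time of a capture and lists the ready-time columns from the header row. The function names are hypothetical, and the column naming (perfmon-style headers that end in `% Ready`) is an assumption about the CSV layout, so check the header row of your own capture first.

```python
import csv
import io


def batch_duration_minutes(delay_s: int, iterations: int) -> float:
    """Total run time of a batch capture (-d delay, -n iterations), in minutes."""
    return delay_s * iterations / 60


def ready_columns(csv_text: str) -> list:
    """Return the header names that end in '% Ready' (assumed column naming)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in header if name.endswith("% Ready")]


print(batch_duration_minutes(10, 360))  # 60.0
```

To use it on a real capture, read the file into a string first, for example `ready_columns(open("perfstats.csv").read())`. Captures made with `-a` can be very wide, so reading only the first line of the file is a reasonable optimization.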
TROUBLESHOOTING TOOLS - RVTOOLS



                                http://www.robware.net




RVTools is a Windows .NET 2.0 application which uses the VI
SDK to display information about your virtual machines and
ESX hosts. RVTools is able to list information about CPU,
memory, disks, NICs, CD-ROM and floppy drives, snapshots,
VMware Tools, ESX hosts, NICs, datastores, switches, ports
and health checks.
TROUBLESHOOTING TOOLS - XANGATI




                               http://xangati.com




Xangati for ESX is a free tool designed for smaller-scale
environments with only a few ESX/ESXi hosts. It offers
continuous, real-time visibility into over 100 metrics on an
ESX/ESXi host and its VMs’ activity, including
communications, CPU, memory, disk, and storage latency.
THANK YOU - QUESTIONS
This presentation is available for download at
http://www.ntpro.nl and http://www.vmug.nl

Don't forget to fill out the Session Evaluation.

Advancedtroubleshooting 101208145718-phpapp01

  • 1. VMWARE VSPHERE 4 ADVANCED TROUBLESHOOTING ERIC SLOOF - NTPRO.NL
  • 2. INTRODUCTION • VMware Certified Instructor • Blogger @ NTPRO.NL
  • 4. AGENDA 1. Introduction by Scott Drummonds 2. CPU troubleshooting 3. Memory troubleshooting 4. Storage troubleshooting 5. Network troubleshooting 6. Troubleshooting tools
  • 5. INTRODUCTION Scott Drummonds Technical Director, vSpecialists, APJ at EMC The performance space is massive. It’s nearly impossible to keep up with everything that is happening in this space. With the benefit of close contact with VMware's performance engineering team I was barely able to hold the reins on that massive beast. The secret is not to try and learn every little thing out there, but to develop a strong handle on troubleshooting using esxtop, vCenter and vscsiStats. Everything comes from there.
  • 6. CPU TROUBLESHOOTING – CPU READY TIME The vSphere Client Graph refreshes every 20 seconds 1000 Milliseconds / 20.000 Milliseconds = 5 % 34 Milliseconds / 20.000 Milliseconds = 0,17 % <~ no worries 
  • 7. CPU TROUBLESHOOTING – CPU READY TIME A %RDY figure of 17.97% means that the virtual machine spent 17.97% of its last sample period waiting for available CPU resources. Esxtop’s default refresh interval is 5 seconds. The PCPU AVG value in this example is 100%.
  • 8. CPU TROUBLESHOOTING - FLOWCHART %RDY > 10% NO Other VM CPU Ready Time 2000 mSec Problem: - Memory - Storage - Network No Problem YES YES Used time ~ %PCPU ready time Hmmm YES USED > 90% NO with spikes NO Host CPU Saturation
  • 9. CPU TROUBLESHOOTING - MAX LIMITED %MLMTD - The max limited time is the percentage of time the VM world was ready to run but deliberately wasn't scheduled because that would violate the VM’s "CPU limit" settings. %RDY includes %MLMTD For CPU contention, use "%RDY - %MLMTD“. 99.75 – 99,73 = 0.02 So there’s no contention despite of the high ready time.
  • 10. CPU TROUBLESHOOTING - MAX LIMITED VMKernel deliberately didn't run Yes %MLMTD Hmmm > 0% No SMP virtual machine? Check %CSTP - Co-Scheduling
  • 11. CPU TROUBLESHOOTING – CO SCHEDULING At any particular point in time, each virtual cpu may be scheduled, descheduled, preempted, or blocked waiting for some event. Without co scheduling, the VCPUs associated with an SMP VM would be scheduled independently, breaking the guest's assumptions regarding uniform progress. VMware uses the term "skew" to refer to the difference in execution rates between two or more VCPUs associated with an SMP VM.
  • 12. CPU TROUBLESHOOTING – CO SCHEDULING Type “e” to show all the worlds associated with a single virtual machine. The %CSTP metric indicates co scheduling.
  • 13. CPU TROUBLESHOOTING - RECAP o If ready time <= 5%, there’s no problem. o If ready time is 5% <=> 10%, there might be an issue. o If ready time is => 10% there’s a performance issue. o Check if the virtual machine’s CPU is not limited. o Check if there’s CPU over commitment all the time, occasional spikes are no problem. o If it’s an SMP virtual machine check if the application is multithreading and actually using the resources. o If the ESX host is saturated reduce the number of virtual machines.
  • 14. MEMORY TROUBLESHOOTING For each running virtual machine, the ESX host reserves physical memory for the virtual machine’s reservation (if any) and for its virtualization overhead. Because of the memory management techniques the ESX host uses, your VMs can use more memory than there’s physically available…
  • 15. MEMORY TROUBLESHOOTING – PAGE SHARING Transparent Page Sharing Transparent page sharing (TPS) reclaims memory by consolidating redundant pages with identical content. This helps to free memory that a virtual machine would otherwise (not) be using. Page sharing will show up in esxtop at modern Intel/AMD processors only when host memory is overcommitted.
  • 16. MEMORY TROUBLESHOOTING – PAGE SHARING Guest physical memory is not “freed”; the memory is moved to the guest’s “free” list. The ESX host has no access to the guest’s “free” list, so the ESX host cannot “reclaim” the memory freed up by the guest. Sharing happens with other virtual machines on the same host but also within virtual machines.
  • 17. MEMORY TROUBLESHOOTING - BALLOONING Ballooning reclaims memory by artificially increasing the memory pressure inside the guest. It becomes a performance issue when the guest OS starts paging active memory to its own page file. Ballooning offers better performance than ESX swapping or ESX memory compression.
  • 18. MEMORY TROUBLESHOOTING - BALLOONING The MCTLTGT (target) value is set by the VMkernel for the VM’s memory balloon size. Together with the MCTLSZ (size) metric, the VMkernel uses it to inflate and deflate the balloon for a virtual machine. If MCTLTGT > MCTLSZ, the VMkernel inflates the balloon. If MCTLTGT < MCTLSZ, the VMkernel deflates the balloon.
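The MCTLTGT/MCTLSZ comparison reads naturally as a three-way decision. The sketch below only interprets two values taken from esxtop; the function name and the "steady" label for equal values are assumptions, not VMkernel terminology.

```python
def balloon_direction(mctltgt_mb: float, mctlsz_mb: float) -> str:
    """Interpret esxtop's balloon target (MCTLTGT) against the current size (MCTLSZ)."""
    if mctltgt_mb > mctlsz_mb:
        return "inflate"   # the VMkernel wants to reclaim more memory from the guest
    if mctltgt_mb < mctlsz_mb:
        return "deflate"   # the VMkernel is giving memory back to the guest
    return "steady"        # target reached (label assumed for this example)


print(balloon_direction(512, 256))
print(balloon_direction(0, 256))
```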
  • 19. MEMORY TROUBLESHOOTING - LIMIT Don’t configure VM memory limits, set an appropriate VM memory size instead! Virtual machines deployed from a template with a configured memory limit will become ballooning ghosts after adding more configured memory. Even though there’s enough memory available at host level you will see ballooning with a maximum of 65%.
  • 20. MEMORY TROUBLESHOOTING - LIMIT This is an example of a virtual machine configured with 1024 MB of memory and no limit. Before 20:15 there’s no memory limit configured; after 20:15 the limit is set to 512 MB. As soon as the VM tries to access memory above 512 MB, ballooning kicks in.
  • 21. MEMORY TROUBLESHOOTING - RESERVATION Ballooning RES Compression RES SWAP Be careful with configuring a high VM reservation. As soon as a virtual machine has used or touched its reserved memory, the other virtual machines can’t use it anymore. The VM reservation is also used for calculating the slot size in an HA cluster with “number host failures allowed”. Only reserve what is really used and needs to be guaranteed.
  • 22. MEMORY TROUBLESHOOTING – COMPRESSION Memory compression reclaims memory by compressing the pages that need to be swapped out. If the swapped-out pages can be compressed and stored in a compression cache located in main memory, the next access to the page only causes a page decompression, which can be an order of magnitude faster than disk access. This means the number of future synchronous swap-in operations will be reduced. A page is only compressed if the compression ratio is at least 50%.
  • 23. MEMORY TROUBLESHOOTING – COMPRESSION o The CACHESZ value (10% of the VM memory) is the compression cache size. o The CACHEUSD value is the compression cache currently used. o ZIP/s and UNZIP/s are the compression and decompression actions per second.
  • 24. MEMORY TROUBLESHOOTING – SWAP o SWCUR is the current amount of guest physical memory swapped out to the virtual machine's swap file by the VMkernel. Swapped memory stays on disk until the virtual machine needs it. o If SWTGT > SWCUR, the VMkernel can start swapping when necessary. o If SWTGT < SWCUR, the VMkernel stops swapping memory.
  • 25. MEMORY TROUBLESHOOTING - SWAP Ballooning Compression SWAP High swap-in latency, which can be tens of milliseconds, can severely degrade guest performance. If available, configure local SSD storage for your virtual machine swap file location. There’s a -12% degradation with local SSD versus -69% for Fibre Channel and -83% for local SATA storage.
  • 26. MEMORY TROUBLESHOOTING – SWAP o SWPWT is the percentage of time that the virtual machine is waiting for memory to be swapped in. This value shouldn’t be above 5%. o SWR/s is the rate at which memory is swapped from (SSD) disk into active memory. o SWW/s is the rate at which memory is being swapped from active memory and written to (SSD) disk.
  • 27. MEMORY TROUBLESHOOTING - RECAP o Be careful with setting virtual machine memory reservations. When memory is touched by the VM, the other virtual machines can’t use the memory anymore. Only configure what the virtual machine really needs. o Don’t set memory limits; set an appropriate virtual machine memory size instead. o Do not disable page sharing or the balloon driver. Ballooning is OK as long as the guest OS isn’t using its own page file for active memory swapping. o The use of large pages results in reduced memory management overhead and can therefore increase hypervisor performance. But take into consideration that with large pages (2 MB), TPS might not occur until memory overcommitment is high enough to require the large pages to be broken into small pages.
  • 28. STORAGE TROUBLESHOOTING – THE STACK Diagram of the storage stack, top to bottom: Application; Guest (File System, I/O Drivers; measured with HD Tune Pro); VMM (VSCSI; GAVG/cmd); VMkernel (VMFS, Core Storage; KAVG/cmd); Driver (DAVG/cmd).
  • 29. STORAGE TROUBLESHOOTING – THE METRICS DAVG - This is the latency seen at the device driver level. It includes the round-trip time between the HBA and the storage. KAVG - This counter tracks the latency added by the ESX kernel while it processes the command. GAVG - This is the round-trip latency that the guest sees for all IO requests sent to the virtual storage device.
  • 30. STORAGE TROUBLESHOOTING – IBM DS3400 IBM-DS3400 with 2 arrays and 18 logical drives – RAID 5 ISP2432-based 4Gb Fiber Channel to PCI Express HBA
  • 31. STORAGE TROUBLESHOOTING – IOMEGA IX2 Iomega StorCenter ix2 with 500 GB - RAID 1 1 Gigabit Ethernet Jumbo frame support iSCSI target or CIFS/NFS
  • 32. STORAGE TROUBLESHOOTING - (CONS/S) The SCSI reservation conflict counter, CONS/s, becomes non-zero when a host tries to place a SCSI reservation on a LUN that already has a SCSI reservation in progress. This happens only when two hosts try to perform a metadata operation on the same LUN at exactly the same time.
  • 33. STORAGE TROUBLESHOOTING - (CONS/S) A SCSI reservation is held for a very short period (a few hundred microseconds), so the chance of a conflict is very small on a small cluster. However, as the number of hosts that share the LUN increases, conflicts can arise more frequently.
  • 34. STORAGE TROUBLESHOOTING - VSCSISTAT vscsiStats collects and reports counters on storage activity. Its data is collected at the virtual SCSI device level in the kernel. This means that results are reported per VMDK (or RDM) irrespective of the underlying storage protocol. The following data are reported in histogram form: o IO size o Seek distance o Outstanding IOs o Latency (in mSecs)
  • 35. STORAGE TROUBLESHOOTING - ALIGNMENT Diagram, shown twice: VMDK file (NTFS) clusters on top of VMFS volume blocks on top of SAN LUN chunks. Like other known disk-based file systems, VMFS suffers a penalty when the partition is unaligned. Use the vSphere client to create VMFS partitions, since the vSphere client automatically aligns the partitions along the 64 KB boundary.
  • 36. STORAGE TROUBLESHOOTING – ALIGNMENT • Guest OS alignment is important for Microsoft Windows Server 2003, XP and 2000. When a partition is created on Windows 2008 or Windows 7, the newly created partition is automatically aligned. • Windows uses a factor of 512 bytes to create volume clusters. This behavior causes a misaligned partition. • To resolve this issue, use the Diskpart.exe tool to create the disk partition and to specify a starting offset of 128 sectors (64 kilobytes). • Create partition primary align=64 • To check alignment, ((Partition offset) * (Disk sector size)) / (Stripe unit size) should be a whole number.
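The alignment formula on this slide can be checked with a few lines of code. This is a sketch under stated assumptions: the offset is given in sectors, the defaults (512-byte sectors, 64 KB stripe unit) match the numbers on the slide, and the 63-sector offset in the example is the typical legacy Windows default, used here only as an illustration.

```python
def is_aligned(offset_sectors: int,
               sector_size: int = 512,
               stripe_unit: int = 64 * 1024) -> bool:
    """True when (partition offset * sector size) / stripe unit size is a whole number."""
    return (offset_sectors * sector_size) % stripe_unit == 0


# 128 sectors * 512 bytes = 64 KB: sits exactly on the boundary.
print(is_aligned(128))
# 63 sectors * 512 bytes = 31.5 KB: every stripe is straddled.
print(is_aligned(63))
```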
  • 37. STORAGE TROUBLESHOOTING - RECAP o If KAVG/cmd > 3 mSec or DAVG/cmd > 20 mSec, there might be a storage performance problem. o Check alignment on the array, on VMFS and in the guest OS. o Monitor the number of reservation conflicts per second and be careful with snapshots. o Pay attention to drive types; the more drives you use, the more IOPS you will get. o When creating a VMFS, give it the right size and keep in mind how many virtual machines you want to host on that datastore. o When choosing a block size, stick to it.
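The two latency thresholds from the recap can be turned into a quick check on esxtop readings. The function name and the message texts are assumptions for illustration; the thresholds are the ones given on the slide.

```python
def storage_latency_warnings(kavg_ms: float, davg_ms: float) -> list[str]:
    """Return a warning for each latency counter that exceeds the recap thresholds."""
    warnings = []
    if kavg_ms > 3.0:
        warnings.append(f"KAVG/cmd {kavg_ms} ms > 3 ms: commands are delayed in the VMkernel")
    if davg_ms > 20.0:
        warnings.append(f"DAVG/cmd {davg_ms} ms > 20 ms: slow response from HBA or storage")
    return warnings


print(storage_latency_warnings(kavg_ms=0.2, davg_ms=8.0))
print(storage_latency_warnings(kavg_ms=4.5, davg_ms=35.0))
```

An empty list means neither counter points at a storage problem, so the cause is more likely elsewhere in the stack.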
  • 38. NETWORK TROUBLESHOOTING – THE NETWORK STACK
  • 39. NETWORK TROUBLESHOOTING – DROPPED PKT Receive packets might be dropped at the virtual switch if the virtual machine’s network driver runs out of receive (Rx) buffers, in other words a buffer overflow. The dropped receive packets (%DRPRX) may be reduced by increasing the Rx buffers for the virtual network driver.
  • 40. NETWORK TROUBLESHOOTING – NIC SETTINGS In ESX 4.1, you can configure the advanced VMXNET3 parameters from the Device Manager in the Windows guest OS. It’s possible to increase the Rx buffers for the virtual network driver here. This also works on an Intel E1000 with the native driver installed in the guest OS.
  • 41. NETWORK TROUBLESHOOTING – VLAN ID VMXNET3 For VLAN troubleshooting, you have to create a new dvPortgroup with a VLAN trunk. This way the network traffic is delivered with a VLAN tag in the guest OS. Now you can configure the VLAN advanced parameters for an Intel E1000 or a VMXNET3 adapter in the guest OS and specify a VLAN ID. This allows you to hop between VLANs.
  • 42. NETWORK TROUBLESHOOTING – LOAD BASED TEAMING LBT reshuffles port bindings dynamically, based on load and dvUplink usage, to make efficient use of the available bandwidth. When Load Based Teaming reassigns ports, the MAC address moves to a different pSwitch port. The pSwitch must allow for this.
  • 43. NETWORK TROUBLESHOOTING – LOAD BASED TEAMING LBT will only move a flow when the mean send or receive utilization on an uplink exceeds 75 percent of capacity over a 30-second period. LBT will not move flows more often than every 30 seconds. Enable PortFast mode for the physical switch ports facing the ESX Server.
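The two LBT rules on this slide (mean utilization above 75 percent over the window, and no more than one move every 30 seconds) can be sketched as a single decision function. This is not VMware's implementation; the function name, parameters and the way samples are passed in are assumptions made for illustration.

```python
from statistics import mean


def lbt_should_move(utilization_pct_samples: list[float],
                    seconds_since_last_move: float,
                    threshold_pct: float = 75.0,
                    min_interval_s: float = 30.0) -> bool:
    """Decide whether a flow may be moved off an uplink.

    utilization_pct_samples: send or receive utilization samples for one uplink,
    collected over the last 30-second window (assumed input format).
    """
    if not utilization_pct_samples:
        return False
    overloaded = mean(utilization_pct_samples) > threshold_pct
    cooled_down = seconds_since_last_move >= min_interval_s
    return overloaded and cooled_down


print(lbt_should_move([80, 90, 70], seconds_since_last_move=45))
print(lbt_should_move([80, 90, 70], seconds_since_last_move=10))
```

Because the decision uses the mean over the window, a short burst above 75 percent does not trigger a move on its own.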
  • 44. NETWORK TROUBLESHOOTING - RECAP o Enable PortFast mode for the physical switch ports facing the ESXi Server. o Disable STP for the physical switch ports facing the ESX Server. o Use the VMXNET3 virtual network card wherever possible.
  • 45. TROUBLESHOOTING TOOLS • Veeam Monitor • VMTurbo Watchdog • Quest vFoglight • VKernel Capacity Analyzer • VESI VMware Community PowerPack • VMware Health Check Analyzer • Bouke Groenescheij -> Graph-VM • Esxplot and perfmon • Rob de Veij - RVTools • Xangati for ESX
  • 46. TROUBLESHOOTING TOOLS – GRAPH-VM http://www.jume.nl Bouke Groenescheij has created a framework of scripts which are able to produce some really nice graphs. Graph-VM uses PowerShell to gather the information and creates reports with RRDtool.
  • 47. TROUBLESHOOTING TOOLS – ESXPLOT http://labs.vmware.com The following command runs esxtop in batch mode, writing all statistics to the file perfstats.csv every 10 seconds for 360 iterations (a total of 60 minutes) before exiting: esxtop -a -b -d 10 -n 360 > perfstats.csv
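Besides Esxplot and perfmon, the batch file can also be inspected with a short script. The sketch below finds the peak value of every ready-time column. It assumes the perfmon-style CSV layout that esxtop batch mode produces (one timestamp column, then one quoted column per counter) and that ready-time columns end in "% Ready"; the host name, VM name and values in the sample are made up for the example.

```python
import csv
import io


def peak_ready_by_column(csv_text: str) -> dict[str, float]:
    """Return the highest value seen in each column whose name ends with '% Ready'."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    ready_cols = [i for i, name in enumerate(header) if name.endswith("% Ready")]
    peaks = {header[i]: 0.0 for i in ready_cols}
    for row in reader:
        for i in ready_cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip empty or truncated samples
            peaks[header[i]] = max(peaks[header[i]], value)
    return peaks


# Hypothetical two-sample excerpt; real files have thousands of columns.
sample = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1234:vm1)\% Ready","\\esx01\Group Cpu(1234:vm1)\% Used"',
    '"01/01/2011 10:00:00","2.50","40.10"',
    '"01/01/2011 10:00:10","17.97","95.00"',
])
for column, peak in peak_ready_by_column(sample).items():
    print(column, peak)
```

For a real capture, read the file with `open("perfstats.csv", newline="").read()` and pass the text to the function.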
  • 48. TROUBLESHOOTING TOOLS - RVTOOLS http://www.robware.net RVTools is a Windows .NET 2.0 application which uses the VI SDK to display information about your virtual machines and ESX hosts. RVTools is able to list information about CPU, memory, disks, NICs, CD-ROM, floppy drives, snapshots, VMware Tools, ESX hosts, NICs, datastores, switches, ports and health checks.
  • 49. TROUBLESHOOTING TOOLS - XANGATI http://xangati.com Xangati for ESX is a free tool designed for smaller-scale environments with only a few ESX/ESXi hosts. It offers continuous, real-time visibility into over 100 metrics on an ESX/ESXi host and its VMs' activity, including communications, CPU, memory, disk, and storage latency.
  • 50. THANK YOU - QUESTIONS This presentation is available for download at http://www.ntpro.nl and http://www.vmug.nl Don't forget to fill out the Session Evaluation.