ESXTOP

ESXTOP is a utility for ESX/ESXi hosts that displays real-time resource usage. It has three modes: interactive mode shows live statistics; batch mode saves output to a file; replay mode replays previously collected statistics. ESXTOP allows switching between views of CPU, memory, disk, network, and other usage. Key commands provide filtering and customization options to optimize troubleshooting performance issues.

Uploaded by

Umesh Chavaan
ESXTOP

esxtop examines real-time resource usage on the local ESX host only. resxtop provides the same view for both ESX
and ESXi hosts and can be run remotely, from another ESX/ESXi server or from the vMA (vSphere Management Assistant).
There are 3 modes in esxtop/resxtop:
Interactive mode
Batch mode
Replay mode
Interactive mode (the default mode) All statistics are displayed in real time.
Batch mode Statistics are collected and saved to a file (csv) that can be viewed and analyzed later with
Windows perfmon and other tools.
Replay mode Works like a record and replay operation. Data collected by the vm-support
command is interpreted and played back as esxtop statistics. You can view the captured performance
information for a particular time period as if it were real time, to see what was happening during
that time. It is well suited to VMware support staff, who can replay the stats to understand what was
happening to the server during that time.
ESX Interactive mode (the default mode) All statistics are displayed in real time, similar to
Windows Task Manager. By default the screen refreshes every 5 seconds.
Below is a snapshot of esxtop with the memory stats.

Below are the single-key commands for switching between views while running in interactive mode.
c CPU view (the default screen when you start esxtop)
Type c in interactive mode to switch to the CPU resource utilization screen of the ESX server.
m Memory view
Type m in interactive mode to switch to the memory resource utilization screen of the ESX server.
d Disk adapter view
Type d in interactive mode to switch to the storage disk adapter resource utilization screen of the ESX server.
u Disk device view
Type u in interactive mode to switch to the storage disk device resource utilization screen of the ESX server.
v Virtual disk view
Type v in interactive mode to switch to the virtual disk resource utilization screen of the ESX server.
n Network view
Type n in interactive mode to switch to the network utilization screen of the ESX server.
y Power management view
Type y in interactive mode to switch to the power utilization screen of the ESX server (the key listings later in this document use p for this view).
h Help screen for esxtop
Type h to display the help for esxtop commands.
q Quit esxtop
Type q to quit esxtop interactive mode.

f Add or remove fields in the current view. For example, in the memory stats view, press f to list the fields; if you
want MEM SIZE displayed, press G to add that field, then press Enter to return to the view with the added field.

o Change the order of the fields in the current view. Use a-o to change the order. Uppercase moves a field left,
lowercase moves a field right.

s Set the refresh delay. The default is 5 seconds. Press the space bar to refresh immediately.

W Save the customized fields. Add or remove fields as you wish; if you want the customized fields to load every
time, save with the default name (default: /root/.esxtop4rc), otherwise save under a name of your choice.

To load esxtop with your customized fields:

1. Type the esxtop command.
2. Add or remove the fields you want. For example, I just pressed m to show the memory details.
3. Press W to save the file.
4. When prompted for a config file name, type a location other than the default one. For example, I typed
/home/mohammedk/mystats
5. Quit the current esxtop screen.
6. To load esxtop with your saved custom configuration, type esxtop -c <your configuration file name>. In our
case: esxtop -c /home/mohammedk/mystats

ESXTOP - Batch Mode

Batch mode Statistics are collected and saved to a file (csv) that can be viewed and
analyzed later with Windows perfmon and other tools.

To run esxtop in batch mode and save the output for future analysis, use the following syntax:

esxtop -b -d 10 -n 5 > /home/mohammedk/esxstats.csv

The -d switch sets the number of seconds between refreshes.
The -n switch sets the number of iterations esxtop runs.
In the example above, esxtop runs for about 50 seconds: 10 seconds delay * 5 iterations.
(> /home/mohammedk/esxstats.csv) redirects the esxtop output into a csv file stored at
/home/mohammedk/esxstats.csv

Once the command has completed, browse to /home/mohammedk to find the esxtop output file
esxstats.csv. Transfer the csv file to your Windows desktop using WinSCP and analyze it with
Windows perfmon or esxplot.
ESXTOP Replay Mode
Replay mode Works like a record and replay operation. Data collected by the vm-support command is
interpreted and played back as esxtop statistics. You can view the captured performance information for a particular
time period as if it were real time, to see what was happening during that time.
This is very useful for VMware support engineers who don't have access to your system. They can run esxtop
against the collected support file to analyze a performance issue that occurred
during that time. Make sure you have enough free space on your server to save the support file:
collecting statistics for a longer duration consumes a large amount of disk space.
To run esxtop in replay mode, run the vm-support command first. I am running it from the directory
/home/mohammedk, so the output file is saved in that directory.

vm-support -s -i 5 -d 10

-i is the iteration count and -d is the delay between refreshes, so the command above collects stats for 50 seconds
(10 seconds * 5 iterations). Option meanings differ between vm-support releases, so confirm them with
vm-support -h on your host. Once vm-support has completed, all the files are stored in /home/mohammedk.
Extract esx-2012-06-2813.51.29993.tgz into the same directory using the command below:
tar -xzf esx-2012-06-2813.51.29993.tgz

To run esxtop in replay mode, pass the directory extracted by the command above, vm-support-vmware-arena-2012-06-2813.51.29993, to the -R option:
esxtop -R vm-support-vmware-arena-2012-06-2813.51.29993

The output looks similar to live esxtop, but here the collected support file is being replayed.

ESXTOP is a fantastic tool for the VMware administrator when troubleshooting performance issues in a
vSphere environment. ESXTOP has a somewhat steep learning curve, but it is well worth it. In this post I want to help
you get a head start with ESXTOP. If you want a really good read, I recommend Duncan's very comprehensive post
on the same subject.
ESXTOP is available in two ways: through the ESXi Shell, or through the vSphere Management Assistant
with the command RESXTOP. In this article I will focus on ESXTOP from the ESXi Shell. It is very simple to get
access to ESXTOP.
Step 1: Get access to the ESXi Shell. Open your vSphere Client, go to the host, then Configuration,
Security Profile, and start the ESXi Shell service on the specific ESXi host.
Step 2: Download putty (or another SSH client) and create an SSH connection on port 22 to your ESXi host. Log in
with root and your password.
Step 3: Type the command esxtop and hit return.
Step 4: You are now looking at ESXTOP. It should look similar to this:

What you are looking at is the CPU screen in ESXTOP, which shows the CPU-specific counters. You can
browse through the different pages. Type m to see memory metrics, n for network, and so on. Type h to see all
available commands. By default ESXTOP shows a lot of worlds; a world is similar to a
process in Windows Task Manager. To hide the VMkernel worlds, type uppercase V. You then
see only the virtual machines running on this specific ESXi host.
Now that you are inside ESXTOP, let's focus on some good counters to use for performance troubleshooting.

CPU
When troubleshooting CPU performance for your virtual machines the following counters are the most important.

%USED, %RDY, %CSTP

%USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
%RDY is a Key Performance Indicator! Always start with this one. It defines how much time your virtual
machine wanted to execute CPU cycles but could not get access to the physical CPU, in other words how much time
it spent in a queue. I normally expect this value to be better than 5% (this equals 1000 ms in the vCenter
Performance Graphs).
CPU Ready 1000 ms. equals 5%
JANUARY 31, 2011 BY FRANKBRIX

One of the key performance counters in a vSphere environment is:

CPU ready (%RDY in ESXTOP)

CPU ready is the time a virtual CPU is ready to run but is not being scheduled on a physical CPU. Under
normal circumstances this indicates that there are not enough physical CPU resources on an ESX/ESXi host. This is the
first go-to counter when your users complain about bad performance.
The CPU ready counter is accessible from the vSphere Client and from ESXTOP. I have made two screenshots
showing a virtual machine and its ready time:

vCenter Performance Graphs (value 1035 milliseconds)

ESXTOP (value 5.38%)

What we see is a virtual machine with a ready time of 1035 ms. or 5.38%. These numbers are actually telling us the
same thing. The performance graph updates every 20 seconds (or 20,000 milliseconds).
With a ready time of 1035 ms. we can convert it to a percentage:

(1035 ms. x 100) / 20000 ms. = 5.175%

To be able to interpret ready times it is essential to know the relationship between the percentage in ESXTOP and
the ms. of the Performance Graphs. You are seeing the same numbers. One is in milliseconds, the other is a percentage.

1% = 200 ms.
5% = 1,000 ms.
10% = 2,000 ms.
100% = 20,000 ms.
In general you want to see virtual machines with a ready time lower than 1000 ms. or 5%.
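
The conversion above can be sketched as a pair of small shell helpers. This is only an illustration: the function names ready_ms_to_pct and ready_pct_to_ms are made up for this example, and the 20-second default assumes the real-time graph interval described above.

```shell
#!/bin/sh
# Convert a CPU ready value in milliseconds (vCenter performance graphs)
# into the percentage that esxtop shows as %RDY.
# Assumption: the sample interval defaults to 20 seconds, the graph
# update interval mentioned in the text. Function names are illustrative.
ready_ms_to_pct() {
    ready_ms=$1
    interval_s=${2:-20}
    awk -v ms="$ready_ms" -v s="$interval_s" \
        'BEGIN { printf "%.2f\n", (ms * 100) / (s * 1000) }'
}

# Inverse: percentage back to milliseconds for the same interval.
ready_pct_to_ms() {
    pct=$1
    interval_s=${2:-20}
    awk -v p="$pct" -v s="$interval_s" \
        'BEGIN { printf "%.0f\n", p * s * 1000 / 100 }'
}
```

For example, ready_ms_to_pct 1000 prints 5.00 and ready_pct_to_ms 10 prints 2000, matching the list above. Pass a second argument if your graph uses a different sample interval.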
Just heard of a cool calculator to convert CPU ready times to a percentage: http://www.vmcalc.com/

%CSTP tells you how much time a vCPU of a multi-vCPU virtual machine spends waiting for its other vCPUs to catch
up. If this number is higher than 3% you should consider lowering the number of vCPUs in your virtual machine.
Memory
When troubleshooting memory performance these are the counters you want to focus on from a virtual machine
perspective.

MCTL?, MCTLSZ, SWCUR, SWR/s, SWW/s

MCTL? This column is either YES or NO. YES means that the balloon driver is installed. The balloon driver is
automatically installed with VMware Tools and should be in every virtual machine. If this column says NO,
figure out why.

MCTLSZ This column shows how inflated the balloon is in the virtual machine. If it says 500MB, the
balloon driver inside the guest operating system has taken 500MB from Windows/Linux etc. You would
expect to see a value of 0 (zero) in this column.

SWCUR tells you how much memory the virtual machine has in the .vswp file. If you see 500MB
here, 500MB is in the swap file. This does not necessarily equal bad performance. To figure out
if your virtual machine is suffering from hypervisor swapping you need to look at the next two counters. In a healthy
environment you would want this value to be 0 (zero).

SWR/s This value tells you the read activity to your swap file. If you see a non-zero number here, your virtual
machine is suffering from hypervisor swapping.

SWW/s This value tells you the write activity to your swap file. You want to see the number 0 (zero) here. Every
number above 0 is BAD.
ESXTOP
Intro
Thresholds
Howto Run
Howto Capture
Howto Analyze
Howto Limit esxtop to specific VMs
References
Changelog
This page is solely dedicated to one of the best tools in the world for ESX: esxtop.
Intro
I am a huge fan of esxtop! I read a couple of pages of the esxtop bible every day before I go to bed. Something I
am always struggling with, however, is the thresholds of specific metrics. I fully understand that it is not
black and white; performance is the perception of a user in the end.
There must be a certain threshold however. For instance it must be safe to say that when %RDY constantly exceeds
the value of 20 it is very likely that the VM responds sluggishly. I want to use this article to define these thresholds,
but I need your help. There are many people reading these articles; together we must know at least a dozen metrics.
Let's collect and document them with possible causes if known.
Please keep in mind that these should only be used as a guideline when doing performance troubleshooting! Also be
aware that some metrics are not part of the default view. You can add fields to an esxtop view by pressing f
followed by the corresponding character.

I used VMworld presentations, VMware whitepapers, VMware documentation, VMTN topics and of course my
own experience as a source, and these are the metrics and thresholds I came up with so far. Please comment and help
build the main source for esxtop thresholds.
Metrics and Thresholds
Display | Metric | Threshold | Explanation
CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set. See Jason's explanation for vSMP VMs.
CPU | %CSTP | | Excessive usage of vSMP. Decrease amount of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by high IO VM. Check other metrics and VM for possible root cause.
CPU | %MLMTD | | The percentage of time the vCPU was ready to run but deliberately wasn't scheduled because that would violate the CPU limit settings. If larger than 0 the world is being throttled due to the limit on CPU.
CPU | %SWPWT | | VM waiting on swapped pages to be read from disk. Possible cause: Memory overcommitment.
MEM | MCTLSZ | | If larger than 0 host is forcing VMs to inflate balloon driver to reclaim memory as host is overcommitted.
MEM | SWCUR | | If larger than 0 host has swapped memory pages in the past. Possible cause: Overcommitment.
MEM | SWR/s | | If larger than 0 host is actively reading from swap (vswp). Possible cause: Excessive memory overcommitment.
MEM | SWW/s | | If larger than 0 host is actively writing to swap (vswp). Possible cause: Excessive memory overcommitment.
MEM | CACHEUSD | | If larger than 0 host has compressed memory. Possible cause: Memory overcommitment.
MEM | ZIP/s | | If larger than 0 host is actively compressing memory. Possible cause: Memory overcommitment.
MEM | UNZIP/s | | If larger than 0 host is accessing compressed memory. Possible cause: Previously host was overcommitted on memory.
MEM | N%L | 80 | If less than 80 VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and remotely uses memory via interconnect. Check GST_ND(X) to find out which NUMA nodes are used.
NETWORK | %DRPTX | | Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization.
NETWORK | %DRPRX | | Dropped packets received, hardware overworked. Possible cause: very high network utilization.
DISK | GAVG | 25 | Look at DAVG and KAVG as the sum of both is GAVG.
DISK | DAVG | 25 | Disk latency most likely to be caused by array.
DISK | KAVG | | Disk latency caused by the VMkernel, high KAVG usually means queuing. Check QUED.
DISK | QUED | | Queue maxed out. Possibly queue depth set too low. Check with array vendor for optimal queue depth value.
DISK | ABRTS/s | | Aborts issued by guest (VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused for instance when paths failed or array is not accepting any IO for whatever reason.
DISK | RESETS/s | | The number of commands reset per second.
DISK | CONS/s | 20 | SCSI Reservation Conflicts per second. If many SCSI Reservation Conflicts occur performance could be degraded due to the lock on the VMFS.

Running esxtop

Although understanding all the metrics esxtop provides seems impossible, using esxtop is fairly simple. When
you get the hang of it you will notice yourself staring at the metrics/thresholds more often than ever. The following
keys are the ones I use the most.
Open a console session or ssh to ESX(i) and type:
esxtop
By default the screen is refreshed every 5 seconds; change this by typing:
s2
Changing views is easy; type the following keys for the associated views:
c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device (includes NFS as of 4.0 Update 2)
v = disk VM
p = power states
V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limiting the number of entities, for instance the top 5
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world
Add/Remove fields:
f
<type appropriate character>
Changing the order:
o
<move field by typing appropriate character: uppercase = left, lowercase = right>
Saving all the settings you've changed:
W
Keep in mind that when you don't change the file name it will be saved and used as the default settings.
Help:
?
In very large environments esxtop can cause high CPU utilization due to the amount of data that needs to be gathered
and the calculations that need to be done. If CPU appears to be highly utilized due to the number of entities (VMs /
LUNs etc.), a command line option can be used which locks specific entities and keeps esxtop from gathering specific
info, to limit the amount of CPU power needed:
esxtop -l
Capturing esxtop results
First things first. Make sure you only capture relevant info. Ditch the metrics you don't need. In other words, run
esxtop and remove (f) the fields you don't need and add the ones you do! When you are finished make sure to
write (W) the configuration to disk. You can either write it to the default config file (esxtop4rc) or write the
configuration to a new file.
Now that you have configured esxtop as needed, run it in batch mode and save the results to a .csv file:
esxtop -b -d 2 -n 100 > esxtopcapture.csv
Where -b stands for batch mode, -d 2 is a delay of 2 seconds and -n 100 means 100 iterations. In this specific
case esxtop will log all metrics for 200 seconds. If you want to record all metrics make sure to add -a to your
string.
Or what about directly zipping the output as well? These .csv files can grow fast, and zipping them saves a lot of
precious disk space!
esxtop -b -a -d 2 -n 100 | gzip -9c > esxtopoutput.csv.gz
Please note that when a new VM is powered on, a VM is vMotioned to the host or a new world is created, it will not
show up within esxtop when -b is used, as the entities are locked! This behavior is similar to starting esxtop with -l.
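
Before loading a large capture into a graphical tool, it can help to see which counter columns it actually contains. The helper below is a minimal sketch, not part of esxtop: the function name list_counters is made up, and it assumes the first line of the file is a header row of double-quoted, comma-separated counter names.

```shell
#!/bin/sh
# List the counter columns of an esxtop batch CSV whose header contains
# a given substring, together with their column numbers.
# Assumptions (not confirmed by the text above): the first line is the
# header row and every field is double-quoted and comma-separated.
# list_counters is an illustrative name, not an esxtop feature.
list_counters() {
    csv_file=$1
    pattern=$2
    head -n 1 "$csv_file" | awk -v pat="$pattern" '
        {
            # Strip the outer quotes, then split on the "," between fields.
            sub(/^"/, ""); sub(/"$/, "")
            n = split($0, cols, "\",\"")
            for (i = 1; i <= n; i++)
                if (index(cols[i], pat) > 0)
                    print i ": " cols[i]
        }'
}
```

A call such as list_counters esxtopcapture.csv "% Ready" would print the number and full name of every matching column, which can then be passed to cut or awk to extract only those columns.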

Analyzing results
You can use multiple tools to analyze the captured data.
1. VisualEsxtop
2. perfmon
3. excel
4. esxplot

What is VisualEsxtop? It is a relatively new tool (published 1st of July 2013).


VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or
ESX hosts, and display ESX server stats with a better user interface and more advanced features.
That sounds nice, right? Let's have a look at how it works. This is what I did to get it up and running:

Go to http://labs.vmware.com/flings/visualesxtop and click download
Unzip VisualEsxtop.zip into the folder where you want to store the tool
Go to the folder
Double click visualesxtop.bat when running Windows (or follow William's tip for the Mac)
Click File and Connect to Live Server
Enter the Hostname, Username and Password and hit Connect
That is it

Now some simple tips:

By default the refresh interval is set to 5 seconds. You can change this by hitting Configuration and then
Change Interval
You can also load Batch Output. This might come in handy when you are a consultant, for instance, and a
customer sends you captured data. You can do this under: File -> Load Batch Output
You can filter output, which is very useful if you are looking for info on a specific virtual machine / world! See the
filter section.
When you click Charts and double click Object Types you will see a list of metrics that you can create
a chart with. Just unfold the ones you need and double click them to add them to the right pane

There are a bunch of other cool features in there, like color-coding of important metrics. The fact
that you can show multiple windows at the same time is also useful if you ask me, and of course the tooltips that provide
a description of the counter! If you ask me, this is a tool everyone should download and check out.
Let's continue with my second favorite tool, perfmon. I've used perfmon (part of Windows, also known as
Performance Monitor) multiple times and it's probably the easiest, as many people are already familiar with it. You
can import a CSV as follows:
1. Run: perfmon
2. Right click on the graph and select Properties.
3. Select the Source tab.
4. Select the Log files: radio button from the Data source section.
5. Click the Add button.
6. Select the CSV file created by esxtop and click OK.
7. Click the Apply button.
8. Optionally: reduce the range of time over which the data will be displayed by using the sliders under the
Time Range button.
9. Select the Data tab.
10. Remove all Counters.
11. Click Add and select appropriate counters.
12. Click OK.
13. Click OK.
The result of the above would be:

With MS Excel it is also possible to import the data as a CSV. Keep in mind though that the amount of captured data
is insane, so you might want to limit it by first importing it into perfmon, selecting the correct timeframe and
counters, and exporting this to a CSV. When you have done so you can import the CSV as follows:

1. Run: excel
2. Click on Data
3. Click Import External Data and click Import Data
4. Select Text files as Files of Type
5. Select file and click Open
6. Make sure Delimited is selected and click Next
7. Deselect Tab and select Comma
8. Click Next and Finish

All data should be imported and can be shaped / modelled / diagrammed as needed.
Another option is to use a tool called esxplot. It hasn't been updated in a while, and I am not sure what the state of
the tool is. You can still download the latest version, but personally I would recommend using VisualEsxtop
instead of esxplot, just because it is more recent.
1. Run: esxplot
2. Click File -> Import -> Dataset
3. Select file and click Open
4. Double click host name and click on metric

As you can clearly see in the screenshot above, the legend (right of the graph) is too long. You can modify that as
follows:
1. Click on File -> preferences
2. Select Abbreviated legends
3. Enter appropriate value

For those using a Mac, esxplot uses specific libraries which are only available on the 32-bit version of Python. In
order for esxplot to function correctly set the following environment variable:
export VERSIONER_PYTHON_PREFER_32_BIT=yes
Limiting your view
In environments with a very high consolidation ratio (high number of VMs per host) it can happen that the VM you
need performance counters for isn't shown on your screen. This happens purely because the height of
the screen limits what it can display. Unfortunately there is currently no command line option for esxtop to
specify specific VMs that need to be displayed. However, you can export the current list of worlds and import it
again to limit the number of VMs shown.
esxtop -export-entity filename
Now you should be able to edit your file and comment out specific worlds that do not need to be displayed.
esxtop -import-entity filename
I figured that there should be a way to get the info through the command line as well, and this is what I came up with.
Please note that <virtualmachinename> needs to be replaced with the name of the virtual machine that you need the
GID for.

VMWID=`vm-support -x | grep <virtualmachinename> |awk '{gsub("wid=", "");print $1}'`
VMXCARTEL=`vsish -e cat /vm/$VMWID/vmxCartelID`
vsish -e cat /sched/memClients/$VMXCARTEL/SchedGroupID
Now you can use the outcome within esxtop to limit (l) your view to that single GID. William Lam wrote
an article a couple of days after I added the GID section. The following is a lot simpler than what I came up with,
thanks William!
VM_NAME=STA202G ;grep "${VM_NAME}" /proc/vmware/sched/drm-stats | awk '{print $1}'

Identify CPU, memory, network, disk device or disk issues using ESXTOP in interactive,
batch or replay mode
Determine use cases for and apply esxtop Interactive, Batch and Replay modes
Use vscsiStats to gather storage performance data
Use esxtop/resxtop to collect performance data
Switch display: c:cpu i:interrupt m:memory n:network
d:disk adapter u:disk device v:disk VM p:power mgmt
fF Add or remove fields
oO Change the order of displayed fields
s Set the delay in seconds between updates
# Set the number of instances to display
W Write configuration file ~/.esxtop50rc
k Kill a world
e Expand/Rollup Cpu Statistics
V View only VM instances
L Change the length of the NAME field
l Limit display to a single group
usage: esxtop [-h] [-v] [-b] [-l] [-s] [-a] [-c config file] [-R vm-support-dir-path][-d delay] [-n iterations]
[-export-entity entity-file] [-import-entity entity-file]
-h prints this help menu.
-v prints version.
-b enables batch mode.
-l locks the esxtop objects to those available in the first snapshot.
-s enables secure mode.
-a show all statistics.
-c sets the esxtop configuration file, which by default is .esxtop50rc
-R enables replay mode.
-d sets the delay between updates in seconds.
-n runs esxtop for only n iterations.
Experimental Features
-export-entity writes the entity ids into a file, which can be modified to select interesting entities.
-import-entity reads the file of selected entities. If this option is used, esxtop only shows the data for the
selected entities.
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
Type the command below to display all fields, not just the default ones:
~ # esxtop -a

My screen is not large enough to show all of them. The useful trick at this point is to
press h, which takes you to the help screen. What matters here is not the help itself but the sort options it lists
for the current screen. For the CPU screen I have the filters below:

CPU (%USED, %RDY, %CSTP)

Press h as mentioned to see the keys you can sort by:

Sort by:
U:%USED R:%RDY N:GID
When troubleshooting CPU performance for your virtual machines the following counters are the most important.
%USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
%RDY is a Key Performance Indicator! Always start with this one. It defines how much time your virtual
machine wanted to execute CPU cycles but could not get access to the physical CPU, in other words how much time
it spent in a queue. I normally expect this value to be better than 5% (this equals 1000 ms in the vCenter
Performance Graphs).
%CSTP tells you how much time a vCPU of a multi-vCPU virtual machine spends waiting for its other vCPUs to catch
up. If this number is higher than 3% you should consider lowering the number of vCPUs in your virtual machine.

Memory (MCTL?, MCTLSZ, SWCUR, SWR/s, SWW/s)

Sort by:
M:MEMSZ B:MCTLSZ N:GID
When troubleshooting memory performance these are the counters you want to focus on from a virtual machine
perspective.

MCTL? This column is either YES or NO. YES means that the balloon driver is installed. The balloon driver
is automatically installed with VMware Tools and should be in every virtual machine. If this column says NO,
figure out why.
MCTLSZ This column shows how inflated the balloon is in the virtual machine. If it says 500MB, the
balloon driver inside the guest operating system has taken 500MB from Windows/Linux etc. You would
expect to see a value of 0 (zero) in this column.
SWCUR tells you how much memory the virtual machine has in the .vswp file. If you see 500MB
here, 500MB is in the swap file. This does not necessarily equal bad performance. To figure out
if your virtual machine is suffering from hypervisor swapping you need to look at the next two counters. In a healthy
environment you would want this value to be 0 (zero).
SWR/s This value tells you the read activity to your swap file. If you see a non-zero number here, your virtual
machine is suffering from hypervisor swapping.
SWW/s This value tells you the write activity to your swap file. You want to see the number 0 (zero) here. Every
number above 0 is BAD.
Sequence of memory bottleneck

If an ESXi host comes under memory pressure, it reclaims memory in this order:

Page sharing, then ballooning (MCTLSZ), then compression (CACHEUSD & ZIP/s), and as the
last resort swapping (SWR/s & SWW/s), which is really bad for performance.
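
The sequence above can be turned into a quick check of how far a VM has progressed along it. This is a rough sketch only: mem_pressure_stage is a made-up helper name, the values are assumed to be read by hand from the esxtop memory screen, and only ZIP/s is used to detect compression.

```shell
#!/bin/sh
# Report the most severe memory reclamation stage visible for one VM,
# following the order in the text: ballooning, compression, swapping.
# Arguments (read from the esxtop memory view): MCTLSZ ZIP/s SWR/s SWW/s
# Assumption: page sharing has no counter in this sketch, so "none" only
# means that none of the three later stages is visible.
mem_pressure_stage() {
    awk -v mctlsz="$1" -v zip="$2" -v swr="$3" -v sww="$4" 'BEGIN {
        if (swr > 0 || sww > 0) print "swapping"
        else if (zip > 0)       print "compression"
        else if (mctlsz > 0)    print "ballooning"
        else                    print "none"
    }'
}
```

For example, mem_pressure_stage 500 0 0 0 prints ballooning, while any non-zero SWR/s or SWW/s value prints swapping regardless of the other counters.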
Network (PKTTX/s, PKTRX/s, MbTX/s, MbRX/s)

Sort by:
T:MbTX/s R:MbRX/s
t:PKTTX/s r:PKTRX/s
N:Default
SPEED (Mbps) The link speed in Megabits per second. This information is only valid for a physical NIC.
FDUPLX Y implies the corresponding link is operating at full duplex. N implies it is not. This information is
only valid for a physical NIC.
UP Y implies the corresponding link is up. N implies it is not. This information is only valid for a physical
NIC.
PKTTX/s The number of packets transmitted per second.
PKTRX/s The number of packets received per second.
MbTX/s (Mbps) The MegaBits transmitted per second.
MbRX/s (Mbps) The MegaBits received per second.
Q: Why does MbRX/s not match PKTRX/s for different workloads?
A: This is because the packet size may not be the same. The average packet size can be computed as follows:
average_packet_size = MbRX/s / PKTRX/s . A large packet size may improve CPU efficiency of processing the
packet. However, it may potentially increase latency.
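
The packet size formula above can be worked through with a small helper. This is a sketch under stated assumptions: avg_packet_bytes is a made-up name, and one megabit is taken as 1,000,000 bits, which the text does not confirm.

```shell
#!/bin/sh
# Average packet size in bytes from the esxtop network counters:
#   average_packet_size = MbRX/s / PKTRX/s  (or MbTX/s / PKTTX/s)
# Assumption: 1 megabit = 1,000,000 bits; divide by 8 to get bytes.
# Prints "n/a" when no packets were seen, to avoid dividing by zero.
avg_packet_bytes() {
    awk -v mb="$1" -v pkts="$2" 'BEGIN {
        if (pkts <= 0) { print "n/a"; exit }
        printf "%.0f\n", (mb * 1000000 / 8) / pkts
    }'
}
```

With MbRX/s = 12 and PKTRX/s = 1000, avg_packet_bytes 12 1000 prints 1500, that is, roughly full-size Ethernet frames.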

Storage (d:disk adapter u:disk device v:disk VM vscsiStats)

Disk Adapters:

Metric | Description
KAVG/cmd | Average ESXi VMkernel latency per command, in milliseconds.
DAVG/cmd | Average device latency per command, in milliseconds.
GAVG/cmd | Average virtual machine operating system latency per command, in milliseconds.
QAVG/cmd | Average queue latency per command, in milliseconds.

Metric | Threshold | What to Check
DAVG/cmd | >20 | Storage processor/array performance for bottleneck.
KAVG/cmd | >1 | Kernel driver firmware and adapter queue length.
GAVG/cmd | >20 | DAVG/KAVG metrics, and Guest OS performance.

Note that:

GAVG/cmd = KAVG/cmd + DAVG/cmd

DAVG/cmd is the adapter device Driver Average Latency per Command. This is the round-trip time
in milliseconds from the HBA to the storage array and the return acknowledgement. Typically, most admins like to
see around 20ms or less, though it can vary significantly depending on your workload and its sensitivity to
latency.
A high DAVG/cmd is a good indicator that you need to start your investigation outside of ESX at the fabric and storage
array levels.

KAVG/cmd is the adapter device VMkernel Average Latency per Command. This is the
average latency between when the HBA receives the data from the storage fabric and passes it along to the Guest
OS, or vice versa; basically the round-trip time in the kernel itself. So, it should be a very low value, meaning that
the I/O operation should spend as little time as possible in the kernel (zero or near-zero is ideal).

GAVG/cmd is the adapter device Guest OS Average Latency per Command. This is the round trip in milliseconds
from the Guest OS (its perspective) through the HBA to the storage array and back. This is why
this number is the sum of DAVG/cmd + KAVG/cmd. If DAVG & KAVG are within normal thresholds, but
GAVG/cmd is high, typically this indicates that the VMs on that adapter, or at least one of them, are constrained by another
resource and need more ESXi resources in order to process IOs more quickly. In my experience, however, high
GAVG/cmd will typically be accompanied by another high value in either DAVG or KAVG.
If KAVG/cmd is greater than 1ms or so, check a couple of things.
1) Your device drivers are up-to-date and you are using compatible firmware versions, as outdated or mismatched
versions can slow down the kernel IO path;
2) Your adapter optimization settings, which will be provided by the vendor.
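To make the adapter thresholds concrete, here is a minimal sketch that applies them to one sample. The function and dictionary names are made up for illustration; the thresholds are the ones listed for disk adapters (DAVG/cmd > 20, KAVG/cmd > 1, GAVG/cmd > 20), and GAVG/cmd is derived as KAVG/cmd + DAVG/cmd as noted earlier.

```python
from typing import List

# Thresholds in milliseconds, as listed for disk adapters in this document.
THRESHOLDS_MS = {"DAVG/cmd": 20.0, "KAVG/cmd": 1.0, "GAVG/cmd": 20.0}

WHAT_TO_CHECK = {
    "DAVG/cmd": "storage processor/array performance",
    "KAVG/cmd": "kernel driver, firmware and adapter queue length",
    "GAVG/cmd": "DAVG/KAVG metrics and guest OS performance",
}


def check_adapter_latency(davg_ms: float, kavg_ms: float) -> List[str]:
    """Return one finding per latency counter that exceeds its threshold."""
    readings = {
        "DAVG/cmd": davg_ms,
        "KAVG/cmd": kavg_ms,
        "GAVG/cmd": davg_ms + kavg_ms,  # GAVG/cmd = KAVG/cmd + DAVG/cmd
    }
    return [
        f"{name} = {value:.1f} ms exceeds {THRESHOLDS_MS[name]:g} ms: check {WHAT_TO_CHECK[name]}"
        for name, value in readings.items()
        if value > THRESHOLDS_MS[name]
    ]


# Hypothetical sample: slow array, healthy kernel path.
for finding in check_adapter_latency(davg_ms=25.0, kavg_ms=0.5):
    print(finding)
```

The thresholds are rules of thumb, so treat a finding as a pointer to where to look first rather than as proof of a fault.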
Disk Device:

Metric      Threshold   What to Check
DQLEN       n/a         For reference; configured device queue length (prior to 5.0, LQLEN)
BLKSZ       n/a         For reference; configured device block size (for alignment issues)
RESETS/s    >0          Check paths and device availability
ABRTS/s     >0          Check storage fabric/array for bottleneck
QUED        >0-1        Check queue depth and storage fabric/array for bottleneck
RESV/s      n/a         Compare to CONS/s
CONS/s      >RESV/s     If >RESV/s, check for reservation conflicts with other ESXi hosts

DQLEN is the configured Device Queue Length. This is really a reference point to make sure you have
configured your devices correctly. A quick glance at this column may reveal a misconfigured
queue.

BLKSZ is the configured Device Block Size. This is another reference point to ensure that you have the
correct block size for the type of workload you are running.

RESETS/s is the number of Device SCSI Reset Commands per Second. A SCSI reset command is
issued when the SCSI operation fails to reach the target, and in a SAN environment is usually indicative of a path
down or multipathing issue, i.e., ESXi thinks a path is fine but in reality it is faulty. This is commonly seen on
Cisco Nexus fabrics as CRC errors on a port, for example.

ABRTS/s is the number of Device SCSI Abort Commands per Second. A SCSI abort command is
issued from the Guest OS when the command times out waiting for a response acknowledgement. In Windows 2008
and later, this is 60 seconds by default. Typically if you are encountering a large number of aborts, the storage
fabric/array is causing a bottleneck and is the place to begin your investigation.
If you are using something such as a NetApp FAS, be sure that you run the GOS Timeout Script on your VM or
VM template to make sure you have the proper timeout values set in order to prevent a SCSI
abort during a path failover or path problem.

QUED is the current number of Device Commands Queued in the VMkernel. As I explained previously,
this number should be at zero or near zero; otherwise it indicates that something in the kernel is throttling the IO
throughput between the Guest OS and the HBA/storage fabric/array. Check firmware versions for correct revisions
and other performance tuning options within ESXi, especially vendor recommendations.

RESV/s is the Device SCSI Reservations per Second. SCSI reservations are commonplace; that's how
SCSI commands work. This value is only important as it relates to CONS/s.

CONS/s is the Device SCSI Reservation Conflicts per Second. If this value is greater than RESV/s,
then it indicates that some other ESXi hosts are holding reservations on this particular path that conflict
with reservations currently held by this particular host. A very high value could be felt as
sluggishness in the storage subsystem due to the kernel constantly requesting SCSI locks, being denied, and
consequently retrying.
Troubleshooting SCSI reservation conflicts can be challenging. Some helpful information can be found in the
VMware KB deep-dive article on Troubleshooting SCSI Reservation Conflicts, as well as in VMware
KB 1005009 and VMware KB 1002293.
From <http://www.datacenterdan.com/blog/vsphere-55-bptroubleshooting06-esxtop-disk-devices>
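The disk device checks described above can be sketched in the same way. This is an illustrative helper, not part of esxtop: the class and function names are made up, and the "near zero" tolerance for QUED is interpreted here as "greater than 1", which is an assumption based on the ">0-1" threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceSample:
    """One reading of the disk device counters (hypothetical container)."""
    resets_per_s: float   # RESETS/s
    aborts_per_s: float   # ABRTS/s
    queued: float         # QUED
    resv_per_s: float     # RESV/s
    cons_per_s: float     # CONS/s


def review_device(sample: DeviceSample) -> List[str]:
    """Return the follow-up actions suggested by the device thresholds."""
    findings = []
    if sample.resets_per_s > 0:
        findings.append("RESETS/s > 0: check paths and device availability")
    if sample.aborts_per_s > 0:
        findings.append("ABRTS/s > 0: check storage fabric/array for bottleneck")
    if sample.queued > 1:  # assumption: values up to 1 count as "near zero"
        findings.append("QUED > 1: check queue depth and storage fabric/array")
    if sample.cons_per_s > sample.resv_per_s:
        findings.append("CONS/s > RESV/s: check for reservation conflicts with other ESXi hosts")
    return findings


# Hypothetical sample values, not taken from a real host.
for finding in review_device(DeviceSample(0, 2, 4, resv_per_s=5, cons_per_s=9)):
    print(finding)
```

RESV/s is never flagged on its own, because it only matters in comparison with CONS/s.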
Virtual Machine Disk

vscsiStats (please review)

You can output your results to a csv file for further analysis:
vscsiStats -p all -c > /tmp/output.csv

Determine use cases for and apply esxtop/resxtop Interactive, Batch and Replay modes
Use cases:
Troubleshooting poor performance for a specific VM, or identifying issues with storage, network or memory.

Interactive mode (the default mode): All statistics are displayed in real time.

Batch mode: Statistics can be collected so that the output can be saved in a file (csv) and can be viewed &
analyzed later using Windows perfmon & other tools.
~ # esxtop -b -d 20 -n 2 -a > /tmp/20secsnds2intrpts.csv
This collects 2 iterations with a 20-second delay between them (-a exports all counters) and outputs the result as csv.
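If perfmon is not at hand, the csv file from batch mode can also be inspected with a few lines of script. The sketch below is only an illustration: the function name is made up, and the sample header names are assumptions modelled on the perfmon-style layout (one timestamp column followed by one column per counter), so check them against your own file.

```python
import csv
import io
from typing import Dict, List


def select_columns(csv_text: str, needle: str) -> Dict[str, List[str]]:
    """Return {column header: values} for every column whose header contains needle."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    wanted = [i for i, name in enumerate(header) if needle in name]
    return {header[i]: [row[i] for row in data if i < len(row)] for i in wanted}


# Hypothetical two-sample extract; a real batch file has many more columns.
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Memory\Free MBytes","\\esx01\Physical Cpu(_Total)\% Util Time"
"04/21/2015 04:56:00","2048","12.50"
"04/21/2015 04:56:20","1990","14.75"
'''

for name, values in select_columns(SAMPLE, "Free MBytes").items():
    print(name, values)
```

For a real capture, read the file with open(path, newline="") and pass its contents to the same function. Filtering by a substring of the header keeps the example short, since a capture taken with -a can contain thousands of columns.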

Replay mode: It is similar to a record and replay operation. Data that was collected by the vm-support command
is interpreted and played back as esxtop statistics. We can view the captured performance information for a
particular time period as if in real time, to see what was happening during that time. It is well suited
for a VMware support person who needs to replay the stats to understand what was happening to the server during that time.
First let us see the vm-support switches:

So I run it with -p to collect the performance data, -d for a duration of 100 seconds, and -i for 2-second
intervals:

/vmfs/volumes/4aaa440f-1a187eb4-6f5e-0000c985147e/LoGs # vm-support -p -d 100 -i 2 -w /vmfs/volumes/4aaa440f-1a187eb4-6f5e-0000c985147e/LoGs

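Then reconstruct the data:

/vmfs/volumes/4aaa440f-1a187eb4-6f5e-0000c985147e/LoGs # cd esx-esx01.com-2015-04-2104.56/
./reconstruct.sh

The reconstructed directory can then be replayed by running esxtop -R followed by the path to that directory.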
