ESXTOP
esxtop is the utility for examining real-time resource usage locally on an ESX host, while resxtop does the
same for both ESX and ESXi hosts. esxtop can only be used on the local machine, whereas resxtop can be used
remotely to view the resource utilization of ESX/ESXi hosts from other ESX/ESXi servers or from the vMA.
esxtop/resxtop has three modes:
Interactive Mode
Batch Mode
Replay Mode
Interactive mode (the default mode): All statistics are displayed in real time.
Batch mode: Statistics are collected and saved to a CSV file, which can be viewed and analyzed later
using Windows perfmon and other tools.
Replay mode: This is similar to a record-and-replay operation. Data that was collected by the vm-support
command is interpreted and played back as esxtop statistics. We can view the captured performance
information for a particular time period as if it were real time, to see what was happening during
that time. It is well suited to VMware support staff, who can replay the stats to understand what was
happening to the server during that time.
In interactive mode (the default mode), all statistics are displayed in real time, similar to
Windows Task Manager. By default the screen refreshes every 5 seconds.
[Screenshot: esxtop showing the memory statistics]
The single-key commands below switch between the different statistics screens while esxtop runs in interactive mode.
c CPU view (default screen when you type esxtop)
Type c in interactive mode to switch to the CPU resource utilization screen of the ESX server.
m Memory view
Type m in interactive mode to switch to the memory resource utilization screen of the ESX server.
d Disk adapter view
Type d in interactive mode to switch to the storage disk adapter resource utilization screen of the ESX server.
u Disk device view
Type u in interactive mode to switch to the storage disk device resource utilization screen of the ESX server.
v Virtual disk view
Type v in interactive mode to switch to the virtual disk resource utilization screen of the ESX server.
n Network view
Type n in interactive mode to switch to the network utilization screen of the ESX server.
y Power management
Type y in interactive mode to switch to the power utilization screen of the ESX server.
h Help screen for esxtop
Type h to display the help for esxtop commands.
o Order the fields in the current view. Use the field letters (a-o) to change the order: uppercase moves a field
left, lowercase moves a field right.
s Set the delay between screen refreshes. The default is 5 seconds. Press the space bar to refresh immediately.
W Save the customized fields. Add or remove fields as you wish; if you want the customized fields to load
every time, save them under the default name (default: /root/.esxtop4rc), or save them under a name of
your choice.
To run esxtop in batch mode and save the output to a file for future analysis, use the following syntax:
esxtop -b -d 10 -n 5 >/home/mohammedk/esxtstats.csv
Once the command has completed, browse to /home/mohammedk to find the esxtop output file
esxtstats.csv. Transfer the CSV file to your Windows desktop using WinSCP and analyze it using
Windows perfmon or esxplot.
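If you would rather script a first pass over the batch output than open it in perfmon, the idea can be sketched in Python. This is only an illustration: the function name and the sample header strings are made up, and it assumes the first CSV column is the timestamp and every other column is one counter.

```python
import csv
import io
from statistics import mean


def average_counters(csv_text, name_fragment):
    """Return {column name: mean value} for columns whose header contains name_fragment."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Assumption: column 0 is the timestamp, all other columns are counters.
    wanted = [i for i, name in enumerate(header) if i > 0 and name_fragment in name]
    samples = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            if i < len(row) and row[i] != "":
                samples[header[i]].append(float(row[i]))
    return {name: mean(values) for name, values in samples.items() if values}


# Hypothetical two-sample capture; a real batch file has far more columns.
SAMPLE = "\n".join([
    r'"Time","\\host1\Group Cpu(1:vm01)\% Ready","\\host1\Group Cpu(2:vm02)\% Ready"',
    '"06/28/2012 13:51:29","4.0","1.0"',
    '"06/28/2012 13:51:39","6.0","3.0"',
])

for name, value in average_counters(SAMPLE, "% Ready").items():
    print(name, value)
```

For a real capture, read the file contents into a string first and pass a fragment of the counter name you are interested in.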
ESXTOP Replay Mode
Replay mode is similar to a record-and-replay operation. Data that was collected by the vm-support command is
interpreted and played back as esxtop statistics. We can view the captured performance information for a particular
time period as if it were real time, to see what happened during that time.
This is very useful for VMware support engineers who don't have access to your system when troubleshooting
performance issues. They can run esxtop against the collected support file to analyze a performance issue that occurred
during that particular time. Make sure you have enough free space on your server to save the support file.
Collecting data for a longer duration will consume a huge amount of disk space.
To run esxtop in replay mode, first run the vm-support command. I am running it from the directory
/home/mohammedk, so the output file will be saved in that directory.
vm-support -s -i 5 -d 10
-i is the iteration and -d is the delay between refreshes. The above command collects stats for 50 seconds
(10 seconds * 5 iterations). Once vm-support has completed, all the files are stored in
/home/mohammedk.
Extract the file esx-2012-06-2813.51.29993.tgz into the same directory using the command below:
tar -xzf esx-2012-06-2813.51.29993.tgz
To run esxtop in replay mode, run the command below against the extracted directory
vm-support-vmware-arena-2012-06-2813.51.29993:
esxtop -R vm-support-vmware-arena-2012-06-2813.51.29993
The output looks like that of the normal esxtop command, but here we are replaying the contents of the
support file using esxtop replay mode.
ESXTOP is a fantastic tool available to the VMware administrator when troubleshooting performance issues in a
vSphere environment. ESXTOP has a somewhat steep learning curve, but it is all worth it. In this post I want to help
you get a head start with ESXTOP. If you want a really good read I recommend Duncan's very comprehensive post
on the same subject.
ESXTOP is available in two ways: either through the ESXi Shell or through the vSphere Management Assistant
with the command RESXTOP. In this article I will focus on ESXTOP from the ESXi Shell. It is very simple to get
access to ESXTOP.
Step 1: Get access to the ESXi Shell. Open your vSphere Client, go to the host, then Configuration, then
Security Profile, and start the ESXi Shell service on the specific ESXi host.
Step 2: Download PuTTY (or another SSH client) and create an SSH connection on port 22 to your ESXi host. Log in
as root with your password.
Step 3: Type the command esxtop and hit return.
Step 4: You are now looking at ESXTOP. It should look similar to this:
What you are looking at is the CPU screen in ESXTOP, which shows CPU-specific counters. You can
browse through the different pages. If you type h you will see all available commands. By default ESXTOP shows
a lot of worlds; a world is similar to a process in Windows Task Manager. To filter out the vmkernel worlds, type
upper case V. By doing this you only see the virtual machines running on this specific ESXi host.
Now that you are inside ESXTOP, let's focus on some good counters to use for performance troubleshooting.
CPU
When troubleshooting CPU performance for your virtual machines the following counters are the most important.
CPU ready is the time a virtual CPU is ready to run but is not being scheduled on a physical CPU. Under
normal circumstances this indicates that there are not enough physical CPU resources on an ESX/ESXi host. This is the
first go-to counter when your users complain about bad performance.
The CPU ready counter is accessible from the vSphere Client and from ESXTOP. I have made two screenshots
showing a virtual machine and its ready time:
What we see is a virtual machine with a ready time of 1035 ms or 5.38%. These numbers are actually telling us the
same thing. The performance graphs update every 20 seconds (20,000 milliseconds), so a ready time of
1035 ms can be expressed as a percentage of that interval.
To be able to interpret ready times it is essential to know the relationship between the percentage shown in ESXTOP
and the milliseconds shown in the performance graphs. You are seeing the same numbers; one is in milliseconds, the
other is a percentage.
1% = 200 ms.
5% = 1,000 ms.
10% = 2,000 ms.
100% = 20,000 ms.
In general you want to see virtual machines with a ready time lower than 1000 ms. or 5%.
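The conversion between the two units is simple enough to sketch in a few lines of Python. The helper names below are made up for illustration, and the default interval of 20,000 ms is the performance graph update interval mentioned above; adjust it if your graphs use a different interval.

```python
def ready_ms_to_percent(ready_ms, interval_ms=20_000):
    """Convert a ready time in milliseconds to a percentage of the sample interval."""
    return ready_ms * 100.0 / interval_ms


def ready_percent_to_ms(percent, interval_ms=20_000):
    """Convert a ready percentage back to milliseconds within the sample interval."""
    return percent * interval_ms / 100.0


# The rows of the list above, recomputed.
for pct in (1, 5, 10, 100):
    print(pct, "% =", ready_percent_to_ms(pct), "ms")

# The 1035 ms example from the performance graph comes out at roughly 5%.
print(ready_ms_to_percent(1035))
```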
Just heard of a cool calculator to convert CPU ready times to a percentage: http://www.vmcalc.com/
%CSTP tells you how much time a vCPU of a multi-vCPU virtual machine spends waiting for its other vCPUs to catch
up. If this number is higher than 3% you should consider lowering the number of vCPUs in your virtual machine.
Memory
When troubleshooting memory performance these are the counters you want to focus on from a virtual machine
perspective.
MCTLSZ This column shows how inflated the balloon is in the virtual machine. If it says 500MB, the
balloon driver inside the guest operating system has reclaimed 500MB from Windows/Linux etc. You would
expect to see a value of 0 (zero) in this column.
SWCUR tells you how much memory the virtual machine has in the .vswp file. If you see 500MB
here, it means that 500MB of the virtual machine's memory sits in the swap file. This does not necessarily equal bad
performance. To figure out if your virtual machine is suffering from hypervisor swapping you need to look at the next
two counters. In a healthy environment you would want this value to be 0 (zero).
SWR/s This value tells you the read activity on your swap file. If you see a number above 0 here, your virtual
machine is suffering from hypervisor swapping.
SWW/s This value tells you the write activity on your swap file. You want to see the number 0 (zero) here. Every
number above 0 is BAD.
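The rules of thumb above can be written down as a small check. This is a sketch only: the function name and the message wording are invented for illustration, and the inputs are assumed to be values you have read off the esxtop memory screen for one virtual machine.

```python
def memory_findings(mctlsz_mb, swcur_mb, swr_per_s, sww_per_s):
    """Apply the rules of thumb for MCTLSZ, SWCUR, SWR/s and SWW/s to one VM."""
    findings = []
    if mctlsz_mb > 0:
        findings.append("ballooning: balloon inflated to %.0f MB" % mctlsz_mb)
    if swcur_mb > 0:
        # Swapped memory alone is not necessarily a performance problem.
        findings.append("swapped: %.0f MB in the .vswp file" % swcur_mb)
    if swr_per_s > 0 or sww_per_s > 0:
        findings.append("active hypervisor swapping (SWR/s or SWW/s above 0)")
    return findings


# Hypothetical VM: 500 MB ballooned, 500 MB swapped, currently reading from swap.
for line in memory_findings(500, 500, 2.5, 0):
    print(line)
```

An empty list means all four counters are at zero, which is what you want to see in a healthy environment.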
ESXTOP
This page is solely dedicated to one of the best tools in the world for ESX; esxtop.
Intro
I am a huge fan of esxtop! I read a couple of pages of the esxtop bible every day before I go to bed. Something I
always struggle with, however, is the thresholds of specific metrics. I fully understand that it is not
black and white; performance is the perception of a user in the end.
There must be a certain threshold, however. For instance, it must be safe to say that when %RDY constantly exceeds
the value of 20 it is very likely that the VM responds sluggishly. I want to use this article to define these thresholds,
but I need your help. There are many people reading these articles, and together we must know at least a dozen metrics.
Let's collect and document them with possible causes if known.
Please keep in mind that these should only be used as a guideline when doing performance troubleshooting! Also be
aware that some metrics are not part of the default view. You can add fields to an esxtop view by pressing f
followed by the corresponding character.
I used VMworld presentations, VMware whitepapers, VMware documentation, VMTN topics and of course my
own experience as sources, and these are the metrics and thresholds I came up with so far. Please comment and help
build the main source for esxtop thresholds.
Metrics and Thresholds
Display   Metric     Threshold   Explanation
CPU       %RDY       10          -
CPU       %CSTP      -           -
CPU       %SYS       20          -
CPU       %MLMTD     -           If larger than 0 the world is being throttled due to the limit on CPU.
CPU       %SWPWT     -           VM waiting on swapped pages to be read from disk. Possible cause: Memory overcommitment.
MEM       MCTLSZ     -           If larger than 0 host is forcing VMs to inflate balloon driver to reclaim [...]
MEM       SWCUR      -           -
MEM       SWR/s      -           If larger than 0 host is actively reading from swap (vswp). Possible cause: Overcommitment.
MEM       SWW/s      -           -
MEM       CACHEUSD   -           -
MEM       ZIP/s      -           If larger than 0 host is actively compressing memory. Possible cause: overcommitment.
MEM       UNZIP/s    -           If larger than 0 host is accessing compressed memory. Possible cause: Memory overcommitment.
MEM       N%L        80          [...] are used.
NETWORK   %DRPTX     -           Dropped packets transmitted, hardware overworked. Possible cause: [...]
NETWORK   %DRPRX     -           -
DISK      GAVG       25          -
DISK      DAVG       25          -
DISK      KAVG       -           -
DISK      QUED       -           -
DISK      ABRTS/s    -           [...] whatever reason.
DISK      RESETS/s   -           The number of commands reset per second.
DISK      CONS/s     20          -
Running esxtop
Although understanding all the metrics esxtop provides seems impossible, using esxtop is fairly simple. Once
you get the hang of it you will notice yourself staring at the metrics/thresholds more often than ever. The following
keys are the ones I use the most.
Open a console session or SSH to ESX(i) and type:
esxtop
By default the screen is refreshed every 5 seconds; change this by typing:
s2
Changing views is easy; type the following keys for the associated views:
c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device (includes NFS as of 4.0 Update 2)
v = disk VM
p = power states
V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limiting the number of entities, for instance the top 5
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world
Add/Remove fields:
f
<type appropriate character>
Changing the order:
o
Analyzing results
You can use multiple tools to analyze the captured data.
1. VisualEsxtop
2. perfmon
3. excel
4. esxplot
Go to the folder
Double-click visualesxtop.bat when running Windows (or follow William's tip for the Mac)
That is it
By default the refresh interval is set to 5 seconds. You can change this by hitting Configuration and then
Change Interval
You can also load batch output, which might come in handy when you are a consultant, for instance, and a
customer sends you captured data. You can do this under: File -> Load Batch Output
You can filter output, which is very useful if you are looking for info on a specific virtual machine / world! See the
filter section.
When you click Charts and double-click Object Types you will see a list of metrics that you can create
a chart with. Just unfold the ones you need and double-click them to add them to the right pane
There are a bunch of other cool features in there, like color-coding of important metrics. The fact
that you can show multiple windows at the same time is also useful if you ask me, and of course so are the tooltips that
provide a description of the counter! If you ask me, it is a tool everyone should download and check out.
Let's continue with my second favorite tool, perfmon. I've used perfmon (part of Windows, also known as
Performance Monitor) multiple times and it's probably the easiest as many people are already familiar with it. You
can import a CSV as follows:
1. Run: perfmon
4. Select the Log files: radio button from the Data source section.
8. Optionally: reduce the range of time over which the data will be displayed by using the sliders under the
Time Range button.
With MS Excel it is also possible to import the data as a CSV. Keep in mind though that the amount of captured data
is insane, so you might want to limit it by first importing it into perfmon, selecting the correct timeframe and
counters, and exporting this to a CSV. When you have done so you can import the CSV as follows:
1. Run: excel
2. Click on Data
8. All data should be imported and can be shaped / modelled / diagrammed as needed.
Another option is to use a tool called esxplot. It hasn't been updated in a while, and I am not sure what the state of
the tool is. You can still download the latest version, but personally I would recommend using VisualEsxtop
instead of esxplot, just because it is more recent.
1. Run: esxplot
As you can clearly see in the screenshot above, the legend (right of the graph) is too long. You can modify that.
For those using a Mac, esxplot uses specific libraries which are only available in the 32-bit version of Python. In
order for esxplot to function correctly, set the following environment variable:
export VERSIONER_PYTHON_PREFER_32_BIT=yes
Limiting your view
In environments with a very high consolidation ratio (high number of VMs per host) it could occur that the VM you
need performance counters for isn't shown on your screen. This happens purely because the height of
the screen limits what it can display. Unfortunately there is currently no command line option for esxtop to
specify the VMs that need to be displayed. However, you can export the current list of worlds and import it
again to limit the number of VMs shown.
esxtop -export-entity filename
Now you should be able to edit your file and comment out the worlds that do not need to be displayed.
esxtop -import-entity filename
I figured that there should be a way to get the info through the command line as well, and this is what I came up with.
Please note that <virtualmachinename> needs to be replaced with the name of the virtual machine that you need the
GID for.
Identify CPU, memory, network, disk device or disk issues using ESXTOP in interactive,
batch or replay mode
Determine use cases for and apply esxtop Interactive, Batch and Replay modes
Use vscsiStats to gather storage performance data
Use esxtop/resxtop to collect performance data
Switch display: c:cpu i:interrupt m:memory n:network
Of course my screen is not big enough to show all of them, but the magic is that pressing h here
takes you to the help screen. My concern here is not the help itself but how to sort the screen; for the
one above I have the sort options below:
Sort by:
M:MEMSZ B:MCTLSZ N:GID
When troubleshooting memory performance these are the counters you want to focus on from a virtual machine
perspective.
MCTL? This column is either YES or NO. If it says Yes, the balloon driver is installed. The balloon driver
is installed automatically with VMware Tools and should be in every virtual machine. If this column says No,
figure out why.
Sequence of memory bottleneck
Sort by:
T:MbTX/s R:MbRX/s
t:PKTTX/s r:PKTRX/s
N:Default
SPEED (Mbps) The link speed in Megabits per second. This information is only valid for a physical NIC.
FDUPLX Y implies the corresponding link is operating at full duplex. N implies it is not. This information is
only valid for a physical NIC.
UP Y implies the corresponding link is up. N implies it is not. This information is only valid for a physical
NIC.
PKTTX/s The number of packets transmitted per second.
PKTRX/s The number of packets received per second.
MbTX/s (Mbps) The MegaBits transmitted per second.
MbRX/s (Mbps) The MegaBits received per second.
Q: Why does MbRX/s not match PKTRX/s for different workloads?
A: This is because the packet size may not be the same. The average packet size can be computed as follows:
average_packet_size = MbRX/s / PKTRX/s . A large packet size may improve CPU efficiency of processing the
packet. However, it may potentially increase latency.
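To turn that ratio into bytes per packet, the units have to be converted first. The short Python sketch below does this; the function name is made up, and it assumes that one megabit is 1,000,000 bits, which is worth checking against your esxtop version.

```python
def average_packet_bytes(mb_rx_per_s, pkt_rx_per_s):
    """Average received packet size in bytes, from MbRX/s and PKTRX/s."""
    if pkt_rx_per_s == 0:
        return 0.0  # no packets received, so there is no average to report
    # Assumption: 1 megabit = 1,000,000 bits; 8 bits per byte.
    bytes_per_s = mb_rx_per_s * 1_000_000 / 8
    return bytes_per_s / pkt_rx_per_s


# Two hypothetical workloads with the same throughput but different packet rates.
print(average_packet_bytes(120.0, 10_000.0))
print(average_packet_bytes(120.0, 100_000.0))
```

The first call works out to 1500 bytes per packet and the second to 150, which is why the same MbRX/s can go with very different PKTRX/s values.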
KAVG/cmd, DAVG/cmd, GAVG/cmd, QAVG/cmd
Metric     Threshold   What to Check
DAVG/cmd   >20         -
KAVG/cmd   >1          -
GAVG/cmd   >20         -
Note that:
DAVG/cmd is the adapter's Device Driver Average Latency per Command. This is the round trip
in milliseconds from the HBA to the storage array and the return acknowledgement. Typically, most admins like to
see around 20 ms or less, though it can vary significantly depending on your workload and its sensitivity to
latency.
A high DAVG/cmd is a good indicator that you need to start your investigation outside of ESX, at the fabric and
storage array levels.
KAVG/cmd is the adapter's VMkernel Average Latency per Command. This is the
average latency between when the HBA receives the data from the storage fabric and passes it along to the Guest
OS, or vice versa; basically the round-trip time in the kernel itself. So it should be a very low value, meaning that
the I/O operation should spend as little time as possible in the kernel (zero or near-zero is ideal).
GAVG/cmd is the adapter's Guest OS Average Latency per Command. This is the round trip in milliseconds from
the Guest OS (its perspective) through the HBA to the storage array and back. This is why
this number is the sum of DAVG/cmd + KAVG/cmd. If DAVG & KAVG are within normal thresholds but
GAVG/cmd is high, this typically indicates that the VMs on that adapter, or at least one of them, are constrained by
another resource and need more ESXi resources in order to process IOs more quickly. In my experience, however, high
GAVG/cmd will typically be accompanied by another high value in either DAVG or KAVG.
If KAVG/cmd is greater than 1 ms or so, check a couple of things:
1) Your device drivers are up to date and you are using compatible firmware versions, as a mismatch can slow down the
kernel IO path;
2) Your adapter optimization settings, which will be provided by the vendor (some of which we will discuss in the
next post).
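The relationship between the three latency counters and the thresholds above can be sketched as a small Python check. The function name, parameter names and hint texts are invented for illustration; the limits default to the 20 ms and 1 ms guideline values from the table and should be tuned to your workload.

```python
def check_latency(davg_ms, kavg_ms, davg_limit=20.0, kavg_limit=1.0, gavg_limit=20.0):
    """Return (GAVG, hints) for one adapter, with GAVG taken as DAVG + KAVG."""
    gavg_ms = davg_ms + kavg_ms
    hints = []
    if davg_ms > davg_limit:
        hints.append("DAVG high: investigate the fabric and the storage array")
    if kavg_ms > kavg_limit:
        hints.append("KAVG high: check drivers, firmware and adapter settings")
    if gavg_ms > gavg_limit and not hints:
        hints.append("GAVG high although DAVG and KAVG are within their limits")
    return gavg_ms, hints


# Hypothetical adapter: slow array, healthy kernel path.
gavg, hints = check_latency(davg_ms=25.0, kavg_ms=0.5)
print(gavg, hints)
```

Treating GAVG as a derived value mirrors the text: a high guest latency is usually explained by whichever of the two components is out of range.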
Disk Device:
Metric     Threshold   What to Check
DQLEN      n/a         -
BLKSZ      n/a         -
RESETS/s   >0          -
ABRTS/s    >0          -
QUED       >0-1        -
RESV/s     n/a         -
CONS/s     >RESV/s     [...] ESXi hosts
DQLEN is the configured Device Queue Length. This is really a reference point to make sure you have
configured your devices correctly. A quick glance, as in the screenshot above, and you might notice one queue
misconfigured.
BLKSZ is the configured Device Block Size. This is another reference point to ensure that you have the
correct block size for the type of workload you are running.
RESETS/s is the number of Device SCSI Reset Commands per Second. A SCSI reset command is
issued when the SCSI operation fails to reach the target, and in a SAN environment it usually indicates a path
down or a multipathing issue, i.e., ESXi thinks a path is fine but in reality it is faulty. This is commonly seen on
Cisco Nexus fabrics as CRC errors on a port, for example.
ABRTS/s is the number of Device SCSI Abort Commands per Second. A SCSI abort command is
issued from the Guest OS when the command times out waiting for a response acknowledgement. In Windows 2008
and later, this is 60 seconds by default. Typically, if you are encountering a large number of aborts, the storage
fabric/array is causing a bottleneck and is the place to begin your investigation.
If you are using something such as a NetApp FAS, be sure that you run the GOS Timeout Script on your VM or
VM template to make sure you have the proper timeout values (login required) set in order to prevent a SCSI
abort during a path failover or path problem.
QUED is the current number of Device Commands Queued in the VMkernel. As I explained previously,
this number should be at zero or near zero; otherwise it indicates that something in the kernel is throttling the IO
throughput between the Guest OS and the HBA/storage fabric/array. Check firmware versions for correct revisions
and other performance tuning options within ESXi, especially vendor recommendations.
RESV/s is the number of Device SCSI Reservations per Second. SCSI reservations are commonplace; that's how
SCSI commands work. This value is only important as it relates to CONS/s.
CONS/s is the number of Device SCSI Reservation Conflicts per Second. If this value is greater than RESV/s,
it indicates that some other ESXi hosts are holding reservations on this particular path that conflict
with reservations currently held by this particular host. A very high value could be felt as
sluggishness in the storage subsystem, due to the kernel constantly requesting SCSI locks, being denied and
consequently retrying.
Troubleshooting SCSI reservation conflicts can be challenging. Some helpful information can be found in the
VMware KB deep-dive article on Troubleshooting SCSI Reservation Conflicts, as well as in VMware
KB 1005009 and VMware KB 1002293.
From <http://www.datacenterdan.com/blog/vsphere-55-bptroubleshooting06-esxtop-disk-devices>
Virtual Machine Disk
vscsiStats
You can output your results to a CSV file for further analysis:
vscsiStats -p all -c > /tmp/output.csv
Determine use cases for and apply esxtop/resxtop Interactive, Batch and Replay modes
Use cases:
Troubleshooting poor performance for a specific VM, or identifying issues with storage, network or memory.
Interactive mode (the default mode): All statistics are displayed in real time.
Batch mode: Statistics are collected and saved to a CSV file, which can be viewed and
analyzed later using Windows perfmon and other tools.
~ # esxtop -b -d 20 -n 2 -a > /tmp/20secsnds2intrpts.csv
This takes 2 samples with a 20-second delay between them (-d 20 is the delay, -n 2 the number of iterations,
-a exports all counters) and writes the output as CSV.
Replay mode: This is similar to a record-and-replay operation. Data that was collected by the vm-support command
is interpreted and played back as esxtop statistics. We can view the captured performance information for a
particular time period as if it were real time, to see what was happening during that time. It is well suited
to VMware support staff, who can replay the stats to understand what was happening to the server during that time.
First let us see the vm-support switches:
So I run it with -p to collect the performance data and -d for a period of 100 seconds, at 2-second
intervals.