Performance Counter Thresholds for Windows Server
When you need to measure how many system resources your application consumes, you need to pay
particular attention to the following:
o Disk I/O. Amount of read and write disk activity. I/O bottlenecks occur if read and write operations
begin to queue.
o Memory. Amount of available memory, virtual memory, and cache utilization.
o Network. Percent of the available bandwidth being utilized, network bottlenecks.
o Processor. Processor utilization, context switches, interrupts and so on.
The next sections describe the performance counters that help you measure the preceding metrics,
starting with a System Overview (a general analysis of operating system performance counters).
Each entry is formatted as:
o Counter (Explanation)
o Thresholds
o Disk
o \LogicalDisk(*)\Avg. Disk sec/Read (Avg. Disk sec/Read is the average time, in seconds, of a
read of data from the disk. This analysis determines if any of the logical disks are responding
slowly)
o Average disk responsiveness is slow – more than 15ms
o Average disk responsiveness is very slow – more than 25ms
o Disk responsiveness is very slow (spike of more than 25ms)
o \LogicalDisk(*)\Disk Transfers/sec (Disk Transfers/sec is the rate of read and write
operations on the disk)
o Less than 80 I/Os per second on average when disk latency is longer than 25ms.
This may indicate too many virtual LUNs using the same physical disks on a SAN.
o Less than 80 I/Os per second (as a spike, not an average) when disk latency is
longer than 25ms. This may indicate too many virtual LUNs using the same
physical disks on a SAN.
o \PhysicalDisk(*)\Avg. Disk sec/Read (Avg. Disk sec/Read is the average time, in seconds, of
a read of data from the disk. This analysis determines if any of the physical disks are
responding slowly)
o Average disk responsiveness is slow – more than 15ms
o Average disk responsiveness is very slow – more than 25ms
o Disk responsiveness is very slow (spike of more than 25ms)
o \PhysicalDisk(*)\Avg. Disk sec/Write (Avg. Disk sec/Write is the average time, in seconds, of
a write of data to the disk. This analysis determines if any of the physical disks are
responding slowly)
o Average disk responsiveness is slow – more than 15ms
o Average disk responsiveness is very slow – more than 25ms
o Disk responsiveness is very slow (spike of more than 25ms)
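The latency thresholds above can be sketched as a small classifier. This is an illustrative helper, not part of any Microsoft tooling; note that the Avg. Disk sec/Read and Avg. Disk sec/Write counters report seconds, so values are converted to milliseconds first.

```python
# Illustrative helper (not part of PerfMon): classify an average value of
# \PhysicalDisk(*)\Avg. Disk sec/Read or .../Avg. Disk sec/Write against
# the 15 ms / 25 ms thresholds above. The counter reports seconds.
def classify_disk_latency(avg_disk_sec: float) -> str:
    latency_ms = avg_disk_sec * 1000.0  # convert seconds to milliseconds
    if latency_ms > 25:
        return "very slow"
    if latency_ms > 15:
        return "slow"
    return "ok"

print(classify_disk_latency(0.030))  # a 30 ms average reads as "very slow"
```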
o Memory
Kernel Mode Memory
o \Memory\Free System Page Table Entries (Free System Page Table Entries is the number of
page table entries not currently in use by the system. This analysis determines if the
system is running out of free system page table entries (PTEs) by checking if there are fewer
than 5,000 free PTEs, with a warning if there are fewer than 10,000 free PTEs. Running out
of PTEs can result in system-wide hangs)
o Running low on PTEs – less than 10,000 (If the free PTEs are under 10,000, the
system is approaching a system-wide hang)
o Critically low on PTEs – less than 5,000 (If the free PTEs are under 5,000, the system
is close to a system-wide hang)
o \Memory\Pages Input/sec (Pages Input/sec is the rate at which pages are read from disk to
resolve hard page faults. Hard page faults occur when a process refers to a page in virtual
memory that is not in its working set or elsewhere in physical memory, and must be
retrieved from disk. When a page is faulted, the system tries to read multiple contiguous
pages into memory to maximize the benefit of the read operation. Compare the value of
Memory\Pages Input/sec to the value of Memory\Page Reads/sec to determine the
average number of pages read into memory during each read operation)
o More than 10 page file reads per second
o \Memory\Pages/sec (If this is high, the system is likely running low on memory and paging
memory to disk. Pages/sec is the rate at which pages are read from or
written to disk to resolve hard page faults. This counter is a primary indicator of the kinds
of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and
Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to
other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes
pages retrieved to satisfy faults in the file system cache (usually requested by applications)
non-cached mapped memory files)
o High pages/sec – greater than 1000 (If this is higher than 1000, the system could be
beginning to run out of memory. Consider reviewing the processes to see which
processes are taking up the most memory or consider adding more memory)
o Very high average pages/sec – greater than 2500 (If this is greater than 2500, the
system could be experiencing system-wide delays due to insufficient memory.
Consider reviewing the processes to see which processes are taking up the most
memory or consider adding more memory)
o Critically high average pages/sec – greater than 5000 (If this is greater than 5000,
the system is most likely experiencing delays due to insufficient memory.
Consider reviewing the processes to see which processes are taking up the most
memory or consider adding more memory)
o Spike in pages/sec – greater than 1000 (If a spike exceeds 1000, the system could be
experiencing momentary delays due to insufficient memory. Consider reviewing
the processes to see which processes are taking up the most memory or consider
adding more memory)
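The pages/sec severity bands above can be summarized in a short, illustrative helper (the function name and structure are hypothetical, not part of any monitoring tool):

```python
# Illustrative mapping of an average \Memory\Pages/sec value onto the
# severity bands described above (1000 / 2500 / 5000).
def classify_pages_per_sec(pages_per_sec: float) -> str:
    if pages_per_sec > 5000:
        return "critically high"
    if pages_per_sec > 2500:
        return "very high"
    if pages_per_sec > 1000:
        return "high"
    return "normal"
```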
o \Memory\Pool Paged Bytes (This analysis determines if the system is getting close to the
maximum paged pool size. Pool Paged Bytes is the size, in bytes, of the paged pool, an area of system
memory (physical memory used by the operating system) for objects that can be written to
disk when they are not being used)
o Low on Pool Paged memory – less than 40% available
o Critically low on Pool Paged memory – less than 20% available
User Mode Memory
o \Process(*)\Private Bytes (Private Bytes is the current size, in bytes, of memory that this
process has allocated that cannot be shared with other processes)
o For Windows 32-bit: no more than a 250MB delta between minimum size and
maximum size (Maximum – Minimum should not exceed 250MB)
o For Windows 64-bit: no more than a 500MB delta between minimum size and
maximum size (Maximum – Minimum should not exceed 500MB)
o \Process(*)\Working Set (Working Set is the current size, in bytes, of the Working Set of this
process. The Working Set is the set of memory pages touched recently by the threads in the
process. If free memory in the computer is above a threshold, pages are left in the Working
Set of a process even if they are not in use. When free memory falls below a threshold,
pages are trimmed from Working Sets. If they are needed they will then be soft-faulted back
into the Working Set before leaving main memory)
o For Windows 32-bit: no more than a 250MB delta between minimum size and
maximum size (Maximum – Minimum should not exceed 250MB)
o For Windows 64-bit: no more than a 500MB delta between minimum size and
maximum size (Maximum – Minimum should not exceed 500MB)
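The delta rule for Private Bytes and Working Set can be expressed as a simple check over a series of sampled values. This is an illustrative sketch, not an official formula:

```python
# Illustrative check of the delta rule above: the spread between the
# smallest and largest sampled value of Private Bytes (or Working Set)
# should not exceed 250MB on 32-bit Windows or 500MB on 64-bit Windows.
MB = 1024 * 1024

def exceeds_delta(samples_bytes, is_64bit):
    limit_mb = 500 if is_64bit else 250
    return max(samples_bytes) - min(samples_bytes) > limit_mb * MB
```

A 300MB spread would therefore be flagged on a 32-bit system but not on a 64-bit one.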
o \Process(*)\Thread Count (The number of threads currently active in this
process. An instruction is the basic unit of execution in a processor, and a
thread is the object that executes instructions. Every running process has at
least one thread.)
o For Windows 32-bit: for 2GB of memory, a maximum of 2,048 threads
o For Windows 64-bit: for 2GB of memory, a maximum of 6,600 threads
o \Process(*)\Handle Count (This analysis checks how many handles each process has open and
determines if a handle leak is suspected. A process with a large number of
handles and/or an aggressive upward trend could indicate a handle leak, which
typically results in a memory leak. Handle Count is the total number of handles
currently open by this process. This number is equal to the sum of the handles
currently open by each thread in this process)
o For Windows 32 Bit: For most processes, if higher than
2,500 handles open, investigate.
Exceptions are:
System 10,000
lsass.exe 30,000
store.exe 30,000
sqlservr.exe 30,000
o For Windows 64 Bit: For most processes, if higher than
3,000 handles open, investigate.
Exceptions are:
System 20,000
lsass.exe 50,000
store.exe 50,000
sqlservr.exe 50,000
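The handle-count rules above, including the per-process exceptions, can be sketched as follows. This helper is hypothetical; note the canonical SQL Server process name is spelled sqlservr.exe:

```python
# Illustrative helper applying the handle-count thresholds, including the
# documented per-process exceptions (thresholds per the tables above).
EXCEPTIONS_32 = {"System": 10_000, "lsass.exe": 30_000,
                 "store.exe": 30_000, "sqlservr.exe": 30_000}
EXCEPTIONS_64 = {"System": 20_000, "lsass.exe": 50_000,
                 "store.exe": 50_000, "sqlservr.exe": 50_000}

def should_investigate(process, handle_count, is_64bit):
    exceptions = EXCEPTIONS_64 if is_64bit else EXCEPTIONS_32
    default = 3_000 if is_64bit else 2_500
    return handle_count > exceptions.get(process, default)
```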
Network
\Network Interface(*)\Output Queue Length
o High Network I/O – more than 1 thread waiting on network I/O (If the output queue length
is greater than 1, this system’s network is nearing capacity. Consider analyzing
network traffic to determine why network I/O is nearing capacity, such as *chatty* network
services and/or large data transfers)
o Very high network I/O – more than 2 threads waiting on network I/O (If the output queue
length is greater than 2, this system’s network is over capacity. Consider analyzing
network traffic to determine why network I/O is over capacity, such as *chatty* network
services and/or large data transfers)
o Network Utilization Analysis (Bytes Total/sec is the rate at which bytes are sent and received over
each network adapter, including framing characters. Network Interface\Bytes Total/sec is the sum
of Network Interface\Bytes Received/sec and Network Interface\Bytes Sent/sec. This counter
indicates the rate at which bytes are sent and received over each network adapter. This counter
helps you know whether the traffic at your network adapter is saturated and if you need to add
another network adapter. How quickly you can identify a problem depends on the type of network
you have as well as whether you share bandwidth with other applications)
o \Network Interface(*)\Bytes Total/sec
o \Network Interface(*)\Current Bandwidth
o Thresholds:
o High average network utilization – more than 50%
o Very high average network utilization – more than 80%
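Utilization is derived from the two counters above. Because Bytes Total/sec reports bytes while Current Bandwidth reports bits per second, the bytes value must be multiplied by 8; a hypothetical helper:

```python
# Illustrative utilization calculation: Bytes Total/sec is in bytes,
# Current Bandwidth is in bits per second, so multiply bytes by 8.
def network_utilization_pct(bytes_total_per_sec, current_bandwidth_bits):
    return (bytes_total_per_sec * 8.0) / current_bandwidth_bits * 100.0

# e.g. 62.5 MB/s on a 1 Gbit/s link works out to 50% utilization
```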
o Server\Bytes Total/sec (This counter indicates the number of bytes sent and received over the
network. Higher values indicate network bandwidth as the bottleneck. If the sum of Bytes
Total/sec for all servers is roughly equal to the maximum transfer rates of your network, you may
need to segment the network)
o Should not be more than 50 percent of network capacity.
o Processor:
o Processor\% Processor Time (This counter is the primary indicator of processor activity.
High values may not necessarily be bad. However, if other processor-related
counters are increasing linearly, such as % Privileged Time or Processor Queue Length,
high CPU utilization may be worth investigating)
o Less than 60% consumed = Healthy
o 60% – 90% consumed = Monitor or Caution
o 91% – 100% consumed = Critical or Out of Spec
o System\Processor Queue Length (If there are more tasks ready to run than there are
processors, threads queue up. The processor queue is the collection of threads that are
ready but not able to be executed by the processor because another active thread is
currently executing. A sustained or recurring queue of more than two threads is a clear
indication of a processor bottleneck. You may get more throughput by reducing
parallelism in those cases. You can use this counter in conjunction with the Processor\%
Processor Time counter to determine if your application can benefit from more CPUs.
There is a single queue for processor time, even on multiprocessor computers.
Therefore, in a multiprocessor computer, divide the Processor Queue Length (PQL) value
by the number of processors servicing the workload. If the CPU is very busy (90 percent
and higher utilization) and the PQL average is consistently higher than 2 per processor,
you may have a processor bottleneck that could benefit from additional CPUs. Or, you
could reduce the number of threads and queue more at the application level. This will
cause less context switching, and less context switching is good for reducing CPU load.
The common reason for a PQL of 2 or higher with low CPU utilization is that requests for
processor time arrive randomly and threads demand irregular amounts of time from the
processor. This means that the processor is not a bottleneck but that it is your
threading logic that needs to be improved)
o Each processor has 10 or more threads waiting (Determines if the average processor
queue length exceeds ten times the number of processors. If this threshold is broken, then
the processor(s) may be at capacity)
o Each processor has 20 or more threads waiting (Determines if the average processor queue
length exceeds twenty times the number of processors. If this threshold is broken, then the
processor(s) are beyond capacity)
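The per-processor queue guidance above can be sketched as an illustrative helper that divides System\Processor Queue Length by the logical processor count:

```python
# Illustrative per-processor queue check: divide the system-wide
# Processor Queue Length by the number of logical processors, then apply
# the 10 / 20 per-processor thresholds described above.
def classify_pql(queue_length, processors):
    per_cpu = queue_length / processors
    if per_cpu >= 20:
        return "beyond capacity"
    if per_cpu >= 10:
        return "at capacity"
    return "ok"
```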
Common Performance Monitor counter thresholds
Question
What are some commonly used Performance Monitor (perfmon) counter thresholds?
Answer
This article lists some common perfmon counters with descriptions and thresholds. The threshold
values listed here are meant for use as a general 'rule-of-thumb' and each should be interpreted in
context of the specific performance issue currently at hand.
1. The threshold values provided below are averages, not minimums/maximums, and are only
useful when looked at within a meaningful time period, as the following points help to clarify.
2. What is the time range captured in the perfmon data set?
3. Extreme highs and lows within the time-range of the capture can result in less-useful averages.
In these cases, try to narrow the time range and then look at counter values around the time(s)
when the poor performance was observed.
4. In general, to be considered a genuine bottleneck, a given counter's threshold must be
exceeded either on average, or on a frequent basis, during the time period being analyzed.
5. This is not meant to be a comprehensive nor definitive reference.
How to Measure Storage Performance and IOPS on Windows?
One of the main metrics that allows you to estimate the performance of an existing or planned storage
system is IOPS (Input/Output Operations Per Second). In simple terms, IOPS is the number of
read/write operations performed against a storage device, disk or file system per unit of time. The
larger this number, the greater the performance of your storage (strictly speaking, the IOPS value has
to be considered along with other storage performance characteristics, like latency, throughput, etc.).
In this article, we will look at several ways to measure storage performance (IOPS, latency,
throughput) in Windows (you can use this guide for a local hard drive, SSD, SMB network folder,
CSV volume or LUN on SAN/iSCSI storage).
You can roughly estimate the current storage I/O workload in Windows using the built-in disk
performance counters from Performance Monitor. To collect this counter data:
1. Open Performance Monitor (perfmon.exe) and expand Data Collector Sets -> User Defined;
2. Create a new Data Collector Set manually (right-click User Defined -> New -> Data Collector Set);
3. Select the checkbox Create data logs -> Performance counter;
4. Now in the properties of the new data collection set, add the following performance counters for the Physical
Disk object (you can select the counters for a specific disk or for all available local disks):
5. You can change other data collection properties. By default, counter values are collected every 15 seconds.
To display real time disk performance, you need to add the specified Perfmon counters in the Monitoring
Tools -> Performance Monitor section.
6. Start collecting performance counter data (select Start) and wait until sufficient information
for analysis has been gathered. After that, right-click your data collector set and select Stop;
7. To view the collected performance data, go to Perfmon -> Reports -> User Defined -> Data_Disk_IO
and open the latest data set. By default, disk data is displayed as graphs;
How to understand storage performance counters collected by Perfmon? For a quick analysis of the disk/storage
performance, you need to look at the values of at least the following 5 counters.
When analyzing the counter data, it is advisable for you to understand the current physical disks
(storage) configuration (whether RAID or Stripe is used, the number and types of disks, cache size,
etc.).
Disk sec/Transfer – the time required to perform one write/read operation with the storage device
or disk (disk latency). If the delay is more than 25 ms (0.025 s), then the disk array cannot handle the
I/O operation on time. For highly loaded servers, the disk latency value should not exceed 10 ms (0.010 s);
Disk Transfers/sec – (IOPS). The number of read/write operations per second. This is the main
indicator of the disk access intensity (approximate IOPS values for different disk types are listed at
the end of the article);
Disk Bytes/Sec – total disk throughput (read+write) per second. Maximum values depend on the
disk type (150-250 MB/s for a regular HDD and 500-10,000 MB/s for SSDs, depending on the interface);
Split IO/sec – a disk fragmentation indicator when the operating system has to split one I/O
operation into multiple disk requests. It may also indicate that the application is requesting too large
blocks of data that cannot be transferred in one operation;
Avg. Disk Queue Length – average number of read/write requests that were queued. For a single
disk, the queue length should not exceed 2. For a RAID array of 4 disks, the threshold value of disk
queue length is 8.
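Two useful derived values follow directly from these counters: the average I/O size (throughput divided by IOPS) and latency in milliseconds (Disk sec/Transfer reports seconds). A hypothetical sketch:

```python
# Illustrative derived metrics from the counters above.
def avg_io_size_bytes(disk_bytes_per_sec, disk_transfers_per_sec):
    # average size of a single I/O: throughput divided by IOPS
    return disk_bytes_per_sec / disk_transfers_per_sec

def latency_ms(disk_sec_per_transfer):
    # Disk sec/Transfer reports seconds; convert to milliseconds
    return disk_sec_per_transfer * 1000.0
```

For example, 4 MB/s of throughput at 64 IOPS implies an average I/O size of 64 KB.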
Microsoft recommends to use the DiskSpd (https://aka.ms/diskspd) utility for generating a load on a
disk (storage) system and measuring its performance. This is a command line interface tool that can
perform I/O operations with the specified drive target in several threads. I quite often use DiskSpd to
measure the storage performance and get the maximum available read/write speed and IOPS from the
specific server (of course you can measure the performance of storage as well, in this case diskspd will
be used to generate the storage load).
The DiskSpd does not require installation, just download and extract the archive to a local disk. For
x64 bit systems, use the version of diskspd.exe from the amd64fre directory.
I use the following command to test the performance of the disk:
diskspd.exe -c50G -d300 -r -w40 -t8 -o32 -b64K -Sh -L E:\diskpsdtmp.dat > DiskSpeedResults.txt
Important. When using diskspd.exe, a considerable load is generated on the disks and CPU of
the tested system. To avoid performance degradation for users, it is not recommended to run
it on production systems during peak hours.
-c50G – file size 50 GB (it is better to use a large file size so that it does not fit in the cache of the storage
controller);
-d300 – test duration in seconds;
-r – random read/write operations (if you need to test sequential access, use -s);
-t8 – number of threads;
-w40 – ratio of write to read operations 40%/60%;
-o32 – queue length;
-b64K – block size;
-Sh – do not use cache;
-L – measure latency;
E:\diskpsdtmp.dat – test file path.
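For repeated runs it can be convenient to assemble the command line above programmatically. The following sketch rebuilds the example's argument list from named parameters (the function and its defaults are hypothetical, mirroring the run shown above):

```python
# Hypothetical wrapper that rebuilds the example diskspd command line
# from named parameters; the defaults mirror the run shown above.
def build_diskspd_cmd(target, file_size="50G", duration_s=300,
                      write_pct=40, threads=8, queue_depth=32,
                      block="64K", random_io=True):
    cmd = ["diskspd.exe", f"-c{file_size}", f"-d{duration_s}"]
    cmd.append("-r" if random_io else "-s")  # random vs sequential access
    cmd += [f"-w{write_pct}", f"-t{threads}", f"-o{queue_depth}",
            f"-b{block}", "-Sh", "-L", target]
    return cmd
```

The resulting list can be passed to a process launcher (e.g. subprocess.run) on the target server.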
After the stress test is completed, average storage performance values can be obtained from the output tables.
In my test, the following performance data (check the Total IO table) was obtained:
You can get individual values for read (section Read IO) or write (section Write IO) operations.
Having tested several disks or storage LUNs using diskspd, you can compare them or select an array
with the desired performance for your tasks.
How to Measure Storage IOPS, Throughput and Latency Using
PowerShell?
I have found a PowerShell script (by Mikael Nystrom, Microsoft MVP), which is essentially a wrapper
around the SQLIO.exe utility (a set of file storage performance tests).
Note. In December 2015, Microsoft announced the end of support for this tool, replaced SQLIO
with the more universal DiskSpd tool, and removed the SQLIO distribution files from its website. So
you will have to search for sqlio.exe yourself, or download it from our website (it is located in the
archive with the PowerShell script).
Let’s consider the script arguments:
In our case (a vmdk virtual disk on the VMFS datastore on HP MSA 2040 connected over SAN is used)
the disk array showed the average IOPS value of about 15,000 and the data transmission rate
(throughput) about 5 Gbit/s.
In the following table, the approximate IOPS values for different disk types are shown:
Type IOPS
SSD(SLC) 6000
SSD(MLC) 1000
I have found some recommendations for disk performance in IOPS for some popular Microsoft
services:
1. Microsoft Exchange 2010 with 5,000 users, each of them receives 75 and sends 30 emails per day, will
require at least 3,750 IOPS;
2. Microsoft SQL 2008 Server with 3,500 SQL transactions per second (TPS) requires 28,000 IOPS;
3. A common Windows application server for 10-100 users requires 10-40 IOPS.
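The per-unit rates implied by these figures can be derived and reused for rough sizing. Note the 0.75 IOPS/user and 8 IOPS/transaction rates below are computed from the examples above for illustration; they are not published constants:

```python
# Rough sizing rates derived from the examples above: Exchange 2010
# works out to 3,750 / 5,000 = 0.75 IOPS per user, and SQL Server 2008
# to 28,000 / 3,500 = 8 IOPS per transaction.
def exchange_iops(users, iops_per_user=3750 / 5000):
    return users * iops_per_user

def sql_iops(transactions_per_sec, iops_per_tx=28000 / 3500):
    return transactions_per_sec * iops_per_tx
```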
Diagnosing Disk Performance Issues
Disk performance issues can be hard to track down but can also cause a wide variety of problems. The
disk performance counters available in Windows are numerous, and being able to select the right
counters for a given situation is a great troubleshooting skill. Here, we'll review two basic scenarios –
measuring overall disk performance and determining if the disks are a bottleneck.
When it comes to disk performance, there are two important considerations: IOPS and byte
throughput. IOPS is the raw number of disk operations that are performed per second. Byte
throughput is the effective bandwidth the disk is achieving, usually expressed in MB/s. These numbers
are closely related - a disk with more IOPS can provide better throughput.
Disk Transfers/sec
o Total number of IOPS. This should be about equal to Disk Reads/sec + Disk Writes/sec
Disk Reads/sec
o Disk read operations per second (IOPS which are read operations)
Disk Writes/sec
o Disk write operations per second (IOPS which are write operations)
Disk Bytes/sec
o Total disk throughput per second. This should be about equal to Disk Read Bytes/sec +
Disk Write Bytes/sec
Disk Read Bytes/sec
o Disk read throughput per second
Disk Write Bytes/sec
o Disk write throughput per second
These performance counters are available in both the LogicalDisk and PhysicalDisk categories. In a
standard setup, with a 1:1 disk-partition mapping, these would provide the same results. However, if
you have a more advanced setup with storage pools, spanned disks, or multiple partitions on a single
disk, you would need to choose the correct category for the part of the stack you are measuring.
Here are the results on a test VM. In this test, diskspd was used to simulate an average mixed
read/write workload. The results show the following:
3,610 IOPS
o 2,872 read IOPS
o 737 write IOPS
17.1 MB/s total throughput
o 11.2 MB/s read throughput
o 5.9 MB/s write throughput
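The identities from the counter list (total ≈ read + write) can be checked against these sampled numbers, allowing for perfmon's rounding:

```python
# Sanity-check the identities above (total ~ read + write) against the
# sampled numbers from the test VM, allowing for perfmon rounding.
read_iops, write_iops, total_iops = 2872, 737, 3610
assert abs((read_iops + write_iops) - total_iops) <= 1  # off by rounding

read_mbps, write_mbps, total_mbps = 11.2, 5.9, 17.1
assert abs((read_mbps + write_mbps) - total_mbps) < 0.05
```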
In this case, we're seeing a decent number of IOPS with fairly low throughput. The expected results
vary greatly depending on the underlying storage and the type of workload that is running. In any
case, you can use these counters to get an idea of how a disk is performing during real world usage.
Disk Bottlenecks
Determining if storage is a performance bottleneck relies on a different set of counters than the
above. Instead of looking at IOPS and throughput, latency and queue lengths need to be
checked. Latency is the amount of time it takes to get a piece of requested data back from the disk
and is measured in milliseconds (ms). Queue length refers to the number of outstanding IO requests
that are in the queue to be sent to the disk. This is measured as an absolute number of requests.
The specific perfmon counters are Avg. Disk sec/Read, Avg. Disk sec/Write and Avg. Disk
sec/Transfer for latency, plus Current Disk Queue Length and Avg. Disk Queue Length for queuing.
Here are the results on a test VM. In this test, diskspd was used to simulate an IO-intensive
read/write workload.
Generally speaking, the performance tests can be interpreted with the following:
Disk latency should be below 15 ms. Disk latency above 25 ms can cause noticeable
performance issues. Latency above 50 ms is indicative of extremely underperforming storage.
Disk queues should be no greater than twice the number of physical disks serving the
drive. For example, if the underlying storage is a 6-disk RAID 5 array, the total disk queue
should be 12 or less. For storage that isn't mapped directly to an array (such as in a private
cloud or in Azure), queues should be below 10 or so. Queue length isn't directly indicative of
performance issues but can help lead to that conclusion.
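The queue-length rule above can be sketched as an illustrative helper; the 10-request limit for abstracted storage is the rough guideline from the text:

```python
# Illustrative queue-length check: at most twice the number of physical
# disks backing the volume, or roughly 10 when the storage is abstracted
# (private cloud, Azure) and the physical disk count is unknown.
def queue_ok(avg_queue, physical_disks=None):
    limit = 2 * physical_disks if physical_disks else 10
    return avg_queue <= limit
```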
These are general rules and may not apply in every scenario. However, if you see the counters
exceeding the thresholds above, it warrants a deeper investigation.
If a disk performance issue is suspected to be causing a larger problem, we generally start off by
running the second set of counters above. This will determine if the storage is actually a bottleneck,
or if the problem is being caused by something else. If the counters indicate that the disk is
underperforming, we would then run the first set of counters to see how many IOPS and how much
throughput we are getting. From there, we would determine if the storage is under-spec'ed or if there
is a problem on the storage side. In an on-premises environment, that would be done by working with
the storage team. In Azure, we would review the disk configuration to see if we're getting the
advertised performance.
Configuring Windows Performance Monitor to Capture Disk I/O
Activity and Potential Disk Issues
As most SmarterMail administrators know, a server’s hard disks are some of the most heavily used
resources when it comes to managing a secure mail server and often have the biggest impact on how
that mail server performs. As such, it’s important to keep a close eye on your disk activity, ensuring
there are no bottlenecks or latencies that can cause SmarterMail’s performance to suffer.
And when it comes to monitoring server performance on Windows, there’s no better tool than the one
built-in: Windows Performance Monitor! PerfMon, as it’s commonly known, is a console snap-in that
provides tools for analyzing your system’s performance. It’s great to use for recording a performance
baseline, monitoring your daily activity, troubleshooting server data or discovering potential disk issues
before they occur.
Follow along to learn how to configure PerfMon to capture information pertaining to your email server’s
disk I/O utilization. There are three sections in this guide: (1) steps for configuring a monitor to view
data in real-time, (2) steps for configuring a data collection set in which data can be captured over a
period of time and (3) an explanation of the performance counters and their expected values.
Note: While the interface may vary slightly, the steps for configuring PerfMon remain consistent across
supported server versions. The screenshots provided here were taken from Windows Server 2008 R2.
To monitor your disk activity in real-time and catch disk I/O bottlenecks before they occur, you’ll need
to configure certain performance monitors within PerfMon:
1. On the Windows server where SmarterMail is installed, open Performance Monitor. This can
be opened from the Start menu by clicking on Administrative Tools and selecting Performance
Monitor OR by opening the Run command, entering “perfmon.exe” and clicking OK.
2. Once open, add a new counter. This is done by expanding the Monitoring Tools folder in the
navigation pane and clicking on Performance Monitor. In the toolbar of icons above the main
window, click the green plus sign (+) icon. The counter settings will load in a popup window.
3. In the popup window, find the ‘Instances of selected object’ section and select the physical
disk(s) you want to monitor. Highlight <All instances> to monitor all disks on the server in
the same report. To monitor one or multiple disks individually, select each individual volume.
By default, _Total will be selected; however, this is the sum of all your disks and won’t provide
meaningful data for this configuration. (It’s important to do this step before selecting
performance counters, as changing the selected instance could remove the highlighting from
the chosen performance counters.)
4. Next, go to the ‘Available counters’ section and find PhysicalDisk. Expand its additional
options and highlight the following counters: % Disk Read Time, % Disk Time, % Disk Write Time,
% Idle Time, Current Disk Queue Length, Disk Reads/sec, Disk Writes/sec and Split IO/sec.
5. Click Add >>. The highlighted counters will be shown in the ‘Added counters’ section on the
right-hand side of the window.
Monitoring Real-Time Disk Activity
As soon as you close the Add Counters window, you’ll be dropped back into the PerfMon section where
you can begin monitoring your results!
There are three types of graphs that you can choose to view: Line, Histogram Bar or Report. To toggle
through the options, use the Change graph type button to the left of the plus sign (+) or press
Ctrl+G on your keyboard. We prefer reviewing the Report type as this lays out your data in a neat
table; however, when you’re monitoring quite a few disks, you may not be able to view all disk data
simultaneously.
So, if you review your results using the Line or Histogram Bar graphs instead, here are some things to
be aware of: If you chose <All instances> or the individual disk(s) when adding your performance
counters, each counter will be listed one time for every disk you’re monitoring. Use the column’s
sorting options to group disks together by Instance for easier review.
You may also find the Highlight toolbar button to be extremely useful in these views. When enabled,
the performance counter currently selected at the bottom of the window will have its corresponding
line/bar highlighted in black within the graph.
Monitoring Disk I/O Activity Over a Period of Time
Now that your real-time monitoring is squared away, we can move on to capturing data sets over a
period of time. This configuration is extremely useful for those incidents that are tough to catch in
real-time. For example, an issue that occurs once every hour, happens sporadically or one that pops
up after-hours.
To capture disk data over a period of time, we’ll configure a Data Collector Set within PerfMon that can
be started and stopped as needed:
1. On the Windows server where SmarterMail is installed, open Performance Monitor. This can
be opened from the Start menu by clicking on Administrative Tools and selecting Performance
Monitor OR by opening the Run command, entering “perfmon.exe” and clicking OK.
2. Once open, add a new Data Collector Set. This can be done by expanding the Data Collector
Sets folder in the navigation pane. Then right-click on User Defined, hover over New and
select Data Collector Set. The collector set settings will load in a popup window.
3. In the Create new Data Collector Set dialog window, enter a friendly name for your report, such
as “IO Report.” Select the bulleted option to Create manually (Advanced). Click Next.
4. On the next screen, select the bulleted option to Create data logs and
checkmark Performance counter. Click Next.
5. Now, click on Add.... A window for the performance counter settings will appear.
6. In the popup window, find the ‘Instances of selected object’ section and select the physical
disk(s) you want to monitor. Highlight <All instances> to monitor all disks on the server in
the same report. To monitor one or multiple disks individually, select each individual volume.
By default, _Total will be selected; however, this is the sum of all your disks and won’t provide
meaningful data for this configuration. (It’s important to do this step before selecting
performance counters, as changing the selected instance could remove the highlighting from
the chosen performance counters.)
7. Next, go to the ‘Available counters’ section and find PhysicalDisk. Expand its additional
options and highlight the following counters:
1. % Disk Read Time
2. % Disk Time
3. % Disk Write Time
4. % Idle Time
5. Current Disk Queue Length
6. Disk Reads/sec
7. Disk Writes/sec
8. Split IO/sec
8. Click Add >>. The highlighted counters will be shown in the ‘Added counters’ section on the
right-hand side of the window.
9. Next, find SmarterMail within the list of ‘Available counters.’ Expand its additional options and
highlight each one. (These counters will be helpful by allowing you to compare the values of
normal disk activity versus high disk I/O. For example, during an instance of high disk I/O, you
could potentially see an influx of IMAP connections, SMTP connections, file handles, threads,
etc., allowing you to understand the SmarterMail sections impacted so you can troubleshoot the
root cause of the issue.)
10. In the ‘Instances of selected object’ section, select the mailservice instance. (If you click on
mailservice immediately after highlighting all SmarterMail counters, all counters should still be
highlighted.) Then click Add >>. The counters will be shown in the ‘Added counters’ section,
indicated with an asterisk (*).
11. Click OK to close the window and return to the Create new Data Collector Set dialog window.
The performance counters just added will be displayed.
12. Adjust the Sample Interval as desired. In most cases, 15-30 seconds is enough (and can be
adjusted in the future if needed). Click Next.
13. Set the Root directory path. This is where the actual report data will be saved. It’s
recommended to save this to a volume that is not low on disk space, as these reports can get
fairly large if left running for a long period of time (days). Notate this location so you can pull
from this path, if needed. Click Next.
14. Leave the Run as: option at <Default>, unless special permissions are necessary for your
environment. In the bulleted options below, select Save and close. Click Finish.
When the wizard has finished, you’ll be dropped back into the PerfMon window where you can begin
collecting data for your report! Find the “IO Report” (or whatever friendly name you used) by
expanding the Data Collector Sets folder and clicking on User Defined. To begin capturing the Disk
I/O and SmarterMail data, right-click on the report name at any time and click Start. Once the data
has been captured for the desired period, right-click again and choose Stop.
Reviewing the Data Collector Set Report
To review the data you’ve collected, head over to the Reports folder and expand User Defined. Here
you’ll see the name of your report and, below it, each set of data that has been collected. Select the
latest report to view its information.
There are five types of graphs that you can choose to view: Line, Histogram Bar, Report, Area or
Stacked Area. To toggle through the options, use the ‘Change graph type’ button to the left of the plus
sign (+) or press Ctrl+G on your keyboard. Again, we prefer the Report type as this lays out your data
in a neat table; however, when you’re monitoring quite a few disks, you may not be able to view all
disk data simultaneously.
So, if you review your results using the graphs instead, here are some things to be aware of: if you chose <All instances> or the individual disk(s) when adding your performance counters, each counter will be listed once for every disk you’re monitoring. Use the column’s sorting options to group disks together by Instance for easier review.
You may also find the Highlight toolbar button to be extremely useful in these views. When enabled,
the performance counter currently selected at the bottom of the window will have its corresponding
line/bar highlighted in black within the graph.
Finally, if the actual report data needs to be pulled -- either for a support ticket with the SmarterTools
Support Department or for you to review on an external system -- this can be obtained from the path
specified in step 13.
Understanding PerfMon Counters and their Results
So now that we have all the steps in place for monitoring your disk I/O activity, it’s important that you understand the information each performance counter provides, as well as the results that should be expected on a healthy installation that is capable of handling the I/O requirements:
*Though the % Disk Read Time and % Disk Write Time values can fluctuate up to 35-40%, this isn’t a firm indicator of true bottlenecking. However, if you see these values exceed 70-80%, this indicates the disk activity is VERY high. Chances are, during this same period, you will notice % Idle Time sitting around 0-10%.
**In combination with high Disk Read\Write percentages, if the Current Disk Queue Length exceeds 1-
2, noticeable slowness will occur within the SmarterMail web interface and many other aspects may be
affected, including message deliveries, IMAP\EWS\EAS synchronization and more. This is because the
OS would have to queue the Read\Write operations rather than committing said operations to the disk
in real time.
There we have it! Using the steps above, you’ve created real-time and historical monitors to keep a
close eye on your server’s disk activity, and with your hard disks performing at their best, you’re well
on your way to a healthy, reliable and high-performing mail server.
So what other tools do you use for maintaining your mail server performance? Are there any additional
performance counters you recommend monitoring? Let us know in the comments!
SQL Server disk performance metrics – Part 1 – the most
important disk performance metrics
So far, we have presented the most important memory and processor metrics. These metrics indicate
system and SQL Server performance, and are useful for troubleshooting performance issues and
bottlenecks. Besides memory and processor metrics, equally important are SQL Server disk metrics.
Sometimes a metric from one category can be masked by other events and be misleading – e.g. a disk
issue can cause processor bottlenecks. That’s why it’s necessary to understand the cause and effect of
each metric.
Disk metrics are not related only to disk itself, but to the whole disk subsystem which includes disk,
the disk controller card and the I/O disk system bus. For SQL Server disk performance monitoring, it’s
recommended to monitor the metrics for a while, determine the trend, and set a baseline for normal
operation. Then, compare the current metric values to baselines.
Most of these metrics are available in Windows Performance Monitor, where they are divided into 2
groups – Physical Disk and Logical Disk metrics. A Logical disk is a disk partition, while a physical disk
is the complete physical disk with all partitions created on it. The metrics in both groups are the same,
the only difference is whether they show the performance for a single partition, or for the entire disk.
Some physical disk metrics might not be sufficient for deeper investigation and troubleshooting if you have more than one logical partition on a disk. This is where logical disk metrics are useful, as they show more granular results and help determine the effect of SQL Server or any other application on disk performance.
SQL Server uses I/O calls to perform reads and writes on a disk: it defines and manages requests for reading and writing the data, while the operating system actually performs the I/O operations. Problems with disk I/O operations manifest as slow response times, operation timeouts, and system bottlenecks.
To troubleshoot SQL Server disk issues, besides total disk I/O activity, it’s recommended to monitor
and detect disk activity made by SQL Server.
Excessive disk usage by various applications can cause SQL Server performance degradation, as SQL Server might not be the master of disk resources and would have to wait for disk reads and writes.
The SQL Server activities that require disk access are creating database and transaction log backups
and saving them to disk, import/export processes, jobs that read or write large amounts of data
from/to disk, etc.
Average Disk sec/Read
The Average Disk sec/Read metric, along with Average Disk sec/Write (presented next), is one of the most important disk performance metrics. Both metrics can be tracked at the logical and physical disk levels and show disk latency. The shorter the time needed to read or write data, the faster the system.
“The value for this counter is generally the number of seconds it takes to do each read. On less-
complex disk subsystems involving controllers that do not have intelligent management of the I/O, this
value is a multiple of the disk’s rotation per minute. This does not negate the rule that the entire
system is being observed. The rotational speed of the hard drive will be the predominant factor in the
value with the delays imposed by the controller card and support bus system.” [1]
Average Disk sec/Read is proportional to the time needed for one disk rotation. For example, a disk that makes 3,600 rotations per minute needs 60s/3600 ≈ 0.0167 seconds, i.e. about 16.7 milliseconds, to make one rotation. Average Disk sec/Read for that disk should be a multiple of roughly 16.7 milliseconds. The time added to one disk rotation is the queuing time and the time needed for data transit across the I/O bus.
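The rotation-time arithmetic above can be sketched as a short helper (a hypothetical illustration, not part of any Windows API):

```python
def rotational_latency_ms(rpm: float) -> float:
    """Time in milliseconds for one full platter rotation at the given spindle speed."""
    return 60.0 / rpm * 1000.0

# A 3,600 RPM disk needs ~16.7 ms per rotation; a 7,200 RPM disk ~8.3 ms.
print(round(rotational_latency_ms(3600), 1))  # 16.7
print(round(rotational_latency_ms(7200), 1))  # 8.3
```

Actual Avg. Disk sec/Read values will be multiples of this figure once queuing and bus transit time are added.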
< 8 ms – Excellent
8 – 12 ms – OK
12 – 20 ms – Fair
> 20 ms – Bad
Maximum peaks during excessive I/O operations can be up to 25 milliseconds, but values constantly
higher than 20 milliseconds indicate poor performance.
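The thresholds above can be expressed as a small rating helper (a hypothetical sketch for non-cached reads, not from the article):

```python
def rate_read_latency(ms: float) -> str:
    """Classify an Avg. Disk sec/Read value (in milliseconds) per the thresholds above."""
    if ms < 8:
        return "Excellent"
    if ms <= 12:
        return "OK"
    if ms <= 20:
        return "Fair"
    return "Bad"

print(rate_read_latency(5))   # Excellent
print(rate_read_latency(15))  # Fair
print(rate_read_latency(25))  # Bad
```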
Average Disk sec/Write
Average Disk sec/Write is another useful disk performance metric that shows the average time in
seconds needed to write data to disk.
Usually, the read and write speeds on a disk are different. The recommended values for non-cached writes are the same as for Average Disk sec/Read. In the case of cached writes, the values are very different – values higher than 4 milliseconds indicate poor performance, while values less than 1 millisecond indicate the best performance.
< 1 ms – Excellent
1 – 2 ms – OK
2 – 4 ms – Fair
> 4 ms – Bad
If the Average Disk sec/Read and Average Disk sec/Write values are constantly above the
recommended values, it’s an indication of a disk bottleneck and additional analysis is required.
“After you have found the disks with high levels of read/write activity, look at the read-specific and
write-specific counters (for example, Logical Disk: Disk Write Bytes/sec) for the type of disk activity
that is causing the load on each logical volume.” [2]
If the Average Disk sec/Read and Average Disk sec/Write values are high for all or almost all disks, the problem is most probably caused by the disk communication medium. If only a specific disk shows poor performance, the problem is most probably in the disk itself.
Monitoring both values can help you determine if reconfiguration of the disk controller cache is needed. If, for example, the Average Disk sec/Read value is significantly higher than Average Disk sec/Write, you can consider cache optimization for reading.
Average Disk sec/Transfer
The Average Disk sec/Transfer metric shows disk efficiency as the average time needed for each read
and write.
“Measures the average time of each data transfer, regardless of the number of bytes read or written. Shows the total time of the read or write, from the moment it leaves the Diskperf.sys driver to the moment it is complete.
A high value for this counter might mean that the system is retrying requests due to lengthy queuing or, less commonly, disk failures.” [3]
The recommended value is the same as for the previous two metrics.
There’s no need to monitor this metric along with Average Disk sec/Read and Average Disk sec/Write,
as the latter two are combined in Average Disk sec/Transfer. But if you’re monitoring Average Disk
sec/Transfer and its values are higher than recommended, monitoring Average Disk
sec/Read and Average Disk sec/Write is the first step in further troubleshooting.
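Since transfers are just reads plus writes, Average Disk sec/Transfer is effectively the rate-weighted average of the read and write latencies. A hypothetical sketch of that relationship (the function name and sample figures are illustrative, not from the article):

```python
def avg_sec_per_transfer(reads_per_sec: float, avg_sec_read: float,
                         writes_per_sec: float, avg_sec_write: float) -> float:
    """Combine read and write latencies, weighted by their operation rates."""
    transfers = reads_per_sec + writes_per_sec
    if transfers == 0:
        return 0.0
    return (reads_per_sec * avg_sec_read + writes_per_sec * avg_sec_write) / transfers

# 300 reads/s at 10 ms plus 100 writes/s at 2 ms gives ~8 ms per transfer overall.
print(round(avg_sec_per_transfer(300, 0.010, 100, 0.002), 3))  # 0.008
```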
Disk Reads/sec and Disk Writes/sec
The Disk Reads/sec and Disk Writes/sec metrics show the rate of read and write operations on disk,
respectively.
The metric that shows the combined value of these two is Disk Transfers/sec – the total number of all I/O disk requests generated in a second.
If the values are low, they indicate slow disk I/O operation processing, and you should check processor usage parameters and disk-expensive queries.
There is no specific threshold, as it depends on disk specification and your server configuration. For an
array system, the values shown are for all disks. With that said, it’s recommended to monitor these
metrics for a while and to determine trends and set a baseline. Any unexpected peaks should be
investigated.
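The baselining approach described above can be sketched in a few lines: compute a baseline from a window of Disk Transfers/sec samples and flag values that stand out. This is a hypothetical illustration; the factor of 2 and the sample figures are arbitrary assumptions, not recommended thresholds.

```python
def find_peaks(samples: list, factor: float = 2.0) -> list:
    """Return samples that exceed the baseline (mean of the window) by the given factor."""
    baseline = sum(samples) / len(samples)
    return [s for s in samples if s > baseline * factor]

transfers_per_sec = [120, 135, 110, 128, 540, 125]  # one suspicious spike
print(find_peaks(transfers_per_sec))  # [540]
```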
In this part of the SQL Server performance metrics series, we presented the most important disk
performance metrics. All metrics show disk latency and if the latency is too high, the final solution is
upgrading the disk subsystem, or adding more disks.
SQL Server disk performance metrics – Part 2 – other important
disk performance measures
In the previous part of the SQL Server performance metrics series, we presented the most important
and useful disk performance metrics. Now, we’ll show other important disk performance measures.
Current Disk Queue Length
“Indicates the number of disk requests that are currently waiting as well as requests currently being
serviced. Subject to wide variations unless the workload has achieved a steady state and you have
collected a sufficient number of samples to establish a pattern.” [1]
The metric shows how many I/O operations are waiting to be written to or read from the hard drive
and how many are currently processed. If the hard drive is not available, these operations are queued
and will be processed when the disk becomes available. The whole disk subsystem has a single queue.
The Current Disk Queue Length metric in Windows Performance Monitor is available for both physical
and logical disk. In some earlier versions of Performance Monitor, this counter was named Disk Queue
Length.
The Current Disk Queue Length value should be less than 2 per disk spindle. Note that this is not per
logical, but per physical disk. If larger, this indicates a potential disk bottleneck, so further
investigation and monitoring other disk metrics is recommended. Start with monitoring %Disk
Time (explained below). Frequent peaks should also be investigated.
Disk array systems such as RAID or SAN have a large number of disks and controllers, which makes
queues on such systems shorter. Because the metric doesn’t indicate queuing per disk, but for the
whole array, some DBAs consider that monitoring Current Disk Queue Length on disk arrays is not
needed.
Another scenario where Current Disk Queue Length can be misleading is when data is stored in the
disk cache. It will be reported as being queued for writing and thus the Current Disk Queue
Length value will be higher than the actual value.
Average Disk Queue Length
The Average Disk Queue Length metric shows information similar to Current Disk Queue Length, only the value is an average over a specific time period rather than a point-in-time value. The threshold is the same as for the previous metric – less than 2 per individual disk drive in an array. For example, in a 6-disk array an Average Disk Queue Length value of 12 means that the queue is 2 per disk.
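The per-spindle normalization in the 6-disk example above can be sketched as follows (a hypothetical helper, not from the article):

```python
def queue_per_spindle(total_queue_length: float, spindles: int) -> float:
    """Normalize an array-wide queue length to a per-spindle value."""
    return total_queue_length / spindles

print(queue_per_spindle(12, 6))       # 2.0
print(queue_per_spindle(12, 6) <= 2)  # True, i.e. at the recommended threshold
```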
There are two more metrics similar to Average Disk Queue Length – Average Disk Read Queue
Length and Average Disk Write Queue Length. As their names indicate – they show the average queue
length for operations waiting for the disk to be read or written.
%Disk Time
“This counter indicates a disk problem, but must be observed in conjunction with the Current Disk
Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to
the %Disk Time reaching 100%.” [2]
The %Disk Time metric indicates how busy the disk is servicing read and write requests, but as stated above, it’s not a clear indication of a problem, as its values can be normal while there’s a serious disk performance issue. Its value is the Average Disk Queue Length value expressed as a percentage (i.e. multiplied by 100). If Average Disk Queue Length is 1, %Disk Time is 100%.
What can be confusing is that %Disk Time values can be over 100%, which isn’t logical. This happens
if the Average Disk Queue Length value is greater than 1. If Average Disk Queue Length is 3, %Disk
Time is 300%, which doesn’t mean that processes are using 3 times more disk time than available, nor
that there is a bottleneck.
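The relationship described above, %Disk Time as Average Disk Queue Length expressed as a percentage, is simple enough to show directly (a hypothetical illustration, not the counter’s actual implementation):

```python
def disk_time_percent(avg_disk_queue_length: float) -> float:
    """%Disk Time is the Average Disk Queue Length multiplied by 100."""
    return avg_disk_queue_length * 100.0

print(disk_time_percent(1))  # 100.0
print(disk_time_percent(3))  # 300.0 (over 100%, yet not necessarily a bottleneck)
```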
If you have a hard disk array, the total disk time for all disks is shown, without the indication of how
many disks are available and what disk is having the highest %Disk Time. For example, %Disk
Time equal to 500% might indicate good performance (in case you have 6 disks), or extremely bad (in
case you have only 1 disk). You cannot tell without knowing the machine hardware.
As this counter can be misleading, some DBAs don’t use it, as there are other more straightforward and indicative metrics that show disk performance.
If the value is higher than 90% per disk, additional investigation is needed. First, check the Current
Disk Queue Length value. If it’s higher than the threshold (2 per physical disk), monitor if the high
values occur frequently. If the machine is not used only for SQL Server, other resource-intensive applications might cause disk bottlenecks, and SQL Server performance will suffer. If this is the case, consider moving these applications to another machine and using a dedicated machine for SQL Server only.
If this is not the case, or cannot be done, consider moving some of the files – archived databases, database and transaction log backups – to another disk or machine, using a faster disk, or adding additional disks to the array.
%Disk Read Time and %Disk Write Time
The %Disk Read Time and %Disk Write Time metrics are similar to %Disk Time, but they show only the read or write operations, respectively. They are actually the Average Disk Read Queue Length and Average Disk Write Queue Length values expressed as percentages. The values these metrics show can be just as misleading as those of %Disk Time.
On a three-disk array system, if one disk reads 50% of the time (%Disk Read Time = 50%), another reads 85% of the time, and the third is idle, %Disk Read Time is 135% and Average Disk Read Queue Length is 1.35. At first glance, a %Disk Read Time of 135% looks like a problem, but it’s not. It doesn’t mean the disks are busy 135% of the time. To get a real value, divide the value by the number of disks: 135%/3 = 45%, which indicates normal performance.
%Idle Time
The disk is idle when it’s not processing read and write requests.
“This measures the percentage of time the disk was idle during the sample interval. If this counter falls
below 20 percent, the disk system is saturated. You may consider replacing the current disk system
with a faster disk system.” [3]
If the value is lower than 20%, the disk is not able to service all read and write requests in a timely fashion. Before opting for disk replacement, check whether it’s possible to move some applications to another machine.
%Free Space
Besides Windows Performance Monitor, this metric is available in Windows Explorer in the computer
and disk Properties tabs. While Performance Monitor shows the percentage of available free disk space,
Windows Explorer shows the amount in GB.
“This measures the percentage of free space on the selected logical disk drive. Take note if this falls
below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious
solution here is to add more disk space.” [3]
If the value shows sudden peaks without obvious reasons, further investigation is required.
Unlike most of the memory and processor SQL Server performance metrics, disk metrics can be quite deceptive. They might not clearly indicate a performance problem; their values might look OK when there is actually a serious disk issue, while strangely high values might reflect normal performance, as they show values for an array of disks. When it comes to array metrics, knowledge of the hardware configuration is necessary to read them correctly. Despite these downsides, disk metrics are necessary for SQL Server performance troubleshooting.
https://sites.google.com/site/saifsqlserverrecipes/memory-performance-counters/key-performance-counters-and-their-thresholds-for-windows-server
https://knowledge.broadcom.com/external/article/181816/common-performance-monitor-counter-thres.html
http://woshub.com/how-to-measure-disk-iops-using-powershell/
https://www.concurrency.com/blog/september-2019/diagnosing-disk-performance-issues
https://www.smartertools.com/blog/2016/07/15-configure-perfmon-to-prevent-disk-issues
https://www.sqlshack.com/sql-server-disk-performance-metrics-part-1-important-disk-performance-metrics/
https://www.sqlshack.com/sql-server-disk-performance-metrics-part-2-important-disk-performance-measures/