
The Association of System Performance Professionals

The Computer Measurement Group, commonly called CMG, is a not-for-profit, worldwide organization of data processing professionals committed to the measurement and management of computer systems. CMG members are primarily concerned with performance evaluation of existing systems to maximize performance (e.g., response time, throughput) and with capacity management, where planned enhancements to existing systems or the design of new systems are evaluated to find the necessary resources required to provide adequate performance at a reasonable cost.

This paper was originally published in the Proceedings of the Computer Measurement Group’s 2000 International Conference.

For more information on CMG please visit http://www.cmg.org

Copyright Notice and License

Copyright 2000 by The Computer Measurement Group, Inc. All Rights Reserved. Published by The Computer Measurement Group, Inc. (CMG), a non-profit Illinois membership corporation. Permission to reprint in whole or in any part may be granted for educational and scientific purposes upon written application to the Editor, CMG Headquarters, 151 Fries Mill Road, Suite 104, Turnersville, NJ 08012.

BY DOWNLOADING THIS PUBLICATION, YOU ACKNOWLEDGE THAT YOU HAVE READ, UNDERSTOOD AND AGREE TO BE BOUND BY THE
FOLLOWING TERMS AND CONDITIONS:

License: CMG hereby grants you a nonexclusive, nontransferable right to download this publication from the CMG Web site for personal use on a single
computer owned, leased or otherwise controlled by you. In the event that the computer becomes dysfunctional, such that you are unable to access the
publication, you may transfer the publication to another single computer, provided that it is removed from the computer from which it is transferred and its use
on the replacement computer otherwise complies with the terms of this Copyright Notice and License.

Concurrent use on two or more computers or on a network is not allowed.

Copyright: No part of this publication or electronic file may be reproduced or transmitted in any form to anyone else, including transmittal by e-mail, by file
transfer protocol (FTP), or by being made part of a network-accessible system, without the prior written permission of CMG. You may not merge, adapt,
translate, modify, rent, lease, sell, sublicense, assign or otherwise transfer the publication, or remove any proprietary notice or label appearing on the
publication.

Disclaimer; Limitation of Liability: The ideas and concepts set forth in this publication are solely those of the respective authors, and not of CMG, and CMG
does not endorse, approve, guarantee or otherwise certify any such ideas or concepts in any application or usage. CMG assumes no responsibility or liability
in connection with the use or misuse of the publication or electronic file. CMG makes no warranty or representation that the electronic file will be free from
errors, viruses, worms or other elements or codes that manifest contaminating or destructive properties, and it expressly disclaims liability arising from such
errors, elements or codes.

General: CMG reserves the right to terminate this Agreement immediately upon discovery of violation of any of its terms.
Multiprocessor scalability in Microsoft Windows NT/2000

Mark B. Friedman
DataCore Software, Inc.
1020 Eighth Avenue South, Suite 6
Naples, FL USA 34102
[email protected]

This paper provides an overview of the multiprocessing support in the Microsoft Windows NT/2000 operating system, with an emphasis on scalability and other capacity planning issues. It also discusses specific features of the Intel P6 architecture that provide the hardware basis for large-scale multiprocessing systems. As a shared-memory multiprocessing implementation, Windows NT/2000 is predictably vulnerable to saturation on the shared memory bus. Processor hardware measurements that can illuminate memory bus contention when it appears are also described and discussed.

Introduction

The specific type of multiprocessing Windows NT/2000 implements using Intel P6 processors (the Pentium Pro, Pentium II, and Pentium III models) is generally classified as shared-memory multiprocessing. In this type of configuration, the processors operate totally independently of each other, but they do share a single copy of the operating system and they share access to main memory (i.e., RAM). A typical dual processor shared-memory configuration is illustrated in Figure 1. Notice the illustration shows two P6 processors, which contain dedicated Level 2 caches. (They each have separate built-in Level 1 caches, too.) A two-way configuration, as illustrated, simply means having twice the hardware – two identical sets of processors, caches, and internal buses. Similarly, a four-way configuration means having four of everything. Having duplicate caches is designed to promote scalability, since cache is so fundamental to the performance of pipelined processors. The processors also share a common bus that is used to access main memory locations. This, obviously, is not so scalable. And, according to experienced hardware designers, this shared component is precisely where the bottleneck in shared-memory designs often is.

Figure 1. A shared-memory multiprocessor. Each processor in a multiprocessor has access to dedicated Level 1 and Level 2 caches. Access to system RAM, in contrast, is shared via a common system bus. [Diagram: two P6 processors, each with its own L1 and L2 cache, attached to a shared memory bus.]

Operating system support for multiprocessing. Each processor in a multiprocessor is capable of executing work independently of the other. Separate, independent threads may be dispatched, one per processor, and run in parallel. Only one copy of the Windows 2000 operating system is running, controlling what runs on all the processors. From a performance monitoring perspective, you will see multiple instances of the Processor Object reported in both Taskman (as illustrated below in Figure 2) and Perfmon (see Figure 3).

The specific type of multiprocessor support offered beginning in Windows NT 4.0 is known as symmetric multiprocessing, often abbreviated as SMP. Symmetric in this context means that every thread is eligible to execute on any processor. Prior to NT 4.0, Windows NT supported only asymmetric multiprocessing: interrupts could only be processed in an NT 3.5x machine on CPU 0. When they are present, CPUs 1, 2, and 3 can only run user and kernel code, never Interrupt Service Routines (ISRs) and Deferred Procedure Calls (DPCs). This asymmetry ultimately limits the scalability of NT 3.5x multiprocessor systems because the CPU 0 engine is readily overloaded under some workloads, while the remaining microprocessors are idling. In an SMP, in theory, all the microprocessors should run out of capacity at the same time. One of the key Microsoft development projects associated with the NT 4.0 release was changes to the kernel to support SMPs. In addition, the Windows NT development team fine-tuned the OS code to run much better in a multiprocessing environment. Windows 2000 also incorporates further improvements inside the operating system to boost performance on large n-way multiprocessor configurations.


The symmetric multiprocessing (SMP) support available in Windows NT 4.0 and above normally allows any processor to process any interrupt, as illustrated in Figure 3. This performance data illustrates a two-way symmetric multiprocessor system running NT 4.0 Server. (Be careful: the vertical axis scale was adjusted down to a maximum of thirty to make the chart data easier to decipher.) The two processor instances of the % Privileged Time and % Interrupt Time Counters are shown. The processing workload is roughly balanced across both processors, although the load does bounce back and forth a bit, depending on what threads happen to be ready to run. As should be evident from this picture, threads are dispatched independently in Windows 2000, so it is possible for a multithreaded application to run in parallel on separate processors.

Figure 2. Measurement support for multiprocessors in Task Manager.

Figure 3 shows the amount of CPU time consumed servicing interrupts on the two processors. The amount of time spent processing interrupts per processor is roughly equal, though there is some degree of variation that occurs naturally. This is not always the most efficient way to process interrupts. Having one type of ISR or DPC directed to a single processor can have a positive impact on performance if the processor that runs the DPC code, for instance, is likely to be able to cache it, rather than being forced to fetch it from memory. Similarly, the Win2K scheduler tries to dispatch a ready thread on the same processor where it recently ran for that very same reason. A thread is said to have an affinity for the processor where it was most recently dispatched. Processor affinity in Windows 2000 thread scheduling is discussed below.

Processor affinity. Logically, the structure of the Windows 2000 Thread Scheduler Ready Queue and the relative priority of threads waiting to execute is identical whether Win2K is executing on a single processor or on multiple processors. The main difference is that multiple threads can run concurrently on a multiprocessor, a little detail that leads to many complications. The Win2K Scheduler, in turn, selects the highest priority waiting thread to run on each available processor. In addition, certain execution threads may have an affinity to execute on specific processors. Win2K supports both hard affinity, where a given thread is eligible to run only on specific processors, and soft affinity, where Win2K favors scheduling specific threads on specific processors, usually for performance reasons.

Figure 3. Symmetric multiprocessing in Windows 2000 and NT 4. Operating system privileged threads and Interrupt Service Routines (ISRs) are eligible to be dispatched on any available processor.

Hard affinity is specified at the process and thread level using a processor affinity mask.


The Win32 API calls to accomplish this are straightforward. First, a Thread issues a GetProcessAffinityMask call referencing a process handle, which returns a 32-bit SystemAffinityMask. Each bit in the SystemAffinityMask represents a configured processor. Then, the Thread calls SetProcessAffinityMask with a corresponding 32-bit affinity mask that indicates which processors Threads from the process can be dispatched on. Figure 4 illustrates the use of this function in Taskman, which allows you to set a process's affinity mask dynamically, subject to the usual security restrictions, which by default allow Taskman to operate only on foreground-eligible processes. There is a corresponding call, SetThreadAffinityMask, to override the process settings for specific threads. Once hard affinity is set, threads are only eligible to be dispatched on specific processors.

Figure 4. Setting a process's processor affinity mask using Taskman.
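The call sequence is easy to sketch in C. The following is a minimal, illustrative example (not from the paper) that queries the masks and then pins the current process, and one thread, to CPU 0; the mask value of 1 (bit 0) is an assumption for a machine on which CPU 0 is configured:

    #include <windows.h>
    #include <stdio.h>

    /* Minimal sketch: pin the current process (and one thread) to CPU 0.
       Real code should derive its mask from the SystemAffinityMask
       returned by GetProcessAffinityMask rather than hard-coding it. */
    int main(void)
    {
        DWORD_PTR processMask, systemMask;

        /* Query the current process and system affinity masks. */
        if (!GetProcessAffinityMask(GetCurrentProcess(),
                                    &processMask, &systemMask)) {
            fprintf(stderr, "GetProcessAffinityMask failed: %lu\n",
                    GetLastError());
            return 1;
        }
        /* Casts are for printing only; the masks discussed in the
           text are 32 bits wide. */
        printf("system mask: 0x%lx, process mask: 0x%lx\n",
               (unsigned long)systemMask, (unsigned long)processMask);

        /* Hard affinity: every thread in the process becomes eligible
           to run only on CPU 0 (bit 0 of the mask). */
        if (!SetProcessAffinityMask(GetCurrentProcess(), 1)) {
            fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                    GetLastError());
            return 1;
        }

        /* A thread can further override the process setting; here the
           current thread is also restricted to CPU 0. */
        if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0) {
            fprintf(stderr, "SetThreadAffinityMask failed: %lu\n",
                    GetLastError());
            return 1;
        }
        return 0;
    }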
Suppose that following an interrupt an application program thread becomes Ready to run and there are multiple processors that are idle. First, Win2K must choose among available processors to select the processor for a Ready thread to run on. This decision is based on performance. If the thread was previously dispatched within the last Scheduler quantum (or timeslice), Win2K attempts to schedule the thread on that same processor, so long as the ready thread has a higher priority than the thread that is already running there. This is known as soft affinity. If the desired processor is currently busy with a higher priority task, Win2K is willing to schedule the waiting thread on a different processor. By scheduling the thread back on the same processor where it ran last, Win2K hopes that a good deal of the thread's code and data from the previous execution interval are still present in that processor's cache. The difference in instruction execution rate between a cache “cold start,” when a thread is forced to fault its way through its frequently accessed code and data, and a “warm start,” when the cache is preloaded, can be substantial.

Shared-memory multiprocessor scalability

Shared-memory multiprocessors running SMP operating systems are the most common breed of multiprocessor. A great deal is known about hardware performance in this context. Any multithreaded application will likely benefit from having a choice of processors to run on – where any thread can run on any available processor in an SMP. Even single threaded applications may benefit, since multiple applications can run in parallel. Because a dual processor system running Windows 2000 can dispatch two threads at a time, not just one, it seems reasonable to assume the dual processor configuration is twice as powerful as having a single engine to do the work. To see why this isn't exactly so, we will need to investigate a few aspects of shared-memory multiprocessor scalability.

Shared-memory multiprocessors have some well known scalability limitations. They seldom provide perfect linear scalability. Each time you add a processor to the system, you do get a corresponding boost in overall performance, but with each additional processor the boost you get tends to diminish. It is also possible to reach a point of diminishing returns where adding another processor actually reduces overall capacity. In this section we discuss several fundamental issues that impact multiprocessing scalability, including:

• the overhead of multiprocessor synchronization,
• multiprocessor-related pipeline stalls that are caused by cache coherence conflicts, and
• cycles wasted by code executing spin locks.

Understanding these sources of performance degradation in a shared-memory multiprocessor will help us in examining and interpreting the extensive processor utilization measurements that are available in Windows NT/2000. One thing to be careful about is that the processors in an SMP may look very busy, but if you are able to look inside the processor, you may find they are not performing as much productive work. The way to look inside is to use the Pentium Counters [1]. On a simple, single engine system, Instructions retired/sec, the internal P6 measure of Instruction Execution Rate (IER), generally tracks % Processor Time very well. Scalability issues mean that internal IER and external processor busy measures can no longer be expected to correspond in predictable ways for more complicated multiprocessors.

Figure 5, taken from a two-way multiprocessor, illustrates this essential point. Figure 5a shows the Task Manager histogram of processor utilization on this machine. Notice that both engines are running at near 100% utilization. This configuration contains two 200 MHz Pentium Pro machines. Remember, on a uniprocessor a good Rule of Thumb is to expect performance at or near 2 Cycles per instruction (CPI). This translates into capacity of about 100,000,000 instructions per second on a 200 MHz P6. Figure 5b shows actual measurements for one of these engines at only 12,345,000 instructions per second.


This machine is delivering only about 12% of its rated performance – notice over 146 million resource-related stalls, too. Making sure that an expensive 4 or 8-way multiprocessor Server configuration that you purchased is performing up to its capacity is not a trivial affair.

Figure 5. Multiprocessor scalability issues force you to look inside the processor at internal measures of execution rate, because they may no longer correspond to external processor busy measurements. At left is a Task Manager histogram of processor utilization showing that both engines are running at near 100% utilization; this configuration contains two 200 MHz Pentium Pro machines. Figure 5b at right shows P6 measurements for one of these engines running at only 12,345,000 instructions per second, or a CPI of about 16.7. (A good target CPI on a uniprocessor was about 2.0.) Over 146 million resource-related stalls are being measured concurrently.
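The CPI arithmetic behind the caption is worth making explicit. A minimal sketch using the figures quoted above (the small difference from the caption's 16.7 presumably reflects rounding or the exact effective clock rate):

    #include <stdio.h>

    /* Derive cycles per instruction (CPI) from a processor's clock rate
       and its measured instruction execution rate. The numbers are the
       ones reported in Figure 5 for a 200 MHz Pentium Pro. */
    int main(void)
    {
        double clock_rate   = 200e6;      /* 200 MHz clock             */
        double rated_cpi    = 2.0;        /* Rule-of-Thumb target CPI  */
        double measured_ier = 12345000.0; /* Instructions retired/sec  */

        double rated_ier = clock_rate / rated_cpi;  /* ~100M instr/sec */
        double cpi = clock_rate / measured_ier;     /* ~16, vs. the
                                                       caption's 16.7  */
        double pct_of_rated = 100.0 * measured_ier / rated_ier; /* ~12% */

        printf("rated IER: %.0f instructions/sec\n", rated_ier);
        printf("measured CPI: %.1f\n", cpi);
        printf("delivered: %.0f%% of rated capacity\n", pct_of_rated);
        return 0;
    }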
scalability. Actual performance is very workload-depen-
Speed-up factors. When we talk about scalability in the context of either multiprocessors or parallel processors, we are referring to our basic desire to harness the power of more than one processor to solve a common problem. The goal of a dual processor design is to apply double the CPU horsepower to a single problem and solve it in one half the time. The goal of a quad processor is to apply quadruple the processing power and solve problems in one quarter the time. You get the idea. The term speed-up factor refers to our expectation that a multiprocessor design will improve the amount of time it takes to process some workload. If a multiprocessing design supported a speed-up factor of 1, then two processors would provide fully twice the power of a single engine. This would be perfect linear scalability, something shared-memory multiprocessors are just not capable of. A reasonable expectation is for a speed-up factor in the range of about 0.85. This means that two Intel processors tied together would only be able to function at about 85% efficiency; together, the two would provide 1.7 times the power of a standalone processor. An improvement, but certainly a more marginal and less cost-effective one. It turns out that Windows NT running on Intel hardware provides MP scalability in that range.

What happens when you add a third, fourth, or even more processors? The best models of shared memory multiprocessor performance suggest that the machines get progressively even less efficient as you add more processors to the shared bus. However, with proper care and feeding, multiprocessor configurations with quite good scalability can be configured, although this requires a fair amount of skill and effort.

Microsoft reported at a presentation at CMG 1996 that the symmetric multiprocessing support built for Windows NT version 4.0, in fact, sported a speed-up factor of 0.85. Figure 6 compares the theoretical prospects for linear speed-up in a multiprocessor design to the actual (projected) scalability of Windows NT version 4.0, based on the actual measurements reported by Microsoft. The projection used here is a guess, of course, and your mileage may vary, but it is based on the formula Gunther [1] recommends for predicting multiprocessor scalability. Actual performance is very workload-dependent, as we will discuss below. Figure 6 illustrates that actual performance of a multiprocessor running Windows NT falls far short of the ideal linear speed-up. In fact, beyond four multiprocessors, the projection is that adding more engines hardly boosts performance at all. Many published benchmark results of Windows NT multiprocessors evidence similar behavior, as in, for example, Figure 7, benchmark results that Intel published on its web site in 1998. (Look carefully — the results depict measurements of 1, 2 and 4-way systems.) Results such as these suggest that the theoretical model has at least some underlying validity.

Notice that after a certain point (> 12 processors), adding additional engines actually degrades overall performance, according to the theoretical model. It is worth noting that Windows NT 4.0 multiprocessor scalability is rather typical of general-purpose operating systems – no worse, no better. MVS, IBM's flagship multiprocessing operating system, achieved a similar 0.85 scalability up until it was re-architected for massively-parallel processing. To achieve anywhere near linear scalability requires highly engineered, special purpose parallel processing hardware and complementary operating system services to match.
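The paper does not reproduce Gunther's formula, but one common form of the scalability model he describes projects the relative capacity of an n-way machine from two coefficients: a contention (serialization) term and a coherency-delay term that grows with n*(n-1). The sketch below uses purely illustrative coefficient values, not numbers from this paper, to show why such a projection flattens and eventually turns down:

    #include <stdio.h>

    /* Sketch of a Gunther-style scalability projection. C(n) is the
       relative capacity of an n-way multiprocessor: n discounted by a
       contention term and a coherency-delay term. The coefficients
       below are illustrative guesses only. */
    static double capacity(int n, double sigma, double kappa)
    {
        return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1));
    }

    int main(void)
    {
        double sigma = 0.05; /* contention (serial fraction), assumed */
        double kappa = 0.01; /* coherency delay coefficient, assumed  */

        for (int n = 1; n <= 16; n++)
            printf("%2d-way: relative capacity %.2f\n",
                   n, capacity(n, sigma, kappa));
        return 0;
    }

With coefficients in this range, projected capacity peaks at around ten processors and then declines, which is qualitatively the shape of the NT 4.0 curve in Figure 6.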


Figure 6. Theoretical linear scalability of a multiprocessor compared to actual projected scalability of Windows NT 4.0, based on measurements taken and reported by Microsoft in 1996. The projection uses a formula recommended by Gunther [1], but is consistent with a number of published benchmarking results. For instance, see the article on 8-way scalability in the September 1998 Windows NT Magazine at http://winntmag.com/magazine/article.cfm?IssueID=58&ArticleID=3781. Windows 2000 incorporates further multiprocessor scalability enhancements, but it is not yet clear just how much more scalable Win2K is. [Chart: “Windows SMP Scalability” – Relative Performance versus number of processors, plotting the Ideal linear curve, the NT 4.0 projection, and two more optimistic Win2K projections.]

Windows 2000 incorporates some further enhancements designed to improve multiprocessor scalability. Microsoft implemented a new HAL function called queued spin locks that exploits a new Intel instruction on the Pentium III. It is not clear just how much this new function will help on large scale 8 and 16-way machines. Figure 6 suggests two possibilities, reflecting a relatively marginal increase in multiprocessor scalability to 0.90 or possibly even to 0.95.

To summarize this discussion so far, it is simply not possible to string processor after processor together and double, triple, quadruple, etc., the amount of total processing power available. The principal obstacle of shared-memory multiprocessor designs, which are quite simple from the standpoint of the programmer, is that they typically encounter a bottleneck in accessing shared-memory locations using the shared system memory bus. To understand the nature of this bottleneck, let's proceed to a discussion of the sources of performance degradation in a multiprocessor.

Figure 7. Multiprocessing benchmark results published by Intel in 1998. Look carefully — the results depict measurements of 1, 2 and 4-way systems, comparing similarly configured systems with 1 and 2 MB of Level 2 cache.

Serializing instructions. The first noticeable multiprocessor effect is the performance impact of serializing LOCKed instructions. Instructions coded with the LOCK prefix are guaranteed to run uninterrupted and gain exclusive access to the designated memory locations. Locking the shared-memory bus delays any threads executing on other processors that need access to memory. There are, in addition, a number of hardware-oriented operations performed by the operating system that implicitly serialize by locking the shared-memory bus on an Intel shared-memory multiprocessor. These include setting the active Task State Segment (TSS), which is performed during a context switch of any type. Intel hardware also automatically serializes updates of the Page Directory Entries and Page Table Entries that are used in translating virtual memory addresses to real memory locations. (This impacts the page replacement algorithm that Win2K uses on Intel multiprocessors, as discussed in [3].)

Intel documentation [4] describes some specific serializing instructions that force the processor executing these instructions to drain the pipeline before executing the instruction. Following execution of the serializing instruction, the pipeline is started up again. These serializing instructions include privileged operations that move values into the internal Control and Debug Registers, for example. Serializing instructions also have the effect on the P6 of forcing the processor to re-execute out-of-order instructions.
The performance impact of draining the instruction execution pipeline ought to be obvious. Current generation P5 and P6 Intel processors are pipelined, superscalar architectures. The performance impact of executing an instruction serialized with the LOCK prefix includes potentially stalling the pipelines of other processors executing instructions until the instruction that requires serialization frees up the shared-memory bus. This can be a fairly substantial performance hit, too, which is solely a consequence of running in a multiprocessor environment. The cost of both sorts of instruction serialization contributes to at least some of the less-than-linear scalability that we can expect in a multiprocessor. How much is very difficult to quantify, and certainly workload-dependent. There is also very little one can do about this source of degradation. Without serializing instructions, multiple processors would simply not work reliably.

A second source of multiprocessor interference is interprocessor signaling instructions. These are instructions issued on one processor to signal another processor, for example, to wake it up to process a pending interrupt. By its very nature, interprocessor signaling is quite expensive in performance terms.

Cache effects. Effective on-board CPU caching is critical to the performance of pipelined processors [5]. Intel waited to introduce pipelining with its 486 chips until there was enough real estate available to include an on-board cache. It should not be a big surprise to learn that one secondary effect of multiprocessor coordination and serialization is that it makes caching less effective. This, in turn, serves to slow down the processor's instruction execution rate. In order to understand why SMPs impact cache effectiveness, we will take a detour into the realm of cache coherence in the next section. From a configuration and tuning perspective, one intended effect of setting up an application to run with processor affinity is to improve cache effectiveness and increase the instruction execution rate. Direct measurements of both instruction execution rate and caching efficiency, fortunately, are available via the Pentium Counters. Unfortunately, the Pentium Counter support Microsoft provides in the NT 4.0 Resource Kit falls short of the precision tool that MP configurations require. Moreover, Microsoft no longer provides a means to gather Pentium statistics in Windows 2000.

Spin locks. If two threads are attempting to access the same serializable resource, one thread will acquire the lock, which then blocks the other one until the lock is released. A block of code guarded by some synchronization or locking structure is called a critical section. (The generic name should not be confused with the Win32 API function which provides platform-independent locking services for critical sections.) Problem: what should the thread that is blocked waiting on a critical section do while it is waiting? An application program in Windows 2000 is expected to use Win32 serialization runtime services that put the application to sleep until notified that the lock is available. Win32 serialization services arrange multiple threads waiting on a shared resource in a FIFO queue so that the queueing discipline is fair. This suggests that a key element of designing an application to run well on a shared-memory multiprocessor is to minimize the amount of processing time spent inside critical sections. The shorter the time spent executing inside a locked critical section of code, the less time other threads are blocked waiting to enter it. Much of the re-engineering work Microsoft did on NT 4.0 and again in Windows 2000 was to redesign the critical sections internal to the OS to minimize the amount of time kernel threads would have to wait for shared resources.

When critical sections are designed appropriately, threads waiting on a locked critical section should not have long to wait. Furthermore, while a thread is waiting on a lock, there may be nothing else for it to do. For example, a thread waiting on the Win2K Scheduler lock can perform no useful work until it has successfully acquired that lock. Consider a kernel or device driver thread that is blocked waiting on a lock: the resource the thread is waiting for is required before any other useful work on the processor can be performed, and the wait can be expected to be of very short duration. Under these circumstances, the best thing to do may be to loop back and test for the availability of the lock again. Code that tests for the availability of a lock and finally enters the critical section sets the lock using a serializing instruction. If the same code finds the lock is already set (presumably by a thread running on a different processor), there is nothing to do on a shared-memory multiprocessor other than retry the lock again. The entry code simply branches back to retest the lock. This coding technique is known as a spin lock. If you are able to watch this code's execution, it appears to be stuck in a very tight loop of just a few instructions — until the lock requested is finally available.
the Pentium Counters. Unfortunately, the Pentium Counter Spin locks are used in many, many different places
support Microsoft provides in the NT 4.0 Resource Kit throughout the operating system in Windows 2000 because
falls short of the precision tool that MP configurations operating system code waiting for a critical section to be
require. Moreover, Microsoft no longer provides a means unlocked often has nothing better to do during what is,
to gather Pentium statistics in Windows 2000. hopefully, a very short waiting period than retest the lock.
For example, device drivers are required to use spin locks
to protect data structures if there is any possibility that

Find a CMG regional meeting near you at www.cmg.org/regions


Windows 2000 provides a standard set of spin lock services for Device Drivers to use in Interrupt Service Routines and in kernel threads outside of ISRs. (See the DDK documentation on KeInitializeSpinLock, IoAcquireCancelSpinLock, and related services for more detail.) These standard services allow Device Drivers written in the C language to be portable across versions of Windows NT running on different hardware.
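A driver would use those services along these lines. This is a hedged sketch of the documented DDK pattern (KeAcquireSpinLock raises IRQL to DISPATCH_LEVEL while the lock is held); the data structure and function names are invented for the illustration:

    #include <ntddk.h>

    /* Sketch of the documented DDK spin lock pattern: a driver protects
       a shared data structure so that code running concurrently on
       other processors cannot corrupt it. */
    typedef struct _DEVICE_DATA {
        KSPIN_LOCK Lock;          /* protects PendingCount */
        ULONG      PendingCount;
    } DEVICE_DATA;

    VOID InitDeviceData(DEVICE_DATA *Data)
    {
        KeInitializeSpinLock(&Data->Lock);  /* one-time initialization */
        Data->PendingCount = 0;
    }

    VOID IncrementPending(DEVICE_DATA *Data)
    {
        KIRQL oldIrql;

        /* Raises IRQL to DISPATCH_LEVEL and spins, if necessary, until
           the lock is free. Keep the guarded region short: other
           processors wanting this lock are burning cycles spinning. */
        KeAcquireSpinLock(&Data->Lock, &oldIrql);
        Data->PendingCount++;
        KeReleaseSpinLock(&Data->Lock, oldIrql);
    }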
In Windows NT version 4.0, you can use the Thunk function in the x86 Perf Meter application in the Resource Kit (pperf.exe – the same application used to access the Pentium Counters) to witness spin lock activity. For example, from the Thunk menu, select the ntfs.sys file system driver module, using hal.dll as the target module. Then select KfAcquireSpinLock and KfReleaseSpinLock for monitoring, as illustrated in Figure 8. If you then generate some ntfs file system requests, like emptying the Recycle Bin, you will observe ntfs driver code using the HAL spin lock function to protect critical sections of code.

Consider a 2, 4 or 8-way multiprocessor with an ntfs file system. ntfs.sys functions can be executed on any processor where there is an executing thread that needs access to disk files. In fact, it is likely that ntfs functions will execute concurrently (on more than one processor) from time to time. ntfs.sys uses HAL spin lock functions to protect critical sections of code, preserving the integrity of the file system in a multiprocessor environment.

Figure 8. The Thunk function in the x86 Perf Meter application in the Resource Kit can be used to monitor spin lock activity. In this example, the ntfs.sys file system driver module is calling into hal.dll (the target module). The KfAcquireSpinLock and KfReleaseSpinLock HAL functions are being monitored. ntfs file system requests that modify the file system use HAL spin lock functions to protect critical sections of code.

Spin lock code is effectively dormant when it is run on a single processor system, but consumes a significant number of processor cycles on a multiprocessor. Again, the use of spin locks is, to a large degree, unavoidable. The performance implication of spin locks is that processor utilization increases, but no useful work is actually being performed. Here's where simple measures of processor utilization are misleading. Win2K client applications care about throughput and response time, which may be degraded on a multiprocessor even as measurements of CPU utilization look rosy.

The combined impact of serializing instructions, interprocessor signaling, diminished cache effectiveness, and the consumption of processor cycles by spin lock code serves to limit the scalability of shared-memory multiprocessors in Windows 2000 and other operating systems. Furthermore, each additional processor that is added to the configuration amplifies these multiprocessor scalability factors. These scalability factors make sizing, configuring, and tuning large scale n-way Win2K multiprocessors a very tricky business. Just how tricky should become more apparent after a consideration of the cache coherence problem in the next section.

Cache coherence

Figure 9. Two threads operating on the same memory location concurrently lead to problems maintaining the coherence of information stored in local processor caches. In this example, Thread 0 executing on CPU 0 is about to reset a lock word at location mem1, resident in its Level 2 cache (unlock: xchg EAX,mem1). Meanwhile, Thread 1 executing on CPU 1 is attempting to set the same lock word at location mem1 (spinlock: xchg EAX,mem1 / cmp EAX,zero / jne spinlock) to enter the critical section Thread 0 is about to exit. How the update to location mem1 performed in local cache on CPU 0 is propagated to CPU 1 is an example of the cache coherence problem.


The cache effects of running on a shared-memory multiprocessor are probably the most salient of the factors limiting the scalability of this type of computer architecture. The various forms of processor cache, including Translation Lookaside Buffers (TLBs), code and data caches, and branch prediction tables, all play a critical role in the performance of pipelined machines like the Pentium, Pentium Pro, Pentium II, and Pentium III. For the sake of performance, in a multiprocessor configuration each CPU retains its own private cache memory, as depicted in Figure 9. We have seen that multiple threads executing inside the Win2K kernel or running device driver code concurrently can attempt to access the same memory locations. Propagating changes to the contents of memory locations cached locally to other engines that may have their own copies of the same memory is a major issue in designing multiprocessors to operate correctly. This is known as the cache coherence problem in shared-memory multiprocessors. Cache coherence issues also have significant performance ramifications.

Maintaining cache coherence in a shared-memory multiprocessor is absolutely necessary in order for programs to execute correctly. While, for the most part, independent program execution threads operate independently of each other, sometimes they must interact. Whenever they Read and Write common or shared-memory data structures, threads must communicate and coordinate accesses to these memory locations. This necessary coordination inevitably has performance consequences. We will illustrate this side effect by drawing on an example where two kernel threads are attempting to gain access to the Win2K Scheduler Ready Queue simultaneously. As indicated earlier, a global data structure like the Ready Queue that is subject to access from multiple threads executing concurrently on different processors must be protected by a lock. Let's look at how a lock word value set by one thread on one processor is propagated to cache memory in another processor where another thread is attempting to gain access to the same critical section.

In Figure 9, Thread 0 running on CPU 0, which has just finished updating the Win2K Scheduler Ready Queue, for example, is about to exit a critical section. Upon exiting the critical section of code, Thread 0 resets the lock word at location mem1 using a serializing instruction like XCHG. Instead of locking the bus during the execution of the XCHG instruction, the Intel P6 operates instead only on the cache line that contains mem1. This is to boost performance: the locked memory fetch and store that the instruction otherwise requires would stall the CPU 0 pipeline. In the Intel Architecture, if the operand of a serializing instruction like XCHG is resident in processor cache in a multiprocessor configuration, then the P6 does not lock the shared-memory bus. This is a form of deferred write-back caching, which is very efficient. Not only does the processor cache hardware use this approach to caching frequently accessed instructions and data, but we will see that so do Win2K systems software and hardware-cached disk controllers, for example.

In the interest of program correctness, updates made to private cache, which are deferred, ultimately must be applied to the appropriate shared-memory locations before any threads running on other processors attempt to access the same information. Moreover, as Figure 9 illustrates, there is an additional data integrity exposure because another CPU can (and frequently does) have the same mem1 location resident in cache. The diagram illustrates a second thread that is in a spin loop trying to enter the same critical section. This code continuously tests the contents of the lock word at mem1 until it is successful. For the sake of performance, the XCHG instruction running on CPU 1 also operates only on the cache line that contains mem1 and does not attempt to lock the bus each time, because that would stall each processor's instruction execution pipeline. We can see that unless there is some way to let CPU 1 know that code running on CPU 0 has changed the contents of mem1, the code on CPU 1 will spin in this loop forever. The Intel P6 processors solve this problem of maintaining cache coherence using a method conventionally called snooping.

Intel MESI snooping protocol. Snooping protocols to maintain cache coherence have each processor listening to the shared-memory bus for changes in the status of cache resident addresses that other processors happen to be operating on concurrently. Snooping requires that processors place the memory addresses of any shared cache lines being updated on the memory bus. All processors listen on the memory bus for memory references made by other processors that affect memory locations that are resident in their private cache. Thus, the term snooping. The term snooping also has the connotation that this method for keeping every processor's private cache memory synchronized can be performed in the background (which it is) without a major performance hit (which is true, but only up to a point). In practice, maintaining cache coherence is a complex process that can interfere substantially with normal pipelined instruction execution, and it generates some serious scalability issues.

Let's illustrate how the Intel snooping protocol works, continuing with our Ready Queue lock word example. CPU 1, snooping on the bus, recognizes that the update to the mem1 address performed by CPU 0 invalidates its cache line containing mem1. Then, because the cache line containing mem1 is marked invalid, CPU 1 is forced to refetch mem1 from memory the very next time it attempts to execute the XCHG instruction inside the spin lock code. Of course, at this point CPU 0 has still not yet updated mem1 in memory. But CPU 0, also snooping on the shared-memory bus, discovers that CPU 1 is attempting to read the current value of mem1 from memory. CPU 0 intercepts and delays the request. Then CPU 0 writes the cache line containing mem1 back to memory. Then, and only then, is CPU 1 allowed to continue refreshing the corresponding line in its private cache and updating it.


The cache coherence protocol used in the Intel Architecture is denoted MESI, which corresponds to the four states of each line in processor cache: modified, exclusive, shared, or invalid. The MESI protocol very rigidly defines what actions each processor in a multiprocessor configuration must take based on the state of a line of cache and the attempt by another processor to act on the same data. The scenario described above illustrates just one set of circumstances that the MESI protocol is designed to handle. Let's review this example using the Intel MESI terminology.

Invalid – An invalid line that must be refreshed from memory.
Exclusive – Valid line, unmodified; guaranteed that this line only exists in this cache.
Shared – Valid line, unmodified; the line also exists in at least one other cache.
Modified – Valid line, modified; guaranteed that this line only exists in this cache, and the corresponding memory line is stale.

Table 1. The MESI cache coherence protocol used in the Intel Architecture. MESI refers to the four states that a line of cache can be in: modified, exclusive, shared, or invalid. At any one time, a line in cache is in one and only one of these four states.
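The four states in Table 1 can be summarized in code, along with the snoop-triggered transitions exercised in the walk-through that follows. This is an illustrative sketch of the protocol's bookkeeping only; real MESI transitions also depend on whether the local processor is reading or writing the line:

    /* MESI cache line states (Table 1) and a few of the transitions
       exercised by the Ready Queue lock word example. Not a complete
       or authoritative state machine. */
    typedef enum {
        INVALID,    /* line must be refreshed from memory              */
        EXCLUSIVE,  /* valid, unmodified, present only in this cache   */
        SHARED,     /* valid, unmodified, present in another cache too */
        MODIFIED    /* valid, modified; the memory copy is stale       */
    } MesiState;

    /* What happens to THIS processor's copy of a line when it snoops
       another processor's access to the same line. */
    MesiState on_snooped_access(MesiState current, int other_cpu_writes)
    {
        switch (current) {
        case EXCLUSIVE:
            /* Another CPU read our exclusive line: it becomes shared;
               a write elsewhere invalidates our copy. */
            return other_cpu_writes ? INVALID : SHARED;
        case SHARED:
            /* Another CPU wrote the line: our copy is now stale. */
            return other_cpu_writes ? INVALID : SHARED;
        case MODIFIED:
            /* Another CPU wants the line: we must write it back to
               memory first; on a read both copies then become shared. */
            return other_cpu_writes ? INVALID : SHARED;
        default:
            return current; /* INVALID stays invalid until refetched */
        }
    }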

Suppose that Thread 1, running in a spin lock on CPU 1, starts by testing the lock word at location mem1. The 32 bytes containing this memory location are brought into the cache. This line of cache is flagged exclusive because it is currently contained only in CPU 1 cache. Meanwhile, when CPU 0 executes the first part of the XCHG instruction on mem1 designed to reset the lock, the 32 bytes containing this memory location are brought into the CPU 0 cache. CPU 1, snooping on the bus, detects CPU 0's interest in a line of cache that is currently marked exclusive and transitions this line from exclusive to shared. CPU 1 signals CPU 0 that it too has this line of memory in cache, so that CPU 0 marks the line shared, too. The second part of the XCHG instruction updates mem1 in CPU 0 cache. The cache line resident in CPU 0 transitions from shared to modified as a result. Meanwhile CPU 1, snooping on the bus, flags its corresponding cache line as invalid, as described above. Subsequent execution of the XCHG instruction within the original spin lock code executing on CPU 1 to acquire the lock finds the cache line invalid. CPU 1 then attempts to refresh the cache line from memory, locking the bus in the process to ensure coherent execution of all programs. CPU 0, snooping on the bus, blocks the memory fetch by CPU 1 because the state of that memory in CPU 0 cache is modified. CPU 0 then writes the contents of this line of cache back to memory, reflecting the current data in CPU 0's cache. At this point, CPU 1's request to refresh cache memory is honored, and the now-current 32 bytes containing mem1 are brought into CPU 1 cache. At the end of this sequence, both CPU 0 and CPU 1 have valid data in cache, with both lines in the shared state.

The MESI protocol ensures that cache memory in the various independently executing processors is consistent no matter what the other processors are doing. Clearly, what is happening in one processor can interfere with the instruction execution stream running on the other. With multiple threads accessing shared-memory locations, there is no avoiding this. These operations on shared memory stall the pipelines of the processors affected. For example, when CPU 0 snoops on the bus and finds another processor is attempting to fetch a line of cache from memory that is also resident in its private cache in a modified state, then whatever instructions CPU 0 is attempting to execute in its pipeline are suspended. Writing back modified data from cache to memory takes precedence because another processor is waiting. Similarly, CPU 1 running its spin lock code must update the state of that shared line of cache when CPU 0 resets the lock word. Once the line of cache containing the lock word is marked invalid on CPU 1, the serializing instruction issued on CPU 1 stalls the pipeline because cache must be refreshed from memory. The pipeline is stalled until CPU 0 can update memory and allow the memory fetch operation to proceed.

Memory bus contention. One not-so-obvious performance implication of snooping protocols is that they utilize the shared-memory bus heavily. Every time an instruction executing on one processor needs to fetch a new value from memory or update an existing one, it must place the designated memory address on the shared bus. The bus itself is a resource which must be shared. With more and more processors executing, the bus tends to get quite busy. When the bus is in use, other processors must wait. Utilization of the shared-memory bus is likely to be the most serious bottleneck impacting scalability in multiprocessor configurations of three, four, or more processing engines.

The measurement facility in the Intel P6 or Pentium Pro processors (including Pentium II and Pentium III processors) was strengthened to help hardware designers cope with the demands of more complicated multiprocessor designs. By installing the Pentium Counter support provided in the Windows NT 4.0 Resource Kit, system administrators and performance analysts can access these hardware measurements, as discussed earlier. (This facility does not work under Windows 2000 and was removed from the Windows 2000 Resource Kit.)


While these Counters are given the cautionary rating of Wizard within Perfmon, we hope that the discussion above on multiprocessor design and performance will give you the confidence to start using them to help diagnose specific performance problems associated with large scale Win2K multiprocessors. The P6 Counters provide valuable insight into multiprocessor performance, including direct measurement of the processor instruction rate, Level 2 cache, TLB, branch prediction, and the all-important shared-memory bus.

The P6 measurements that can often shed the most light on multiprocessor performance are the shared-memory bus measurements. Appendix A lists the various P6 bus measurement counters, using the Microsoft counter names from Counters.hlp [5]. Many of the counter names and their unilluminating Explain text are very arcane and esoteric. For example, to understand what Bus DRDY asserted clocks/second means might send us scurrying in vain to the Intel Architecture manuals for help, where, unfortunately, not much help can be had. A second observation, triggered by the experience of viewing the counters under controlled conditions, is that some of them probably do not mean what they appear to. For example, the Bus LOCK asserted clocks/sec counter consistently appears to be zero on both uniprocessor and multiprocessor configurations. Not much help there. The shared-memory bus is driven at the processor clock rate, and some counter names use the term cycles while others use the term clocks; the two terms appear to be interchangeable. Although not explicitly indicated, some counters that mention neither clocks nor cycles are also measured in clocks. For example, an especially useful measure is Bus requests outstanding, which measures the total number of clocks the bus is busy.

Bus memory transactions and Bus all transactions measure the number of bus requests. One thing about the bus measurements is that they are not processor-specific, since the memory bus is a shared component. The memory bus that the processors share is a single resource, subject to the usual queuing delays. We will derive a measure of bus queuing delay in a moment.

Now, let's look at some more P6 measurement data from a multiprocessor system. A good place to start is with Bus all transactions/sec, which, as noted above, is the total number of bus requests. Figure 10 shows that when the bus is busy, it usually is busy due to memory accesses. Bus memory transactions/sec represent over 99% of all bus transactions. The measurement data is consistent with the discussion above suggesting that bus utilization is often the bottleneck in shared-memory multiprocessors that utilize snooping protocols to maintain cache coherence. Every time any processor attempts to access main memory, it must first gain access to the shared bus.

Figure 10. Memory accesses drive bus utilization. Memory transactions represent over 99% of all bus transactions in this example, which is typical of both uniprocessors and multiprocessors. The shared bus can easily become a bottleneck on a multiprocessor.

Figure 11. Bus snoop stalled cycles/sec provides a direct measure of multiprocessor shared-memory contention. In this example from a two-way multiprocessor, the number of stalls due to snooping is relatively small compared to all resource stalls.


Larger Level 2 caches help reduce bus traffic, but there are diminishing returns from caches that are, in effect, too big. Each time a memory location is fetched directly from a Level 1 or Level 2 cache, it is not necessary to broadcast the address on the bus. However, at some point, larger caches do not result in significant improvements in the rate of cache hits, yet they increase the management overhead necessary to maintain cache coherence. In this regard, both the rate of Level 2 cache misses and the number of write-back memory transactions are relevant, because both actions drive bus utilization. The P6 Level 2 cache performance measurements are especially useful in this context for evaluating different processor configurations from Intel and other vendors that have different amounts of Level 2 cache. By accessing this measurement data, you can assess the benefits of different configuration options directly. This is always a better method than relying on some Rule of Thumb value proposed by this or that performance expert, perhaps based on a measurement taken running a benchmark workload that does not reflect your workload.

Another Counter, called Bus snoop stalled cycles/sec, has intrinsic interest on a multiprocessor. A high rate of stalls due to snooping is a direct indicator of multiprocessor contention. See Figure 11, which again was measured on a two-way multiprocessor. Notice the number of snooping-induced stalls is low in this example. Even though as a percentage of the total resource stalls they are practically insignificant in this example, this is still a measurement that bears watching.

Figure 12. Tracking P6 bus measurements on a uniprocessor using Performance Monitor.

Figure 13. Calculating the average clocks per bus transaction from the uniprocessor measurements shown in Figure 12 using Excel. The number of clocks per bus transaction ranges between 10 and 30, with an average of about 18. [Chart: Bus requests outstanding/sec and Bus all transactions/sec (left axis, clocks) plotted over time against the derived Clocks per bus transaction (right axis).]

Next, consider the P6 Bus requests outstanding Counter, which is a direct measurement of bus utilization in clocks. By also monitoring Bus all transactions, you can derive a simple response time measure of bus transactions, measured as the average clocks per transaction:

    Average clocks per transaction = Bus requests outstanding ÷ Bus all transactions

Assuming contention for the shared-memory bus is a factor, saturation of the bus on an n-way multiprocessor will likely drive up bus transaction response time, measured in clocks per transaction on average. Figure 12, a Perfmon screen shot, provides a uniprocessor baseline for this calculation. Since Perfmon cannot perform any arithmetic calculations, we exported the chart data to a file so that it could be processed in an Excel spreadsheet. Using Excel, we are able to divide Bus requests outstanding by Bus all transactions to derive the average number of clocks per transaction. (Bear in mind that we can access only two P5 or P6 Counters at a time. Since memory transactions typically represent more than 99% of all bus transactions, it is safe to assume that clock cycles calculated using this formula genuinely do reflect the time it takes the processor to access memory.) The average clocks per bus transaction in this example generally falls in the range of 10-30 clocks, with an average of about 18 clocks per transaction. These calculations are summarized in the chart shown in Figure 13.

In the case of a uniprocessor, the memory bus is a dedicated resource and there is no contention. Now compare the uniprocessor baseline in Figure 12 to a two-way multiprocessor in Figure 14. Here the average clocks per transaction is about 30, coming in at the high end of the uniprocessor range. The average number of clocks per bus transaction increases because of queuing delays in accessing the shared-memory bus in the multiprocessor. In a shared-memory multiprocessor, there is going to be memory bus contention. By tracking these P6 Counters, you can detect environments where adding more processors to the system will not speed up processing at all, because the shared-memory bus is already saturated. Bus contention tends to set an upper limit on the performance of a multiprocessor configuration, and the P6 Counters let you measure this.

Figure 14. Average clocks per bus transaction on a two-way multiprocessor. The average bus transaction here takes about thirty clocks.
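Once the two counters have been exported, the derivation is a one-line division. A minimal sketch with illustrative sample values, chosen to reproduce the ~18-clock uniprocessor average quoted above:

    #include <stdio.h>

    /* Derive average clocks per bus transaction from the two P6 bus
       counters, as done in the Excel step described in the text.
       Sample values are illustrative only. */
    int main(void)
    {
        double bus_requests_outstanding = 90000000.0; /* clocks busy/sec  */
        double bus_all_transactions    =   5000000.0; /* transactions/sec */

        double clocks_per_transaction =
            bus_requests_outstanding / bus_all_transactions; /* = 18.0 */

        printf("average clocks per bus transaction: %.1f\n",
               clocks_per_transaction);
        return 0;
    }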
Queued spin locks. With the Pentium III, Intel introduced a new instruction called PAUSE that reduces the bus contention that results from repeatedly executing spin lock code. The Windows 2000 HAL adds a new queued spin lock function that exploits the new hardware instruction, where available. Device driver code trying to enter a critical section protected by a queued spin lock issues the PAUSE instruction, referencing the lock word protecting that piece of code. PAUSE halts the CPU until the lock word is changed by a different executing thread running on another processor. At that point, the PAUSEd processor wakes up and resumes execution.

The PAUSE instruction was designed to eliminate the bus transactions that occur when spin lock code repeatedly tries to test and set a memory location. Instead of repeatedly executing code that tests the value of a lock word to see if it is safe to enter a critical section, queued spin locks wait quietly without generating any bus transactions. Since saturation of the shared-memory bus is an inherent problem in shared-memory multiprocessors, this innovation should improve the scalability of Windows 2000. At the same time, Intel designers also boosted the performance of the Pentium III system bus significantly, something which should also improve multiprocessor scalability under Windows 2000.
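In spirit, a PAUSE-assisted spin-wait loop looks like the sketch below. This is an illustration of the technique, not the Windows 2000 HAL's actual queued spin lock implementation (which additionally queues waiters so the lock is granted in order). _mm_pause is the compiler intrinsic that emits the PAUSE opcode; on processors that predate the instruction it executes as a NOP:

    #include <windows.h>
    #include <emmintrin.h>   /* _mm_pause: emits the PAUSE opcode */

    static volatile LONG lockWord = 0;

    /* While the lock is held elsewhere, the waiter spins on an ordinary
       read and issues PAUSE, rather than hammering the bus with locked
       test-and-set transactions. */
    void acquire_pause_spin_lock(void)
    {
        for (;;) {
            /* Only attempt the serializing exchange when the lock
               appears free; the read hits local cache and generates no
               bus traffic while the line stays valid. */
            if (lockWord == 0 &&
                InterlockedExchange(&lockWord, 1) == 0)
                return;
            _mm_pause();  /* hint: we are in a spin-wait loop */
        }
    }

    void release_pause_spin_lock(void)
    {
        InterlockedExchange(&lockWord, 0);
    }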
References

[1] Neil J. Gunther, The Practical Performance Analyst. New York: McGraw-Hill, 1998.
[2] Mark B. Friedman, “Optimizing the Performance of Wintel Applications.” CMG '98 Proceedings, (December 1998), 245-259.
[3] Mark B. Friedman, “Windows NT Page Replacement Policies.” CMG '99 Proceedings, (December 1999), 234-244.
[4] Intel Architecture Software Developer's Guide: Volume 3, System Programming Guide, 1998.
[5] Microsoft Windows NT 4.0 Workstation Resource Kit. Redmond, WA: Microsoft Press, 1996.
