Multiprocessor Scalability in Windows
Performance Professionals
The Computer Measurement Group, commonly called CMG, is a not-for-profit, worldwide organization of data processing professionals committed to the measurement and management of computer systems. CMG members are primarily concerned with performance evaluation of existing systems to maximize performance (e.g., response time, throughput) and with capacity management, where planned enhancements to existing systems or the design of new systems are evaluated to determine the resources required to provide adequate performance at a reasonable cost.
This paper was originally published in the Proceedings of the Computer Measurement Group’s 2000 International Conference.
Copyright 2000 by The Computer Measurement Group, Inc. All Rights Reserved. Published by The Computer Measurement Group, Inc. (CMG), a non-profit
Illinois membership corporation. Permission to reprint in whole or in any part may be granted for educational and scientific purposes upon written application to
the Editor, CMG Headquarters, 151 Fries Mill Road, Suite 104, Turnersville, NJ 08012.
BY DOWNLOADING THIS PUBLICATION, YOU ACKNOWLEDGE THAT YOU HAVE READ, UNDERSTOOD AND AGREE TO BE BOUND BY THE
FOLLOWING TERMS AND CONDITIONS:
License: CMG hereby grants you a nonexclusive, nontransferable right to download this publication from the CMG Web site for personal use on a single
computer owned, leased or otherwise controlled by you. In the event that the computer becomes dysfunctional, such that you are unable to access the
publication, you may transfer the publication to another single computer, provided that it is removed from the computer from which it is transferred and its use
on the replacement computer otherwise complies with the terms of this Copyright Notice and License.
Copyright: No part of this publication or electronic file may be reproduced or transmitted in any form to anyone else, including transmittal by e-mail, by file
transfer protocol (FTP), or by being made part of a network-accessible system, without the prior written permission of CMG. You may not merge, adapt,
translate, modify, rent, lease, sell, sublicense, assign or otherwise transfer the publication, or remove any proprietary notice or label appearing on the
publication.
Disclaimer; Limitation of Liability: The ideas and concepts set forth in this publication are solely those of the respective authors, and not of CMG, and CMG
does not endorse, approve, guarantee or otherwise certify any such ideas or concepts in any application or usage. CMG assumes no responsibility or liability
in connection with the use or misuse of the publication or electronic file. CMG makes no warranty or representation that the electronic file will be free from
errors, viruses, worms or other elements or codes that manifest contaminating or destructive properties, and it expressly disclaims liability arising from such
errors, elements or codes.
General: CMG reserves the right to terminate this Agreement immediately upon discovery of violation of any of its terms.
Learn the basics and latest aspects of IT Service Management at CMG's Annual Conference - www.cmg.org/conference
Buy the Latest Conference Proceedings and Find Latest Computer Performance Management 'How To' for All Platforms at www.cmg.org
DataCore Software, Inc.
1020 Eighth Avenue South, Suite 6
Naples, FL USA 34102
[email protected]
This paper provides an overview of the multiprocessing support in the Microsoft Windows NT/2000 operating system,
with an emphasis on scalability and other capacity planning issues. It also discusses specific features of the Intel P6
architecture that provide the hardware basis for large scale multiprocessing systems. As a shared memory multiprocessing
implementation, Windows NT/2000 is predictably vulnerable to saturation on the shared memory bus. Processor
hardware measurements that can illuminate memory bus contention when it appears are also described and discussed.
P6 P6
precisely where the bottleneck in shared-memory
designs often is.
multiprocessor system running NT 4.0 Server. (Be careful,
the vertical axis scale was adjusted down to a maximum of
thirty to make the chart data easier to decipher.) The two
processor instances of % Privileged Time and % Interrupt
Time Counters are shown. The processing workload is
roughly balanced across both processors, although the load
processor. Then, the thread calls SetProcessAffinityMask with a corresponding 32-bit affinity mask that indicates which processors Threads from the process can be dispatched on. Figure 4 illustrates the use of this function in Taskman, which allows you to set a process's affinity mask dynamically, subject to the usual security restrictions
multiprocessor performance suggest that the machines get progressively less efficient as you add more processors to the shared bus. However, with proper care
[Figure: relative performance as processors are added, comparing an ideal linear curve against measured Win2K scaling.]

17 instructions. Instructions coded with the LOCK prefix are guaranteed to run uninterrupted and gain exclusive access to the designated memory locations. Locking the shared-memory
bus requires these instructions to drain the pipeline before executing the instruction. Following execution of the serializing instruction, the pipeline is started up again. These serializing instructions include privileged operations that move values into internal Control and Debug Registers, for example. Serializing instructions also have the effect on the P6 of forcing the processor to re-execute out-of-order instructions.

The performance impact of draining the instruction execution pipeline ought to be obvious. Current-generation P5 and P6 Intel processors are pipelined, superscalar architectures. The performance impact of executing an instruction serialized with the LOCK prefix includes potentially stalling the pipelines of other processors executing instructions until the instruction that requires serialization frees up the shared-memory bus. This can be a fairly substantial performance hit, too, which is solely a consequence of running in a multiprocessor environment. The cost of both sorts of instruction serialization contributes to at least some of the less-than-linear scalability that we can expect in a multiprocessor. How much is very difficult to quantify, and certainly workload-dependent. There is also very little one can do about this source of degradation. Without serializing instructions, multiple processors would simply not work reliably.

A second source of multiprocessor interference is interprocessor signaling instructions. These are instructions issued on one processor to signal another processor, for example, to wake it up to process a pending interrupt. By its very nature, interprocessor signaling is quite expensive in performance terms.

Cache effects. Effective on-board CPU caching is critical to the performance of pipelined processors [5]. Intel waited to introduce pipelining with its 486 chips until there was enough real estate available to include an on-board cache. It should not be a big surprise to learn that one secondary effect of multiprocessor coordination and serialization is that it makes caching less effective. This, in turn, serves to slow down the processor's instruction execution rate. In order to understand why SMPs impact cache effectiveness, we will take a detour into the realm of cache coherence in the next section. From a configuration and tuning perspective, one intended effect of setting up an application to run with processor affinity is to improve cache effectiveness and increase the instruction execution rate. Direct measurements of both instruction execution rate and caching efficiency, fortunately, are available via the Pentium Counters. Unfortunately, the Pentium Counter support Microsoft provides in the NT 4.0 Resource Kit falls short of the precision tool that MP configurations require. Moreover, Microsoft no longer provides a means to gather Pentium statistics in Windows 2000.

Spin locks. If two threads are attempting to access the same serializable resource, one thread will acquire the lock, which then blocks the other one until the lock is released. A block of code guarded by some synchronization or locking structure is called a critical section. (The generic name should not be confused with the Win32 API function which provides platform-independent locking services for critical sections.) Problem: what should the thread that is blocked waiting on a critical section do while it is waiting? An application program in Windows 2000 is expected to use Win32 serialization runtime services that put the application to sleep until notified that the lock is available. Win32 serialization services arrange multiple threads waiting on a shared resource in a FIFO queue so that the queueing discipline is fair. This suggests that a key element of designing an application to run well on a shared-memory multiprocessor is to minimize the amount of processing time spent inside critical sections. The shorter the time spent executing inside a locked critical section of code, the less time other threads are blocked waiting to enter it. Much of the re-engineering work Microsoft did on NT 4.0 and again in Windows 2000 was to redesign the critical sections internal to the OS to minimize the amount of time kernel threads would have to wait for shared resources.

When critical sections are designed appropriately, threads waiting on a locked critical section should not have long to wait. Furthermore, while a thread is waiting on a lock, there may be nothing else for it to do. For example, a thread waiting on the Win2K Scheduler lock can perform no useful work until it has successfully acquired that lock. Or consider a kernel or device driver with OSD privileges that is blocked waiting on a lock. The resource the thread is waiting for is required before any other useful work on the processor can be performed. The wait can be expected to be of very short duration. Under these circumstances, the best thing to do may be to loop back and test for the availability of the lock again. Code that tests for the availability of a lock and finally enters a critical section sets the lock using a serializing instruction. If the same code finds the lock is already set (presumably by a thread running on a different processor), there is nothing to do on a shared-memory multiprocessor other than retry the lock again. The entry code simply branches back to retest the lock. This coding technique is known as a spin lock. If you are able to watch this code's execution, it appears to be stuck in a very tight loop of just a few instructions until the lock requested is finally available.

Spin locks are used in many, many different places throughout the operating system in Windows 2000 because operating system code waiting for a critical section to be unlocked often has nothing better to do, during what is hopefully a very short waiting period, than retest the lock.
For example, device drivers are required to use spin locks
to protect data structures if there is any possibility that
access to disk files. In fact, it is likely that ntfs
functions will execute concurrently (on more than one
processor) from time to time. ntfs.sys uses HAL spin
lock functions to protect critical sections of code,
preserving the integrity of the file system in a multiprocessor environment.
[Figure: two P6 processors, each with a private L2 cache holding the lock word mem1; one processor spins in a tight loop (cmp EAX,zero / jne spinlock) waiting for the lock.]

in the Resource Kit (pperf.exe, the same application used to access the Pentium Counters) to witness spin lock activity. For example, from the Thunk menu,
The cache effects of running on a shared-memory multiprocessor are probably the most salient of the factors limiting the scalability of this type of computer architecture. The various forms of processor cache, including Translation Lookaside Buffers (TLBs), code and data caches, and branch prediction tables, all play a critical role in the performance of pipelined machines like the Pentium, Pentium Pro, Pentium II, and Pentium III. For the sake of performance, in a multiprocessor configuration each CPU retains its own private cache memory, as depicted in Figure 9. We have seen that multiple threads executing inside the Win2K kernel or running device driver code concurrently can attempt to access the same memory locations. Propagating changes to the contents of memory locations cached locally to other engines that may have their own copies of the same memory is a major issue in designing multiprocessors to operate correctly. This is also known as the cache coherence problem in shared-memory multiprocessors. Cache coherence issues also have significant performance ramifications.

Maintaining cache coherence in a shared-memory multiprocessor is absolutely necessary in order for programs to execute correctly. While, for the most part, independent program execution threads operate independently of each other, sometimes they must interact. Whenever they Read and Write common or shared-memory data structures, threads must communicate and coordinate accesses to these memory locations. This necessary coordination inevitably has performance consequences. We will illustrate this side effect by drawing on an example where two kernel threads are attempting to gain access to the Win2K Scheduler Ready Queue simultaneously. As indicated earlier, a global data structure like the Ready Queue that is subject to access from multiple threads executing concurrently on different processors must be protected by a lock. Let's look at how a lock word value set by one thread on one processor is propagated to cache memory in another processor where another thread is attempting to gain access to the same critical section.

In Figure 9, Thread 0 running on CPU 0, having just finished updating the Win2K Scheduler Ready Queue, is about to exit a critical section. Upon exiting the critical section of code, Thread 0 resets the lock word at location mem1 using a serializing instruction like XCHG. Instead of locking the bus during the execution of the XCHG instruction, the Intel P6 operates instead only on the cache line that contains mem1. This is to boost performance. The locked memory fetch and store that the instruction otherwise requires would stall the CPU 0 pipeline. In the Intel Architecture, if the operand of a serializing instruction like XCHG is resident in processor cache in a multiprocessor configuration, then the P6 does not lock the shared-memory bus. This is a form of deferred write-back caching, which is very efficient. Not only does the processor cache hardware use this approach to caching frequently accessed instructions and data, but we will see that so do Win2K systems software and hardware-cached disk controllers, for example.

In the interest of program correctness, updates made to private cache, which are deferred, ultimately must be applied to the appropriate shared-memory locations before any threads running on other processors attempt to access the same information. Moreover, as Figure 9 illustrates, there is an additional data integrity exposure because another CPU can (and frequently does) have the same mem1 location resident in cache. The diagram illustrates a second thread that is in a spin loop trying to enter the same critical section. This code continuously tests the contents of the lock word at mem1 until it is successful. For the sake of performance, the XCHG instruction running on CPU 1 also operates only on the cache line that contains mem1 and does not attempt to lock the bus each time, because that would stall each processor's instruction execution pipeline. We can see that unless there is some way to let CPU 1 know that code running on CPU 0 has changed the contents of mem1, the code on CPU 1 will spin in this loop forever. The Intel P6 processors solve this problem in maintaining cache coherence using a method conventionally called snooping.

Intel MESI snooping protocol. Snooping protocols to maintain cache coherence have each processor listening to the shared-memory bus for changes in the status of cache-resident addresses that other processors happen to be operating on concurrently. Snooping requires that processors place the memory addresses of any shared cache lines being updated on the memory bus. All processors listen on the memory bus for memory references made by other processors that affect memory locations that are resident in their private cache. Thus, the term snooping. The term snooping also has the connotation that this method for keeping every processor's private cache memory synchronized can be performed in the background (which it is) without a major performance hit (which is true, but only up to a point). In practice, maintaining cache coherence is a complex process that can interfere substantially with normal pipelined instruction execution and generates some serious scalability issues.

Let's illustrate how the Intel snooping protocol works, continuing with our Ready Queue lock word example. CPU 1, snooping on the bus, recognizes that the update to the mem1 address performed by CPU 0 invalidates its cache line containing mem1. Then, because the cache line containing mem1 is marked invalid, CPU 1 is forced to refetch mem1 from memory the very next time it attempts to execute the XCHG instruction inside the spin lock code. Of course, at this point CPU 0 has still not yet updated mem1 in memory. But CPU 0, also snooping on the shared-memory bus, discovers that CPU 1 is attempting to read the current value of mem1 from memory. CPU 0 intercepts
and delays the request. Then CPU 0 writes the cache line containing mem1 back to memory. Then, and only then, is CPU 1 allowed to continue refreshing the corresponding line in its private cache and updating it.

The cache coherence protocol used in the Intel Architecture is denoted MESI, which corresponds to the four states of each line in processor cache: modified, exclusive, shared, or invalid. The MESI protocol very rigidly defines what actions each processor in a multiprocessor configuration must take based on the state of a line of cache and the attempt by another processor to act on the same data. The scenario described above illustrates just one set of circumstances that the MESI protocol is designed to handle. Let's review this example using the Intel MESI terminology.

Invalid: An invalid line that must be refreshed from memory.
Exclusive: Valid line, unmodified; guaranteed that this line only exists in this cache.
Shared: Valid line, unmodified; the line exists in at least one other cache.
Modified: Valid line, modified; guaranteed that this line only exists in this cache, and the corresponding memory line is stale.

TABLE 1. THE MESI CACHE COHERENCE PROTOCOL USED IN THE INTEL ARCHITECTURE. MESI REFERS TO THE FOUR STATES THAT A LINE OF CACHE CAN BE IN: MODIFIED, EXCLUSIVE, SHARED, OR INVALID. AT ANY ONE TIME, A LINE IN CACHE IS IN ONE AND ONLY ONE OF THESE FOUR STATES.

Suppose that Thread 1 running in a spin lock on CPU 1 starts by testing the lock word at location mem1. The 32 bytes containing this memory location are brought into the cache. This line of cache is flagged exclusive because it is currently contained only in CPU 1 cache. Meanwhile, when CPU 0 executes the first part of the XCHG instruction on mem1 designed to reset the lock, the 32 bytes containing this memory location are brought into the CPU 0 cache. CPU 1, snooping on the bus, detects CPU 0's interest in a line of cache that is currently marked exclusive and transitions this line from exclusive to shared. CPU 1 signals CPU 0 that it too has this line of memory in cache so that CPU 0 marks the line shared, too. The second part of the XCHG instruction updates mem1 in CPU 0 cache. The cache line resident in CPU 0 transitions from shared to modified as a result. Meanwhile CPU 1, snooping on the bus, flags its corresponding cache line as invalid, as described above. Subsequent execution of the XCHG instruction within the original spin lock code executing on CPU 1 to acquire the lock finds the cache line invalid. CPU 1 then attempts to refresh the cache line from memory, locking the bus in the process to ensure coherent execution of all programs. CPU 0, snooping on the bus, blocks the memory fetch by CPU 1 because the state of that memory in CPU 0 cache is modified. CPU 0 then writes the contents of this line of cache back to memory, reflecting the current data in CPU 0's cache. At this point, CPU 1's request to refresh cache memory is honored, and the now-current 32 bytes containing mem1 are brought into CPU 1 cache. At the end of this sequence, both CPU 0 and CPU 1 have valid data in cache, with both lines in the shared state.

The MESI protocol ensures that cache memory in the various independently executing processors is consistent no matter what the other processors are doing. Clearly, what is happening in one processor can interfere with the instruction execution stream running on the other. With multiple threads accessing shared-memory locations, there is no avoiding this. These operations on shared memory stall the pipelines of the processors affected. For example, when CPU 0 snoops on the bus and finds another processor is attempting to fetch a line of cache from memory that is also resident in its private cache in a modified state, then whatever instructions CPU 0 is attempting to execute in its pipeline are suspended. Writing back modified data from cache to memory takes precedence because another processor is waiting. Similarly, CPU 1 running its spin lock code must update the state of that shared line of cache when CPU 0 resets the lock word. Once the line of cache containing the lock word is marked invalid on CPU 1, the serializing instruction issued on CPU 1 stalls the pipeline because cache must be refreshed from memory. The pipeline is stalled until CPU 0 can update memory and allow the memory fetch operation to proceed.

Memory bus contention. One not-so-obvious performance implication of snooping protocols is that they utilize the shared-memory bus heavily. Every time an instruction executing on one processor needs to fetch a new value from memory or update an existing one, it must place the designated memory address on the shared bus. The bus itself is a resource which must be shared. With more and more processors executing, the bus tends to get quite busy. When the bus is in use, other processors must wait. Utilization of the shared-memory bus is likely to be the most serious bottleneck impacting scalability in multiprocessor configurations of three, four, or more processing engines.

The measurement facility in the Intel P6 or Pentium Pro processors (including Pentium II and Pentium III processors) was strengthened to help hardware designers cope with the demands of more complicated multiprocessor designs. By installing the Pentium Counter support provided in the Windows NT 4.0 Resource Kit, system administrators and performance analysts can access these hardware measurements, as discussed last chapter. (This
facility does not work under Windows 2000 and was removed from the Windows 2000 Resource Kit.) While these Counters are given the cautionary rating of Wizard within Perfmon, we hope that the discussion above on multiprocessor design and performance will give you the confidence to start using them to help diagnose specific performance problems associated with large-scale Win2K multiprocessors. The P6 Counters provide valuable insight into multiprocessor performance, including direct measurement of the processor instruction rate, level 2 cache, TLB, branch prediction, and the all-important shared-memory bus.

The P6 measurements that can often shed the most light on multiprocessor performance are the shared-memory bus measurements. Appendix A lists the various P6 bus measurement counters, using the Microsoft counter names from Counters.hlp [5]. Many of the counter names and their unilluminating Explain text are very arcane and esoteric. For example, to understand what Bus DRDY asserted clocks/second means might send us scurrying in vain to the Intel Architecture manuals for help, where, unfortunately, not much help can be had. A second observation, triggered by experience viewing the counters under controlled conditions, is that some of them probably do not mean what they appear to. For example, the Bus LOCK asserted clocks/sec counter consistently appears to be zero on both uniprocessor and multiprocessor configurations. Not much help there. The shared-memory bus is driven at the processor clock rate, and some counter names use the term cycles and others use the term clocks. The two terms appear to be interchangeable. Although not explicitly indicated, some counters that mention neither clocks nor cycles are also measured in clocks. For example, an especially useful measure is Bus requests outstanding, which measures the total number of clocks the bus is busy.

Bus memory transactions and Bus all transactions measure the number of bus requests. One thing about the bus measurements is that they are not processor-specific, since the memory bus is a shared component. The memory bus that the processors share is a single resource, subject to the usual queuing delays. We will derive a measure of bus queuing delay in a moment.

Now, let's look at some more P6 measurement data from a multiprocessor system. A good place to start is with Bus all transactions/sec, which, as noted above, is the total number of bus requests. Figure 10 shows that when the bus is busy, it usually is busy due to memory accesses. Bus memory transactions/sec represent over 99% of all bus transactions. The measurement data is consistent with the discussion above suggesting that bus utilization is often the bottleneck in shared-memory multiprocessors that utilize snooping protocols to maintain cache coherence. Every time any processor attempts to access main memory, it must first gain access to the shared bus.
FIGURE 10. MEMORY ACCESSES DRIVE BUS UTILIZATION. MEMORY TRANSACTIONS REPRESENT OVER 99% OF ALL BUS TRANSACTIONS IN THIS EXAMPLE, WHICH IS TYPICAL OF BOTH UNIPROCESSORS AND MULTIPROCESSORS. THE SHARED BUS CAN EASILY BECOME A BOTTLENECK ON A MULTIPROCESSOR.

FIGURE 11. BUS SNOOP STALLED CYCLES/SEC PROVIDES A DIRECT MEASURE OF MULTIPROCESSOR SHARED-MEMORY CONTENTION. IN THIS EXAMPLE FROM A TWO-WAY MULTIPROCESSOR, THE NUMBER OF STALLS DUE TO SNOOPING IS RELATIVELY SMALL COMPARED TO ALL RESOURCE STALLS.
relevant because both actions drive bus utilization. The P6 Level 2 cache performance measurements are especially useful in this context for evaluating different processor configurations from Intel and other vendors
[Figure 13 chart: Bus requests outstanding/sec and Bus all transactions/sec (left axis, 0 to 10,000,000 clocks) plotted over time, together with the derived Clocks per bus transaction series (right axis, 0 to 50).]
FIGURE 13. CALCULATING THE AVERAGE CLOCKS PER BUS TRANSACTION FROM THE UNIPROCESSOR MEASUREMENTS SHOWN IN FIGURE 11 USING
EXCEL. THE NUMBER OF CLOCKS PER BUS TRANSACTION RANGES BETWEEN 10 AND 30, WITH AN AVERAGE OF ABOUT 18.
as the average clocks per transaction:

AVERAGE CLOCKS PER TRANSACTION = BUS REQUESTS OUTSTANDING ÷ BUS ALL TRANSACTIONS
Assuming contention for the shared-memory bus is a
factor, saturation of the bus on an n-way multiprocessor
will likely drive up bus transaction response time, mea-