0% found this document useful (0 votes)

3 views

Jouppi Improving Direct Mapped Cache Performance

This paper discusses techniques to enhance cache performance in processors, particularly through the use of miss caching and stream buffers. By introducing small fully-associative caches and prefetch buffers, the authors demonstrate significant reductions in cache miss rates, which can lead to improved overall system performance. The findings highlight the critical need for advanced memory hierarchy designs to mitigate the performance losses associated with cache misses in modern processors.

Uploaded by

Andrii Herasymenko

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views

Jouppi Improving Direct Mapped Cache Performance

Uploaded by

Andrii Herasymenko

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Improving Direct-Mapped Cache Performance by the Addition

of a Small Fully-Associative Cache and Prefetch Buffers

Norman P. Jouppi
Digital Equipment Corporation WesternResearchLab
100 Hamilton Ave., Palo Alto, CA 94301

Abstract dous increasesin miss cost. For example, a cache miss

on a VAX 1l/780 only costs 60% of the averageinstruc-
tion execution. Thus even if every instruction had a
Projectionsof computer technology forecastproces- cache miss, the machine performancewould slow down
sors with peak performanceof 1,000 MIPS in the rela- by only 60%! However, if a RISC machine like the
tively near future. These processorscould easily lose WRL Titan [lo] has a miss, the cost is almost ten in-
half or more of their performancein the memory hierar- struction times. Moreover, these trends seemto be con-
chy if the hierarchy design is based on conventional tinuing, especially the increasing ratio of memory access
caching techniques. This paper presentshardware tech- time to machine cycle time. In the future a cache miss
niques to improve the performanceof caches. all the way to main memory on a superscalarmachine
Miss caching places a small fully-associative cache executing two instructions per cycle could cost well over
between a cache and its refill uath. Misses in the cache 100 instruction times! Even with careful application of
that hit in the miss cache ha;e only a one cycle miss well-known cache design techniques, machines with
penalty, aso posedto a many cycle miss penalty without main memory latenciesof over 100 instruction times can
the mns catii e. Small miss cachesof 2 to 5 entries are easily lose over half of their potential performanceto the
shown to be very effective in removing mapping conflict memory hierarchy. This makes both hardware and
missesin first-level direct-mappedcaches. software research on advanced memory hierarchies in-
is an improvement to miss caching creasingly important.
-associativecache with the vic-
tim of a miss and not t e requestedline. Small victim Mechine cycles cycle mem miaa miaa
cachesof 1 to 5 entries are even more effective at remov- P-r time time coat co*t
ing conflict missesthan miss caching. in&r (ns) (na) (cycles) (in&r)
------------------------------------------------
Stream buffers refetch cache lines statting at a vAx11/760 10.0 200 1200 6 .6
cachemiss address. fh e prefetcheddata is placed in the WRL Titan 1.4 45 540 12 8.6
buffer and not in the cache. Streambuffers are useful in ? 0.5 4 280 70 140.0
removing capacity and compulsory cache misses,as well ------------------------------------------------
as some instruction cache conflict misses. Stream buf-
fers are more effective than previously investigated Table l-l: The increasing cost of cachemisses
next slower level in the
An extension to This paper investigatesnew hardwaretechniquesfor
increasing the performance of the memory hierarchy,
Section 2 describesa baseline design using conventional
prefetching along multiple intertwined data reference caching techniques. The large performance loss due to
sueaIns. the memory hierarchy is a detailed motivation for the
Together, victim caches and stream buffers reduce techniques discussed in the remainder of the paper.
the miss rate of the first level in the cachehierarchy by a Techniques for reducing missesdue to map ing conflicts
factor of two to three on a set of six large benchmarks. (i.e., lack of associativity) are presentedin s ection 3. An
extension to prefetch techniques called stream buffering
is evaluated in Section 4. Section 5 summarizes this
1. Introduction work and evaluatespromising directions for future work
Cacheperformanceis becoming increasingly impor-
tant since it has a dramatic effect on the performanceof
advanced processors. Table l-l lists some cache miss 2. Baseline Design
times and the effect of a miss on machine performance. Figure 2-l shows the range of configurations of in-
Over the last decade, cycle time has been decreasing terest in this study. The CPU, floating-point unit,
much fasterthan main memory accesstime. The average memory managementunit (e.g., TLB), and first level in-
number of machine cycles per instruction has also been struction and data cachesare on the same chip or on a
decreasing dramatically, especially when the transition single high-speed module built with an advancedpaek-
from CISC machines to RISC machines is included. aging technology. (We will refer to the central processor
Thesetwo effects are multiplicative and result in tremen- as a single chip in the remainder of the paper, but chip or

CH2887-8/90/0000/0384$01.00
(D1990lEEE

388
module is implied.) The cycle time off this chip is 3 to 8 of pipeline stagesin a second-levelcache accesscould be
times longer than the instruction issue rate (i.e., 3 to 8 2 or 3 depending on whether the pipestage going from
instructions can issue m one off-chip clock cycle). This the processor chip to the cache chips and the pipestage
is obtained either by having a very fast on-chip clock returning from the cachechips to the processorare full or
(e.g., superpipeiining [S]), by issuing many instructions half pipestages.
per cycle (e.g., superscalar or VLIW), and/or by using In order to provide sufficient memory for a proces-
higher speedtechnologies for the processorchip than for sor of this speed(e.g., several megabytesper MIP), main
the rest of the system(e.g., GaAs vs. BiCMOS). memory should be in the range of 512MB to 4GB. This
The expected size of the on-chip cachesvaries with means that even if 16Mb DRAMS are used that it will
the implementation technology for the processor, but contain roughly a thousandDRAMS. The main memory
higher-speedtechnologies generally result in smaller on- system probably will take about ten times longer for an
chip caches. For example, quite large on-chip caches accessthan the second-level cache. This accesstime is
should be feasible in CMOS but only small caches are easily dominated by the time required to fan out address
feasible in the near term for GaAs or bipolar processors. and data signals among a thousand DRAMS spread over
Thus, although GaAs and bipolar are faster, the higher many cards. Thus even with the advent of faster
miss rate from their smaller cachestends to decreasethe DRAMS, the access time for main memory may stay
actual system performance ratio between GaAs or roughly the same. The relatively large accesstime for
bipolar machinesand denseCMOS machinesto less than main memory in turn requires that second-level cache
the ratio betweentheir gate speeds. In all casesthe first- line sizes of 128 or 256B are needed. As a counter
level cachesare assumedto be direct-mapped, since this example, consider the case where only 16B are returned
results in the fastesteffective accesstime [7]. Line sizes after 320ns. This is a bus bandwidth of SOMB/sec.
in the on-chip cachesare most likely in the range of I6B Since a IO MIP processorwith this bus bandwidth would
to 32B. The data cache may be either write-through or be bus-bandwidth limited in copying from one memory
write-back, but this paper does not examine those location to another [l 11, little extra erformance would
tradeoffs. be obtained by the use of a 100 to 1,loo MlP processor.
This is an important consideration in the system perfor-
manceof a processor.
Instrucmn I*s”o r*te:
250-1000MlPS Several observations are in order on the baseline
(evey1-4nr) system. First, the memory hierarchy of the system is
actually quite similar to that of a machine like the VAX
1l/780 [3,4], only each level in the hierarchy has moved
one step closer to the CPU. For example, the 8KB
board-level cache in the 780 has moved on-chip. The
512KB to 16MB main memory on early VAX models
has become the board-level cache. Just as in the 780’s
%z: lime main memory, the incoming transfer size is large
WP-x. (128-256B here vs. 512B pagesin the VAX). The main
70-160X memory in this system is of similar size lo the disk sub-
systemsof the early 780’s and performs similar functions
such as paging and file systemcaching.
The actual parametersassumedfor our baseline sys-
tem are 1,000 MIPS peak instruction issue rate, separate
4KB first-level instruction and data caches with 16B
Iines, and a IMB second-level cache with I28B lines.
The miss penalties are assumedto be 24 instruction times
Fiyre 2-1: Baseline design for the first level and 320 instrucrion times for the second
The second-level cache is assumedto range from level. The characteristics of the test programs used in
512XB to 16M3. and to be built from very high speed this study are given in Table 2-l. These benchmarksare
static RAMS. it is assumedto be direct-mapped for the reasonably long in comparison with most traces in use
samereasonsas the first-level caches. For cachesof this today, however the effects of multiprocessing have not
size accesstimes of 16 to 30ns are likely. This yields an been modeled in this work The first-level cache miss
accesstime for the cache of 4 to 30 instruction times. rates of these programs running on the baseline system
The relative speed of the processor as compared to the configuration are given in Table 2-2.
accesstime of the cache Implies that the second-level program dynamic data tot81 program
cache must be ipelmed in order for it to provide suf- n8m8 in&r. rmf8. rmfs. typr
ficient bandwid%. For example, consider the case where ------------------------------------------------
the first-level cache is a write-through cache. Since ccom 31.5M 14.0&l 45.5M C co*ilar
storestypically occur at an averagerate of 1 in every 6 or g= 134.w 59.2M 193.411 PC board CAD
7 instructions, an unpipelined external cache would not Y8CC 51.OH 16.7H 67.71 Unix utility
have even enough bandwidth to handle the store traffic nut 99.41 50.3M 149.71 PC bo8rd CAD
for access times greater than seven instruction times. Unpack 144.8H 40.7~ 185.5&l 100x100 nwamric
Cacheshave been pipelined in mainframes for a number livar 23.6H 7.4M 31.OM Lm (n-rid
of years [12], but this is a recent development for ------_-------------------------------- ------m--
workstations. Recently cache chips with ECL I/O’s and tot81 484.51 186.3H 672.81
registers or latches on their inputs and outputs have ap-
peared; rheseare ideal for pipelined caches. The number Table 2-1: Test program characteristics

389
The effects of these miss rates are given graphically hierarchy at low cost are the subject of the remainder of
in Figure 2-2. The region below the solid line gives the this paper. Finally, in order to avoid compromising the
net performance of the system, while the region above performance of the CPU core (comprising of the CPU,
the solid line gives the performance lost in the memory FPU, MMU, and first level caches), any additional
hierarchy. For example, the difference between the top hardware required by the techniques to be investigated
dotted line and the bottom dotted line gives the petfor- should reside outside the CPU core (Le.. below the first
mance lost due to first-level data cache misses. As can level caches). By doing this the additional hardware will
be seenin Figure 2-2, most benchmarkslose over half of only be involved during cache misses,and therefore will
their potential performance in first level cache misses. not be in the critical path for normal instruction execu-
Only relatively small amounts of performance are lost to tion.
second-level cache misses. This is primarily due to the
large second-levelcachesize in comparison to the size of
the programs executed. Longer traces [2] of larger 3. Reducin Conflict Misses: Miss Caching and
programs exhibit significant numbers of second-level Victim 8 aching
cache misses. Since the test suite used in this paper is Misses in caches can be classified into four
too small for significant second-level cache activity, categories: conflict, compulsory, capacity [7], and
second-level cache misses will not be investigated in coherence. Conflict misses are misses that would not
detail, but will be left to future work. occur if the cache was fully-associative and had LRU
program baseline miss rate
replacement. Compulsory misses are missesrequired in
name in&r. data
any cache organization because the are the first
-------- .---------------------- .-e--w- references to an instruction or piece 0r data. Capacity
ccom 0.096 0.120 missesoccur when the cache size is not sufficient to hold
0.061 0.062 data between references. Coherence misses are misses
g==
yacc 0.028 0.040 that occur as a result of invalidation to preserve mul-
met 0.017 0.039 tiprocessor cacheconsistency.
linpack 0.000 0.144 Even though direct-mapped caches have more con-
liver 0.000 0.273 flict missesdue to their lack of associativity, their perfor-
------------------------------------ mance is still better than set-associativecacheswhen the
Table 2-2: Baseline systemfist-level cachemiss rates access time costs for hits are considered. In fact, the
direct-mapped cache is the only cache configuration
where the critical path is merely the time required to
accessa RAM [9]. Conflict missestypically account for
between 20% and 40% of all direct-mapped cache
misses[7]. Figure 3-1 details the percentageof misses
due to conflicts for our test suite. On average39% of the
first-level data cache misses are due to conflicts, and
29% of the first-level instruction cache missesare due to
conflicts. Since these are significant percentages, it
would be nice to “have our cake and eat it too” by some-
how providing additional associativity without adding to
the critical accesspath for a direct-mappedcache.

100 ccom grr Y- met linpsck liv

0
1 2 3Bcnchmarli+ 5

Figure 2-2: Baseline design performance

Since the exact parametersassumed are at the ex-
treme end of the ranges described (maximum perfor-
mance processorwith minimum size caches),other con-
figurations would lose proportionally less performancein
their memory hierarchy. Nevertheless,any configuration
in the range of interest will lose a substantial proportion
of its potential performance in the memory hierarchy.
This meansthat the greatest leverage on system perfor-
mance will be obtained by improving the memory hierar-
chy performance, and not by attempting to further in-
crease the performance of the CPU (e.g., by more ag- Figure 3-1: Conflict misses,4KB I and D, 16B lines
gressive parallel issuing of instructions). Techniques for
improving the performance of the baseline memory

390
3.1. Miss Caching would remove all of the conflict misses. Obviously this
We can add associativity to a direct-mapped cache is another extreme of performance and the results in
by placing a small miss cache on-chip between a first- Figure 3-3 show a range of performance based on the
level cache and the accessport to the second-level cache program involved. Nevertheless,for 4KB data cachesa
(Figure 3-2). A miss cache is a small fully-associative miss cache of only 2 entries can remove 25% percent of
cache containing on the order of two to five cache lines the data cacheconflict misseson average,’ or 13% of the
of data. When a miss occurs, data is returned not only to data cache missesoverall. If the miss cache is increased
the direct-mapped cache, but aiso to the miss cache un- to 4 entries, 36% percent of the conflict misses can be
der it, where it replaces the least recently used item. removed, or 18% of the data cache missesoverall. After
Each time the upper cache is probed, the miss cache is four entries the improvement from additional miss cache
probed as well. If a miss occurs in the up er cache but entries is minor, only increasing to a 25% overall reduc-
the addresshits in the miss cache,then the 8irect-mapped tion in data cache missesif 15 entries are provided.
cache can be reloaded in the next cycle from the miss
cache. This replaces a long off-chip miss penalty with a
short one-cycle on-chip miss. This arrangementsatisfies Key- - Ll I-cache
the requirement that the critical path is not worsened, - LlDeac
since the miss cache itself is not in the normal critical
path of processorexecution.

Oimd-mapped
cache

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Numk of entries in miss cache
Figure 3-3: Conflict missesremoved by miss caching
Since doubling the data cache size results in a 32%
reduction in misses (over this set of benchmarks when
Figure 3-2: Miss cacheorganization increasing data cache size from 4K to 8K), each ad-
ditional line in the fist level cache reduces the number
The successof different miss cache organizations at of misses by approximately 0.13%. AIthough the miss
removing conflict misses is shown in Figure 3-3. The cache requires more area per bit of storage than lines in
fvst observation to be made is that many more data con- the data cache, each line in a two line miss cache effects
flict missesare removed by the miss cache than instruc- a 50 times larger marginal improvement in the miss rate,
tion conflict misses. This can be explained as follows. so this should more than cover any differences in layout
instruction confIicts tend to be widely spaced because Size.
the instructions within one procedure will not conflict Comparing Figure 3-3 and Figure 3-1, we see that
with each other as long as the procedure size is less than the higher the percentageof misses due to conflicts, the
the cache size, which is almost always the case. Instruc- more effective the miss cache is at eliminating them. For
tion conflict misses are most likely when another proce- example, in Figure 3-1 met has by far the highest ratio of
dure is called. The target procedure may ma anywhere
with respect to the calling procedure, possibYy resulting confict missesto total data cache misses. Similarly, grr
in a large overlap. Assuming at least 60 different in- and yacc also have greater than average ercentagesof
stntctions are executed in each procedure, the conflict conflict misses,and the miss cache helps tRese programs
misses would span more than the 15 lines in the max- significantly as well. firzpack and ccom have the lowest
imum size miss cache tested. In other words, a small
miss cache could not contain the entire overlap and so
would be reloaded repeatedly before it could be used.
This type of reference pattern exhibits the worst miss ‘Throughout thispaperlhcaverage reduction in miss rates is used as
cacheperformance. a metric. This is computed by calculating the percent reduction in miss
Data conflicts, on the other hand, can be quite rate for each benchmark. and then taking the average of these per-
closely spaced. Consider the case where two character centages. This has the advantage that it is independent of the number
strings are being compared. If the points of comparison of memory references madeby each program. Furthermore, if two
of the two strings happen to map to the same line, alter- programs have widely different miss rates. the average percent reduc-
nating referencesto different strings will always miss in tion in miss rate gives equal weighting to each benchmark. This is in
the cache. In this case a miss cache of only two entries contrast with the percent reduction in average miss rate, which weights
the program with the highest miss rate most heavily.

391
percentage of conflict misses, and the miss cache flicting lines between the procedure and loop body were
removes the lowest percentageof conflict misses from larger than the miss cache,the miss cache would be of no
these programs. This results from the fact that if a value since missesat the beginning of the loop would be
program has a large percentageof data contlict misses flushed out by later misses before execution returned to
then they must be clustered to some extent becauseof the beginning of the loop. If a victim cache is used
their overall density. This does not prevent programs instead, however, the number of conflicts in the loo that
with a small number of conflict missessuch as liver from can be captured is doubled comparedto that storeB by a
benefiting from a miss cache, but it seemsthat as the miss cache. This is becauseone set of conflicting in-
percentageof conflict missesincreases,the percentageof structions lives in the direct-mapped cache, while the
these missesremovable by a miss cache increases. other lives in the victim cache. As execution proceeds
around the loop and through the procedure call these
items trade places.
3.2. Victim Caching
Consider a system with a direct-mappedcache and a The percentageof conflict misses removed by vic-
miss cache. When a miss occurs, data is loaded into both tim caching is given in Figure 3-5. Note that victim
the miss cache and the direct-mapped cache. In a sense, caches consisting of just one line are useful, in contrast
this duplication of data wastesstorage spacein the miss to miss caches which must have two lines to be useful.
cache. The number of duplicate items in the miss cache All of the benchmarks have improved performance in
can range from one (in the case where all items in the comparison to miss caches,but instruction cache perfor-
miss cache map to the same line in the direct-mapped mance and the data cache performance of benchmarks
cache) to all of the entries (in the case where a series of that have conflicting long sequential reference streams
missesoccur which do not hit in the miss cache). (e.g., ccom and linpack) improve the most.
To make better use of the miss cache we can use a
different replacement algorithm for the small fully-
associative cache [5]. Instead of loading the requested
data into the miss cache on a miss, we can load the
fully-associative cache with the victim line from the
direct-mappedcacheinstead. We call this victim caching
(see Figure 3-4). With victim caching, no data line ap-
pears both in the direct-map ed cache and the victim
cache. This follows from the Pact that the victim cacheis
loaded only with items thrown out from the direct-
mapped cache. In the case of a miss in the direct-
mapped cache that hits in the victim cache, the contents
of the direct-mappedcache line and the matching victim
cache line are swapped.

A-444

0 12 3 4 5 6 7 8 9 1011
Number of entries in victim cache
Figure 3-5: Conflict missesremoved by victim caching

3.3. The Effect of Direct-Mapped Cache Size on

Victim Cache Performance
Figure 3-6 shows the performanceof 1.2.4, and 15
entry victim cacheswhen backing up direct-mappeddata
caches of varying sizes. In general smaller direct-
mapped caches benefit the most from the addition of a
victim cache. Also shown for reference is the total per-
centageof conflict missesfor each cache size. There are
two factors to victim cache performance versus direct-
mapped cache size. First, as the direct-mapped cache
Figure 3-4: Victim cacheorganization increases in size, the relative size of the victim cache
Depending on the reference stream, victim caching becomes smaller. Since the direct-mapped cache gets
can either be a small or significant improvement over larger but keeps the sameline size (16B). the likelihood
of a tight mapping conflict which would be easily
miss caching. The magnitude of this benefit dependson removed by victim caching is reduced. Second,the r-
the amount of duplication in the miss cache. Victim centage of conflict misses decreasesslightly from r KB
caching is always an improvement over miss caching. to 32KB. As we have seenpreviously, as the percentage
As an example, consider an instruction reference of conflict misses decreases,the percentage of these
stream that calls a small procedure in its inner loop that missesremoved by the victim cache decreases.The first
conflicts with the loop body. If the total number of con- effect dominates, however, since as the percentage of

392
conflict misses increaseswith very large caches (as in number of entries is cut in half when the line size
[7]), the victim cache performance only improves doubles) the performance of the victim cache still im-
slightly. proves or at least breakseven when line sizesincrease.

1-7 3.5. Victim Caches and Second-Level Caches

Key: --- 1 entry victim crh
90 - 2emyvmimcxhe
As the size of a cache increases,a larger percentage
- - 4enuyvictimcach of its misses are due to conflict and compulsory misses
a0 - - - - -. 15cnuyvlctimcacbc and fewer are due to capacity misses. (Unless of course
--_- pacent conflict
mirses the cache is larger than the entire program, in which case
70 Q LI D-cab ordy compulsory misses remain.) Thus victim caches
might be expectedto be useful for second-levelcachesas
well. Since the number of conflict missesincreaseswith
increasing line sizes, the large line sizes of second-level
caches wodd also tend to increase the potential useful-
nessof victim caches.
One interesting aspectof victim caches is that they
violate inclusion properties [1] in cache hierarchies.
However, the line size of the second level cache in the
baseline design is 8 to 16 times larger than the first-level
cacheline sizes,so this violates inclusion as well.
Note that a first-level victim cachecan contain many
lines that conflict not only at the Fit level but also at the
“1 2 4 a 32 64 128 second level. Thus, using a first-level victim cache can
cads size ito6lcEl
also reduce the number of conflict missesat the second
Figure 3-6: Victim cache:vary direct-map cachesize level. In investigating victim caches for second-level
caches, both configurations with and without first-level
victim cacheswill needto be considered.
3.4. The Effect of Line Size on Victim Cache
Performance A thorough investigation of victim caches for
Figure 3-7 shows the performanceof victim caches megabytesecond-level cachesrequires traces of billions
for 4KB direct-mappeddata cachesof varying line sizes. of instructions. At this time we only have victim cache
As one would expect, as the line size at this level in- performancefor our smaller test suite, and work on ob-
creases,the number of conflict missesalso increases. taining victim cache performance for multi-megabyte
second-levelcachesis underway.

4. Reducing Capacity and Compulsory Misses

Compulsory missesare missesrequired in any cache
organizatton becausethey are the first references to a
piece of data. Capacity missesoccur when the cache size
Q Ll D-cdm IS not sufficient to hold data between references. One
way of reducing the number of capacity and compulsory
60 missesis to use prefetch techniquessuch as longer cache
line sizes or prefetching methods[13.6]. However, line
50 sizescan not be madearbitrarily large without increasing
the miss rate and greatly increasing the amount of data to
40 be transferred. In this section we investigate techniques
to reduce capacity and compulsory misseswhile mitigat-
30
ing traditional problems with long lines and excessive
20 prefetching.
A detailed analysis of three prefetch algorithms has
10 a ared in 1131. P&etch always prefetchesafter every
1 I erence. Needlessto sa this is impractical in our base
resp”
O4 a 16 32 64 128 256 system since many lever-one cache accessescan take
C&m Line Size in Bytes place in the time required to initiate a single level-two
cache reference. This is especially true in machinesthat
Figure 3-7: Victim cache:vary datacacheline size fetch multiple instructions per cycle from an instruction
The increasing percentageof conflict missesresults cache and can concurrently perform a load or store per
in an increasing ~mentage of these misses being cycle to a data cache. Prefetch on miss and tagged
removed by the vtctim cache. Systems with victim prefifch are more promising techniques. On a miss
cachescan benefit from longer line sizes more than sys- prcferch on miss always fetches the next line as well. It
tems without victim caches,since the victim cacheshelp can cut the number of misses for a purely sequential
remove misses caused by conflicts that result from reference stream in half. Tagged preferch can do even
longer cache lines. Note that even if the area used for better. In this technique each block has a tag bit as-
data storagein the victim cache is held constant (i.e., the sociatedwith it. When a block is prefetched,its tag bit is
set to zero. Each time a block is used its tag bit is set to

393
one. When a block undergoesa zero to one transition its skipping any lines. In this simple model non-sequential
successorblock is prefetched. This can reduce the num- line misses will cause a stream buffer to be flushed and
ber of missesin a purely sequential reference stream to restartedat the mtss addresseven if the requestedline is
zero, if fetching is fast enough. Unfortunately the large already presentfurther down in the queue.
latencies in the base system can make this impossible. When a line is moved from a stream buffer to the
Consider Figure 4-1, which gives the amount of time (in cache,the entries in the streambuffer can shift up by one
instruction issues) until a prefetched line is required and a new successiveaddressis fetched. The pipelined
during the execution of ccom. Not s risingly, since the interface to the secondlevel allows the buffer to be filled
“K ed lines must be
line size is four instructions, prefetc at the maximum bandwidth of the second level cache,
received within four instruction-times to keep up with and many cache lines can be in the process of being
the machine on uncachedstraight-line code. Becausethe fetched simultaneously. For example, assume the
basesystemsecond-levelcache takes many cycles to ac- latency to refill a 16B line on a instruction cachemiss is
cess, and the machine may actually issue many instruc- 12 cycles. Consider a memory interface that is pipelined
tions per cycle, tagged prefetch may only have a one- and can accept a new line request every 4 cycles. A
cycle-out-of-many head start on providing the required four-entry streambuffer can provide 4B instructions at a
instructions. rate of one per cycle by havmg three requestsoutstand-
ccom I-cache prcfctch, 16B lines ing at all times. Thus during sequentialinstruction execu-
l- tion long latency cache misseswill not occur. This is in
contrast to the performanceof tagged prefetch on purely
Key: sequentialreferencestreamswhere only one line is being
prefetched at a time. In that case sequential instructions
l- \ - prcfetch on miss will only be supplied at a bandwidth e ual to one instruc-
3 tion every three cycles (i.e., 12 cycle 4atency / 4 instntc-
:\ ------ taggcdpnfach
tions per line).
:\
I ‘$1 -- - - prefetch always
From omceowr To ~roauot

I
I

Dkod-mappod
o&m
I

L
I-
2 4 6 8 10 12 14 16 18 20 22 24 :
Instructions until prcfctch returns
Figure 4-1: Limited time for prefetch

4.1. Stream Buffers

What we really need to do is to start the prefetch
before a tag transition can take place. We can do this
with a mechanismcalled a streum buffer (Figure 4-2). A
streambuffer consistsof a seriesof entries, eachconsist- Figure 4-2: Sequential streambuffer design
ing of a tag, an available bit. and a data line.
When a miss occurs, the stream buffer begins Figure 4-3 shows the performance of a four-entry
prefetching successivelines starting at the miss target. instruction stream buffer backing a 4KB instruction
As each prefetch request is sent out, the tag for the ad- cache and a data stream buffer backing a 4KB data
dress is enteredinto the stream buffer, and the available cache,each with 16B lines. The graph gives the cumula-
bit is set to false. When the prefetch data returns it is tive number of missesremoved basedon the number of
placed in the entry with its tag and the available bit is set lines that the buffer is allowed to prefetch after the
to true. Note that lines after the line requested on the original miss. (In practice the streambuffer would prob-
miss are placed in the buffer and not in the cache. This ably be allowed to fetch until the end of a vtrtual
avoids polluting the cache with data that may never be memory page or a second-level cache line. The major
needed reason for lotting stream buffer performanceas a func-
tion of pre4etch length is to get a better idea of how far
Subsequentaccessesto the cache also comparetheir streams continue on average.) Most instruction
addressagainst the first item stored in the buffer. If a referencesbreak the purely sequential access attem by
reference missesin the cache but hits in the buffer the the time the 6th successiveline is fetched, w&‘le many
cache can be reloaded in a single cycle from the stream data reference patternsend even sooner. The exceptions
buffer. This is much faster than the off-chip miss to this appear to be instruction referencesfor liver and
penalty. The stream buffers considered in this section data referencesfor linpuck liver is probably an anomaly
are simple FIFO queues, where only the head of the since the 14 loops of the program are executed squen-
queue has a tag comparator and elements removed from tially, and the first 14 loops do not generally call other
the buffer must be removed strictly in sequencewithout procedures or do excessive branching, which would

394
cause the sequential miss pattern to break. The data experience the greatest improvement (it changes from
reference pattern of /inpack can be understood as fol- 7% to 60% reduction), ah of the programs benefit to
lows. Rememberthat the stream buffer is only respon- someextent.
sible for providing lines that the cache misseson. The
inner loop of finpuck (i.e., saxpy) performs an inner
roduct betweenone row and the other rows of a matrix.
hl e first use of the one row loads it into the cache. After
that subsequentmissesin the cache (except for mapping
conflicts with the first row) consist of subsequentlines of
the matrix. Since the matrix is too large to fit in the
onchip cache, the whole matrix is passedthrough the
cacheon eachiteration. The streambuffer can do this at
the maximum bandwidth provided by the second-level
cache. Of course one prerequisite for this is that the
reference stream is unit-stride or at most skips to every
other or every third word. If an array is accessedin the
non-unit-stride direction (and the other dimensions have
non-trivial extents) then a stream buffer as presented
here will be of little benefit.
loo
IKay? - LlI-cache
- LIDach
OCWUI

To “al 1011~ each.

Figure 4-4: Four-way streambuffer design

- Ll D-x&

“0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 IS 16
Lengthofstrcamnm
Figure 4-3: Sequentialstreambuffer performance

4.2. Multi-Way Stream Buffers

Overall, the stream buffer presentedin the previous
section could remove 72% of the instruction cache
misses,but it could only remove 25% of the data cache
misses. One reasonfor this is that data referencestend to
consist of interleaved streams of data from different 0 12 3 4 5 6 7 8 9 10111213141516
sources. In order to improve the performanceof stream Length of Cway smeamNn
buffers for data references, a multi-way stream buffer Figure 4-5: Four-way streambuffer performance
was simuiated (Figure 44). It consists of four stream
buffers in parallel. When a miss occurs in the data cache
that does not hit in any stream buffer, the stream buffer 43. Stream Buffer Performance vs. Cache Size
hit least recently is cleared (i.e., LRU replacement)and it Figure 4-6 gives the erformance of single and 4-
is startedfetching at the miss address. way streambuffers with 1f!B lines as a function of cache
Figure 4-5 shows the performanceof the multi-way size. The instruction stream buffers have remarkably
stream buffer on our benchmark set. As expected, the constant performance over a wide range of cache sizes.
performance on the instruction stream remains virtually The data stream buffer performance generally improves
unchanged. This means that the simpler single stream as the cache size increases. This is especially true for the
buffer wiIl suffice for instruction streams. The multi- single stream buffer, whose performance increasesfrom
way streambuffer does signticantl improve the erfor- a 15% reduction in missesfor a data cache size of 1KB
manceon the data side, however. d verall, the muP* n-way to a 35% reduction in misses for a data cache size of
stream buffer can remove 43% of the missesfor the six 128KB. This is probably becauseas the cache size in-
programs, almost twice the performance of the single creases, it can contain data for reference patterns that
stream buffer. Although the matrix operations of liver accessseveral sets of data, or at least all but one of the

395
sets. What missesthat remain are more likely to consist The instruction streambuffers perform well even out
of very long single sequential streams. For example, as to 128B line sizes. Both the 4-way and the single stream
the cache size increases the percentageof compulsory buffer still remove at least 40% of the misses at 128B
missesincrease, and these are more likely to be sequen- line sizes,coming down from an 80% reduction with 8B
tial in nature than data conflict or capacity misses. lines. This is probably due to the large granularity of
conflicting instruction reference streams, and the fact
1GQr 1 that many proceduresare more than 128B long.
90 I Key:
OLIIssfhe
Cl Ll D-each
---
-
rmgle scquauid stream buffer
‘l-way sequenti stream buffer
5. Conclusions
a0 Small miss caches (e.g., 2 to 5 entries) have been
B
shown to be effective in reducing data cache conflict
misses for direct-mapped caches in range of 1K to 8K
bytes. They effectively remove tight conflicts where
misses alternate between several addressesthat map to
the sameline in the cache. Miss cachesare increasingly
beneficial as Iine sizes increase and the percentage of
conflict misses increases. In general it appearsthat as
the percentageof conflict missesincreases,the percentof
these missesremovable by a miss cache also mcreases,
resulting in an even steeper slope for the performance
improvement possible by using miss caches.
Victim cachesare an improvement to miss caching
that saves the victim of the cache miss instead of the
“1 2 4 a 16 32 64 128
target in a small associative cache. Victim caches are
CachdizeiaKB even more effective at removing conflict misses than
miss caches.
Figure 4-6: Streambuffer performancevs. cachesize
Stream buffers prefetch cache lines after a missed
cache line. They store the line until it is requestedby a
4.4. Stream Buffer Performance vs. Line Size cachemiss (if ever) to avoid unnecessarypollution of the
Figure 4-7 gives the performance of single and 4- cache. They are particularly useful at reducing the num-
way stream buffers as a function of the line size in the ber of capacity and compulsory misses. They can take
stream buffer and 4KB cache. The reduction in misses full advantage of the memory bandwidth available in
provided by a single data streambuffer falls by a factor ipelined memory systemsfor sequential references,un-
of 6.8 going from a line size of 8B to a line size of 128B, Pilce previously discussed prefetch techniques such as
while a Cway stream buffer’s contribution falls by a tagged prefetch or prefetch on miss. Streambuffers can
factor of 4.5. This is not too surprising since data also tolerate longer memory system latencies since they
referencesare often fairly widely distributed. In other prefetch data much in advance of other prefetch tech-
words if a piece of data is accessed,the odds that another niques (even prefetch always). Stream buffers can also
iece of data 128B away will be neededsoon are fairly compensatefor instruction conflict misses, since these
Pow. The single data stream buffer performanceis es - tend to be relatively sequential in nature as well.
cially hard hit comparedto the multi-way stream buF er Multi-wa streambuffers are a set of streambuffers
becauseof the increase in conflict misses at large line that can prefyetch down several streams concurrently.
Sizts. Multi-way stream buffers are useful for data references
that contain interleaved accessesto several different
Kay: - - single requcntid rueam buffer large data structures, such as in array operations.
- 4-way SqLwntial ItraM tuffu However, since the prefetching is of sequential lines,
0 Ll I-c&The only unit stride or near unit stride (2 or 3) accesspatterns
Cl Ll D-x&c benefit.
1 8o
The performance improvements due to victim
[ 70 caches and due to stream buffers are relatively or-
thogonal for data references. Victim cacheswork well
.I 60 where references alternate between two locations that
map to the same line in the cache. They do not prefetch
i 5o
data but only do a better job of keeping data fetched
31b40
available for use. Stream buffers, however, achieve per-
c formance improvements by prefetching data. They do
g 30 not remove conflict misses unIess the conflicts are
widely spaced in time, and the cache miss reference
d: 20 stream consists of many sequential accesses.These are
10
precisely the conflict missesnot handled well by a victim
I
cachedue to its relatively small capacity. Over the set of
six benchmarks, on average only 2.5% of 4KB direct-
“4 a 16 32 64 128 256 mappeddata cache missesthat hit in a four-entry victim
Cache Line Size. in Bytes cache also hit in a four-way streambuffer for ccom;met,
Figure 4-7: Streambuffer performancevs. line size yucc, grr, and liver. In contrast, linpuck, due to its se-

396
quential data accesspatterns, has 50% of the hits in the References
victim cache also hit in a four-way stream buffer.
However only 4% of linpack’s cache misses hit in the 1. Baer, Jean-Loup,and Wang, Wenn-Harm. On the In-
victim cache (it benefits least from victim caching clusion Propertiesfor Multi-Level CacheHierarchies.
among the six benchmarks), so this is still not a sig- The 15th Annual Symposium on ComputerArchitecture,
nificant amount of overlap between stream buffers and IEEE Computer Society Press,June, 1988,pp. 73-80.
victim caching.
Figure 5-l shows the erformance of the base sys- 2. Borg, Anita, Kessler,Rick E.. Lazana,Georgia, and
tem with the addition of a Pour entty data victim cache,a Wall, David W. Long AddressTraces from RISC
instruction stream buffer, and a four-way data stream Machines: Generation and Analysis. Tech. Rept. 89/14,
buffer. (The base system has on-chip 4KB instruction Digital Equipment Corporation WesternResearch
and 4KB data caches with 24 cycle miss penalties and Laboratory, September,1989.
16B lines to a three-stagepipelined second-level 1MB
cache with 128B lines and 320 cycle miss penalty.) The 3. Digital Equipment Corporation, Inc. VAX Hardware
lower solid line in Figure 5-1 gives the performance of Handbook, volume I - 1984. Maynard, Massachusetts,
the original base system without the victim caches or 1984.
buffers while the upper solid line gives the performance
with buffers and victim caches. The combination of 4. Emer, Joel S., and Clark, Douglas W. A Charac-
these techniquesreducesthe first-level miss rate to less terization of ProcessorPerformancein the VAX-l l/780.
than half of that of the baseline system, resulting in an The 1lth Annual Symposiumon ComputerArchitecture,
average of 143% improvement in system performance
for the six benchmarks. These results show that the ad- JEEEComputer Society Press,June, 1984,pp. 301-310.
dition of a small amount of hardware can dramatically
reduce cache miss rates and improve system perfor- 5. Eustace,Alan. Private communication.
mance. 6. Farrens,Matthew K., and Pleszkun,Andrew R. Im-
proving Performanceof Small On-Chip Instruction
Caches. The 16th Annual Symposium on Computer
Architecture, IEEE Computer Society Press,May, 1989,
pp. 234-241.
7. Hill, Mark D. Aspects of Cache Memory and Insrruc-
tion Buffer Performonce. Ph.D. Th., University of Cali-
fornia, Berkeley, 1987.
8. Jouppi, Norman P., and Wall, David W. Available
Instruction-Level Parallelism For Superpipelinedand Su-
perscalarMachines. Third International Conferenceon
Architectural Support for Programming Languagesand
Operating Systems,IEEE Computer Society Press,April,
1989, pp. 272-282.
9. Jouppi, Norman P. Architectural and Organizational
v- Tradeoffs in the Design of the MultiTitan CPU. The
1 2 3 5 6 16th Annual Symposiumon Computer Architecture,
lkachm2
IEEE Computer Society Press,May, 1989,pp. 281-289.
Figure S-1: Improved systemperformance
10. Nielsen, Michael J. K. Titan SystemManual. Tech.
This study has concentrated on ap lying victim Rept. 86/l, Digital Equipment Corporation Western
cachesand streambuffers to fist-level catf es. An inter- ResearchLaboratory, September,1986.
esting area for future work is the ap lication of these
techniques to second-level caches. wlso, the numeric 11. Ousterhout,John. Why Aren’t Operating Systems
programs used in this study used unit stride accesspat- Getting FasterAs FastAs Hardware? Tech. Rept. Tech-
terns. Numeric programswith non-unit stride and muted
stride accesspatternsalso need to be simulated. Finally, note 11, Digital Equipment Corporation Western
the tformance of victim caching and stream buffers ResearchLaboratory, October, 1989.
neecr to be investigated for o rating system execution 12. Smith, Alan J. “Sequential program prefetching in
and for multiprogramming worE oads.
memory hierarchies.” IEEE Computer I I, 12
(December 1978). 7-21.
Acknowledgements
Mary Jo Doherty, John Ousterhout. Jeremy Dion, 13. Smith, Alan J. “Cache Memories.” Computing
Anita Borg, Richard Swan, and the anonymous referees Surveys (September1982).473-530.
pmvided many helpful comments on an early. draft of
this paper. Alan Eustacesuggestedvictim cachmg as an
improvementto miss caching.

397

Memory Hierarchy Design-Aca
No ratings yet
Memory Hierarchy Design-Aca
15 pages
COMP 740: Computer Architecture and Implementation: Montek Singh
No ratings yet
COMP 740: Computer Architecture and Implementation: Montek Singh
41 pages
Cache_optimizations
No ratings yet
Cache_optimizations
29 pages
UNIT2 Cahe-Opt
No ratings yet
UNIT2 Cahe-Opt
134 pages
Memory Hierarchy - Ways To Reduce Misses: DAP Spr. 98 ©UCB 1
No ratings yet
Memory Hierarchy - Ways To Reduce Misses: DAP Spr. 98 ©UCB 1
23 pages
Ec6009 Advanced Computer Architecture Unit V Memory and I/O: Cache Performance
No ratings yet
Ec6009 Advanced Computer Architecture Unit V Memory and I/O: Cache Performance
16 pages
ACA Unit-5
No ratings yet
ACA Unit-5
54 pages
Computer Architecture - Quantitative Approach
No ratings yet
Computer Architecture - Quantitative Approach
7 pages
Cache
No ratings yet
Cache
34 pages
Memory Hierarchy Design
No ratings yet
Memory Hierarchy Design
115 pages
ch2 Appb
No ratings yet
ch2 Appb
58 pages
Chapter 2 Neede For Guide Line Help From Smiw
No ratings yet
Chapter 2 Neede For Guide Line Help From Smiw
7 pages
UNIT-IV Memory and I/O
No ratings yet
UNIT-IV Memory and I/O
36 pages
CHAPTER 2 Memory Hierarchy Design & APPENDIX B. Review of Memory Heriarchy
No ratings yet
CHAPTER 2 Memory Hierarchy Design & APPENDIX B. Review of Memory Heriarchy
73 pages
Cache Optimizations
No ratings yet
Cache Optimizations
23 pages
Chapter 2 Adv 2007 PPTV 4
No ratings yet
Chapter 2 Adv 2007 PPTV 4
54 pages
L07-MemoryII
No ratings yet
L07-MemoryII
27 pages
Crictical Word First For Cache Misses
No ratings yet
Crictical Word First For Cache Misses
21 pages
05) Cache Memory Introduction
No ratings yet
05) Cache Memory Introduction
20 pages
10-cacheperf
No ratings yet
10-cacheperf
24 pages
Cache Miss Penalty Reduction: #1 - Multilevel Caches
No ratings yet
Cache Miss Penalty Reduction: #1 - Multilevel Caches
8 pages
Cache 1 54
No ratings yet
Cache 1 54
54 pages
Ca Sol PDF
No ratings yet
Ca Sol PDF
8 pages
Advance Computer Architecture Homework 2 Solution
No ratings yet
Advance Computer Architecture Homework 2 Solution
8 pages
MTP 01 Final J.raghunat b15216
No ratings yet
MTP 01 Final J.raghunat b15216
10 pages
AC14L08 Memory Hierarchy
No ratings yet
AC14L08 Memory Hierarchy
20 pages
Memory Hierarchy 4.0
No ratings yet
Memory Hierarchy 4.0
50 pages
Lecture 5 Cache Optimization
No ratings yet
Lecture 5 Cache Optimization
25 pages
CA_Lecture_08
No ratings yet
CA_Lecture_08
38 pages
Chapter 3 Cache
No ratings yet
Chapter 3 Cache
38 pages
IT3030E-CA-Chap6-Memory
No ratings yet
IT3030E-CA-Chap6-Memory
69 pages
Lecture16 PDF
No ratings yet
Lecture16 PDF
4 pages
Unit-4 (2)
No ratings yet
Unit-4 (2)
72 pages
Memory Cache
No ratings yet
Memory Cache
18 pages
Lecture: Cache Hierarchies: Topics: Cache Innovations (Sections B.1-B.3, 2.1)
No ratings yet
Lecture: Cache Hierarchies: Topics: Cache Innovations (Sections B.1-B.3, 2.1)
20 pages
UNIT IV.ppt
No ratings yet
UNIT IV.ppt
61 pages
Chapter5 PDF
No ratings yet
Chapter5 PDF
95 pages
Advanced Computer Architecture-06CS81-Memory Hierarchy Design
No ratings yet
Advanced Computer Architecture-06CS81-Memory Hierarchy Design
18 pages
Cache Misses
No ratings yet
Cache Misses
8 pages
Advanced Computer Architecture: BY Dr. Radwa M. Tawfeek
No ratings yet
Advanced Computer Architecture: BY Dr. Radwa M. Tawfeek
32 pages
Improving Cache Performance:: Average Memory Access Time Amat T + Miss Rate X Miss Penalty
No ratings yet
Improving Cache Performance:: Average Memory Access Time Amat T + Miss Rate X Miss Penalty
16 pages
Lect12 Cache
No ratings yet
Lect12 Cache
39 pages
Week 13 - Lecture 13 - Memory (cont)
No ratings yet
Week 13 - Lecture 13 - Memory (cont)
31 pages
Question: Who Cares About The Memory Hierarchy?: Caches and Memory Systems I
No ratings yet
Question: Who Cares About The Memory Hierarchy?: Caches and Memory Systems I
13 pages
Lecture 12: Cache Innovations
No ratings yet
Lecture 12: Cache Innovations
17 pages
Memory Hierarchies (Part 2) Review: The Memory Hierarchy
No ratings yet
Memory Hierarchies (Part 2) Review: The Memory Hierarchy
7 pages
Cache Performance Improving Cache Performance
No ratings yet
Cache Performance Improving Cache Performance
6 pages
Reconfigurable Cache Architecture: Major Technical Project On
No ratings yet
Reconfigurable Cache Architecture: Major Technical Project On
9 pages
25 e 50 Beb 5 Aad 8 F 60
No ratings yet
25 e 50 Beb 5 Aad 8 F 60
49 pages
Memory Hierarchy: Two Principles
No ratings yet
Memory Hierarchy: Two Principles
68 pages
l08 Caches 2
No ratings yet
l08 Caches 2
39 pages
LPIC-3 Exam 306-300 Mastery: 500 Practice Questions on High Availability & Storage Clusters
From Everand
LPIC-3 Exam 306-300 Mastery: 500 Practice Questions on High Availability & Storage Clusters
Steve Brown
No ratings yet
Cache Org
No ratings yet
Cache Org
19 pages
Computer Organization and Architecture
No ratings yet
Computer Organization and Architecture
12 pages
Cache Performance Average Memory Access Time
No ratings yet
Cache Performance Average Memory Access Time
23 pages
Improving and Measuring Cache Performance
No ratings yet
Improving and Measuring Cache Performance
8 pages
Computer Architecture
No ratings yet
Computer Architecture
5 pages
Topics: Cache Innovations (Sections 2.4, B.4, B.5), Virtual Memory Intro
No ratings yet
Topics: Cache Innovations (Sections 2.4, B.4, B.5), Virtual Memory Intro
20 pages
NVMe Performance Hacks
From Everand
NVMe Performance Hacks
Mei Gates
No ratings yet
SAS Programming Guidelines Interview Questions You'll Most Likely Be Asked
From Everand
SAS Programming Guidelines Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
SQLNOTES
No ratings yet
SQLNOTES
10 pages
Module - Physical Design
No ratings yet
Module - Physical Design
9 pages
Lecture 3.5 TRC and DRC Comparison
No ratings yet
Lecture 3.5 TRC and DRC Comparison
35 pages
Ete A6 en Ug 01 PDF
No ratings yet
Ete A6 en Ug 01 PDF
72 pages
Packet Tracer: Conexión de Un Router A Una LAN
No ratings yet
Packet Tracer: Conexión de Un Router A Una LAN
12 pages
Bi and Data Sharing Sherman
No ratings yet
Bi and Data Sharing Sherman
12 pages
Wa0005.
No ratings yet
Wa0005.
24 pages
Network Programming
No ratings yet
Network Programming
23 pages
Mcqs Computer 10th
No ratings yet
Mcqs Computer 10th
1 page
Asus Aspire 7250
No ratings yet
Asus Aspire 7250
223 pages
Best Tutorial of C#
No ratings yet
Best Tutorial of C#
97 pages
Database Management Systems Practice Problem Set: Recovery
No ratings yet
Database Management Systems Practice Problem Set: Recovery
2 pages
Hyper-V Medium Fast Track For IBM Flex System With V7000
No ratings yet
Hyper-V Medium Fast Track For IBM Flex System With V7000
99 pages
MA5616 Basic Operation
No ratings yet
MA5616 Basic Operation
115 pages
2024 Hocawebde Coğrafya Zor Sorular
No ratings yet
2024 Hocawebde Coğrafya Zor Sorular
143 pages
Midrange+Storage+Performance+Planning+ +Participant+Guide (PDF) 2
No ratings yet
Midrange+Storage+Performance+Planning+ +Participant+Guide (PDF) 2
235 pages
Computer Organization and Architecture (18EC35) - Input/Output Organization (Module 3)
50% (2)
Computer Organization and Architecture (18EC35) - Input/Output Organization (Module 3)
82 pages
P14x TC2 EN 2.1 PDF
No ratings yet
P14x TC2 EN 2.1 PDF
9 pages
Zipato MQTTCloud
No ratings yet
Zipato MQTTCloud
34 pages
CDJ-850 Update Guide EN
No ratings yet
CDJ-850 Update Guide EN
5 pages
OBIEE - Aggregate Tables
No ratings yet
OBIEE - Aggregate Tables
3 pages
Neo4j - Graph Database PDF
No ratings yet
Neo4j - Graph Database PDF
19 pages
Project On SQL Query Processor Using NLP
No ratings yet
Project On SQL Query Processor Using NLP
23 pages
ControlEdge RTU DNP3 Device Profile RTDOC-X346-en-B
No ratings yet
ControlEdge RTU DNP3 Device Profile RTDOC-X346-en-B
45 pages
Js Handbook
No ratings yet
Js Handbook
60 pages
Sequential Circuit: Shreyas Patel M.Tech VLSI Design (VIT, Vellore) SVNIT, Surat
No ratings yet
Sequential Circuit: Shreyas Patel M.Tech VLSI Design (VIT, Vellore) SVNIT, Surat
21 pages
TC-S3154 Spec T-6-4T-O-V2.0
No ratings yet
TC-S3154 Spec T-6-4T-O-V2.0
4 pages
ADMO Brochure ENU PDF
No ratings yet
ADMO Brochure ENU PDF
12 pages
FTP Doc
No ratings yet
FTP Doc
5 pages
Trees
No ratings yet
Trees
76 pages

Jouppi Improving Direct Mapped Cache Performance

Uploaded by

Jouppi Improving Direct Mapped Cache Performance

Uploaded by

Improving Direct-Mapped Cache Performance by the Addition

of a Small Fully-Associative Cache and Prefetch Buffers

Abstract dous increasesin miss cost. For example, a cache miss

100 ccom grr Y- met linpsck liv

Figure 2-2: Baseline design performance

3.3. The Effect of Direct-Mapped Cache Size on

1-7 3.5. Victim Caches and Second-Level Caches

4. Reducing Capacity and Compulsory Misses

4.1. Stream Buffers

To “al 1011~ each.

Figure 4-4: Four-way streambuffer design

4.2. Multi-Way Stream Buffers

You might also like