Week 6 Slides

Lecture 28: MEMORY HIERARCHY DESIGN (PART 1)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
• Programmers want an unlimited amount of memory with very low latency.
• Fast memory technology is more expensive per bit than slower memory.
  – SRAM is more expensive than DRAM; DRAM is more expensive than disk.
• Possible solution?
  – Organize the memory system in several levels, called a memory hierarchy.
  – Exploit the temporal and spatial locality of computer programs.
  – Try to keep the commonly accessed segments of program / data in the faster memories.
  – Results in faster access times on the average.
[Figure: The memory hierarchy — Processor registers, Level-1 cache (split into Instruction and Data caches), Level-2 cache, Level-3 cache, Main memory, Magnetic disk / Flash drive. Size increases going down the hierarchy, while cost per bit and speed increase going up.]
Locality of Reference
• Programs tend to reuse data and instructions they have used recently.
  – Rule of thumb: 90% of the total execution time of a program is spent in only 10% of the code (also called the 90/10 rule).
  – Reason: nested loops in a program, a few procedures calling each other repeatedly, arrays of data items being accessed sequentially, etc.
• Basic idea to exploit this rule:
  – Based on a program's recent past, we can predict with reasonable accuracy what instructions and data will be accessed in the near future.
• The 90/10 rule has two dimensions:
  a) Temporal locality (locality in time): if an item is referenced in memory, it will tend to be referenced again soon.
  b) Spatial locality (locality in space): if an item is referenced in memory, nearby items will tend to be referenced soon.
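As a concrete illustration (a minimal Python sketch, not from the slides; in a language with contiguous arrays the spatial-locality effect is most direct, the sketch only shows the access pattern):

```python
# Minimal illustration of locality of reference (hypothetical example).
data = list(range(100_000))

total = 0
for i in range(len(data)):
    # Temporal locality: 'total' and 'i' are re-referenced on every iteration,
    # so they stay in the fastest levels of the hierarchy.
    # Spatial locality: data[i] is followed by data[i + 1] on the next iteration,
    # so consecutive references fall in the same (or the next) cache block.
    total += data[i]

print(total)
```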
END OF LECTURE 28
Lecture 29: MEMORY HIERARCHY DESIGN (PART 2)
Example 1
• Consider a 2-level memory hierarchy consisting of a cache memory M1 and the main memory M2. Suppose that the cache is 6 times faster than the main memory, and the cache can be used 90% of the time. How much speedup do we gain by using the cache?
  – Here, r = 6 and H = 0.90
  – Thus, S = 1 / [H/r + (1 – H)] = 1 / (0.90/6 + 0.10) = 1 / 0.25 = 4

Example 2
• Consider a 2-level memory hierarchy with separate instruction and data caches in level 1, and main memory in level 2.
[Figure: CPU connected to an I-Cache and a D-Cache, both backed by Main Memory.]
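Returning to Example 1, its speedup formula is easy to check numerically; a minimal Python sketch (the names r and H follow the slide):

```python
def speedup(r, H):
    """Speedup from a cache that is r times faster than main memory
    and can serve a fraction H of all accesses (Example 1's formula)."""
    return 1.0 / (H / r + (1.0 - H))

print(speedup(r=6, H=0.90))   # 4.0, matching the slide
```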
Implications of a Memory Hierarchy to the CPU
• Processors designed without a memory hierarchy are simpler because all memory accesses take the same amount of time.
  – Misses in a memory hierarchy imply variable memory access times as seen by the CPU.
• Some mechanism is required to determine whether or not the requested information is present in the top level of the memory hierarchy.
  – The check happens on every memory access and affects hit time.
  – Implemented in hardware to provide acceptable performance.
• Some mechanism is required to transfer blocks between consecutive levels.
  – If the block transfer requires 10's of clock cycles (as in the cache / main memory hierarchy), it is controlled by hardware.
  – If the block transfer requires 1000's of clock cycles (as in the main memory / secondary memory hierarchy), it can be controlled by software.
• Four main questions:
  1. Block Placement: Where to place a block in the upper level?
  2. Block Identification: How is a block found if present in the upper level?
  3. Block Replacement: Which block is to be replaced on a miss?
  4. Write Strategy: What happens on a write?
• The Cache / Main Memory hierarchy is managed by hardware.
  – Main objective: provide fast average memory access.
• The Main Memory / Secondary Memory hierarchy consists of 2 levels and is managed by software (the operating system).
  – Main objective: provide large memory space for users (virtual memory).

Core-i7 Sandybridge
• L2 cache: 256 KB, 8-way set associative; access: 11 cycles.
• L3 unified cache (shared by all cores): 8 MB, 16-way set associative; access: 30-40 cycles.
• Block size: 64 bytes for all caches.
• Main memory below the L3 cache.
Introduction
• Let us consider a single-level cache, and that part of the memory hierarchy consisting of cache memory and main memory.
• Cache memory is logically divided into blocks or lines, where every block (line) typically contains 8 to 256 bytes.
• When the CPU wants to access a word in memory, special hardware first checks whether it is present in cache memory.
  – If so (called a cache hit), the word is directly accessed from the cache memory.
  – If not, the block containing the requested word is brought from main memory to cache.
  – For writes, sometimes the CPU can also directly write to main memory.
• The objective is to keep the commonly used blocks in the cache memory.
  – This will result in significantly improved performance due to the property of locality of reference.

Q1. Where can a block be placed in the cache?
• This is determined by a mapping algorithm.
  – It specifies which main memory blocks can reside in which cache memory blocks.
  – At any given time, only a small subset of the main memory blocks can be held in the cache.
• Three common block mapping techniques are used:
  a) Direct Mapping
  b) Associative Mapping
  c) (N-way) Set Associative Mapping
• The algorithms shall be explained with the help of an example.
• Consider a 2-level cache memory / main memory hierarchy.
  – The cache memory consists of 256 blocks (lines) of 32 words each.
    • Total cache size is 8192 (8K) words.
  – Main memory is addressable by a 24-bit address.
    • Total size of the main memory is 2^24 = 16 M words.
    • Number of 32-word blocks in main memory = 16 M / 32 = 512K.

Direct Mapping
• Each main memory block can be placed in only one block in the cache.
• The mapping function is:
      Cache Block = (Main Memory Block) % (Number of cache blocks)
• For the example,
      Cache Block = (Main Memory Block) % 256
• Some example mappings:
      0 → 0, 1 → 1, 255 → 255, 256 → 0, 257 → 1, 512 → 0, 513 → 1, etc.
[Figure: Direct mapping — cache memory blocks 0..255 with their TAG fields, and main memory blocks 255, 256, 257, ... mapping onto fixed cache blocks.]
• Several main memory blocks map to the same cache block (e.g. MM blocks 0, 256, 512, ... all map to cache block 0).
  – May lead to poor performance if both of the conflicting blocks are frequently used.
• The MM address is divided into three fields: TAG, BLOCK and WORD.
  – When a new block is loaded into the cache, the 8-bit BLOCK field determines the cache block where it is to be stored.
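For this example the 24-bit address therefore splits into an 11-bit TAG, an 8-bit BLOCK and a 5-bit WORD field. A minimal Python sketch of the direct-mapping address split (field widths derived from the example parameters, not stated on the slides):

```python
WORD_BITS  = 5   # 32 words per block
BLOCK_BITS = 8   # 256 cache blocks
TAG_BITS   = 24 - BLOCK_BITS - WORD_BITS   # 11 bits

def direct_map(addr):
    """Split a 24-bit word address into (tag, cache_block, word) fields."""
    word        = addr & ((1 << WORD_BITS) - 1)
    mm_block    = addr >> WORD_BITS                 # main memory block number
    cache_block = mm_block % (1 << BLOCK_BITS)      # same as taking the 8 BLOCK bits
    tag         = addr >> (WORD_BITS + BLOCK_BITS)
    return tag, cache_block, word

# Main memory blocks 0, 256 and 512 all map to cache block 0, as on the slide.
for mm_block in (0, 1, 255, 256, 257, 512, 513):
    print(mm_block, "->", direct_map(mm_block << WORD_BITS)[1])
```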
Associative Mapping
[Figure: Associative mapping — any main memory block (..., 255, 256, 257, ...) can be loaded into any cache block; each cache block has a TAG.]
• The memory address is divided into two fields: TAG and WORD.
  – When a block is loaded into the cache from MM, the higher order 19 bits of the address are stored into the TAG register corresponding to the cache block.
  – When accessing memory, the 19-bit TAG field of the address is compared (in parallel) with the TAG registers of all the cache blocks.
(N-way) Set Associative Mapping
[Figure: Set associative mapping — the cache is organized as sets Set 0 .. Set 63; main memory blocks 255, 256, 257, ..., 512K-1 map to sets. The memory address is divided into TAG (13 bits), SET (6 bits) and WORD (5 bits) fields.]
• This algorithm is a balance of direct mapping and associative mapping.
  – Like direct mapping, a MM block is mapped to a set:
        Set Number = (MM Block Number) % (Number of Sets in Cache)
  – The block can be placed anywhere within the set (there are N choices).
• The value of N is a design parameter:
  – N = 1 :: same as direct mapping.
  – N = number of cache blocks :: same as associative mapping.
  – Typical values of N used in practice are 2, 4 or 8.
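A minimal sketch of the set-number computation for the running example, assuming the 4-way organization implied by its 64 sets (256 blocks / 4 ways):

```python
WORD_BITS = 5    # 32 words per block
SET_BITS  = 6    # 64 sets (256 blocks, 4-way)
TAG_BITS  = 24 - SET_BITS - WORD_BITS   # 13 bits

def set_map(addr):
    """Split a 24-bit word address into (tag, set_number, word) for the 4-way cache."""
    word     = addr & ((1 << WORD_BITS) - 1)
    mm_block = addr >> WORD_BITS
    set_no   = mm_block % (1 << SET_BITS)    # Set Number = MM block % number of sets
    tag      = addr >> (WORD_BITS + SET_BITS)
    return tag, set_no, word

# MM blocks 0, 64, 128, ... all map to set 0; any of the 4 ways in that set may hold them.
for mm_block in (0, 1, 63, 64, 65, 128):
    print(mm_block, "-> set", set_map(mm_block << WORD_BITS)[1])
```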
• To implement the LRU algorithm, the cache controller must track the LRU block as the computation proceeds.
• Example: Consider a 4-way set associative cache.
  – For tracking the LRU block within a set, we use a 2-bit counter with every block.
  – When a hit occurs:
    • The counter of the referenced block is reset to 0.
    • Counters with values originally lower than the referenced one are incremented by 1, and all others remain unchanged.
  – When a miss occurs:
    • If the set is not full, the counter associated with the new block loaded is set to 0, and all other counters are incremented by 1.
    • If the set is full, the block with counter value 3 is removed, the new block is put in its place, and its counter is set to 0. The other three counters are incremented by 1.
• It may be verified that the counter values of occupied blocks are all distinct.

• An example (counter values of the four blocks of one set after each access; 'x' = unoccupied):

               Initial   Miss:Blk2  Miss:Blk0  Miss:Blk3  Miss:Blk1  Hit:Blk0
    Block 0       x          x          0          1          2         0
    Block 1       x          x          x          x          0         1
    Block 2       x          0          1          2          3         3
    Block 3       x          x          x          0          1         2

               Miss:Blk2  Hit:Blk3   Hit:Blk2   Hit:Blk0   Miss:Blk1  Hit:Blk1
    Block 0       1          2          2          0          1         1
    Block 1       2          3          3          3          0         0
    Block 2       0          1          0          1          2         2
    Block 3       3          0          1          2          3         3
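A small Python simulation of these 2-bit LRU counters (a sketch following the rules above; block numbers are positions within one 4-way set) reproduces the counter values in the table:

```python
# 2-bit LRU counters for one 4-way set. None marks an unoccupied block ('x').

def access(counters, pos, hit):
    """Update the four counters after a hit or miss on cache block position 'pos'."""
    if hit:
        ref = counters[pos]
        for i, c in enumerate(counters):
            if c is not None and c < ref:
                counters[i] = c + 1      # blocks more recently used than 'pos' age by 1
        counters[pos] = 0                # referenced block becomes most recently used
    else:
        for i, c in enumerate(counters):
            if c is not None:
                counters[i] = c + 1      # every occupied block ages by 1
        counters[pos] = 0                # new block loaded into 'pos' is most recently used

counters = [None, None, None, None]      # initially all four blocks unoccupied
trace = [("miss", 2), ("miss", 0), ("miss", 3), ("miss", 1), ("hit", 0),
         ("miss", 2), ("hit", 3), ("hit", 2), ("hit", 0), ("miss", 1), ("hit", 1)]

for kind, pos in trace:
    access(counters, pos, hit=(kind == "hit"))
    print(f"{kind:4} block {pos}:", ["x" if c is None else c for c in counters])
```

Note that on a miss into a full set, the evicted block is exactly the one whose counter has reached 3, so incrementing all counters and then resetting the replaced one gives the same result as the rule stated above.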
[Figure: CPU with split L1 iCache / dCache (M1), unified L2 cache (M2), L3 cache (M3), and main memory (M4). For Intel Core-i7 Sandybridge: M1 & M2 – within core, M3 – within chip, M4 – outside chip.]

Example 1
• Consider a CPU with average CPI of 1.1.
  – Assume an instruction mix: ALU – 50%, LOAD – 15%, STORE – 15%, BRANCH – 20%.
  – Assume a cache miss rate of 1.5%, and a miss penalty of 50 cycles (= tMM).
  – Calculate the effective CPI for a unified L1 cache, using write through and no write allocate, with:
    a) No write buffer
    b) Perfect write buffer
    c) Realistic write buffer that eliminates 85% of write stalls.
• Number of memory accesses per instruction = 1 + 0.15 + 0.15 = 1.3
• % Reads = (1 + 0.15) / 1.3 = 88.5%;  % Writes = 0.15 / 1.3 = 11.5%
• Solution:
  a) With no write buffer (i.e. stall on all writes)
     • Memory stalls / instr. = 1.3 x 50 x (88.5% x 1.5% + 11.5%) = 8.33 cycles
     • CPI = CPIavg + Memory stalls / instr. = 1.1 + 8.33 = 9.43
  b) With perfect write buffer (i.e. all write stalls are eliminated)
     • Memory stalls / instr. = 1.3 x 50 x (88.5% x 1.5%) = 0.86 cycles
     • CPI = 1.1 + 0.86 = 1.96
  c) With realistic write buffer (85% of write stalls are eliminated)
     • Memory stalls / instr. = 1.3 x 50 x (88.5% x 1.5% + 15% x 11.5%) = 1.98 cycles
     • CPI = 1.1 + 1.98 = 3.08

Example 2
• Consider a CPU with average CPI of 1.1.
  – Assume the instruction mix: ALU – 50%, LOAD – 15%, STORE – 15%, BRANCH – 20%.
  – Assume a cache miss rate of 1.5%, and a miss penalty of 50 cycles (= tMM).
  – Calculate the effective CPI for a unified L1 cache, using write back and write allocate, with the probability of a cache block being dirty being 10%.
• Number of memory accesses per instruction = 1 + 0.15 + 0.15 = 1.3
• Solution:
  – Memory accesses per instruction = 1.3
  – Stalls / access = (1 – H_L1) x (t_MM x %clean + 2 x t_MM x %dirty)
                    = 1.5% x (50 x 90% + 100 x 10%) = 0.825 cycles
  – Average memory access time = 1 + stalls/access = 1 + 0.825 = 1.825 cycles
  – Memory stalls / instr. = 1.3 x 0.825 = 1.07 cycles
  – Thus, effective CPI = 1.1 + 1.07 = 2.17
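The arithmetic of both worked examples can be re-checked with a short Python sketch (variable names are mine; the formulas are the ones the slides use):

```python
CPI_BASE  = 1.1
ACCESSES  = 1.3     # memory accesses per instruction: 1 fetch + 0.15 load + 0.15 store
READS     = 0.885   # fraction of accesses that are reads  (slide's rounded figure)
WRITES    = 0.115   # fraction of accesses that are writes (slide's rounded figure)
MISS_RATE = 0.015
T_MM      = 50      # miss penalty in cycles

# Example 1: write through, no write allocate.
def cpi_write_through(write_stall_fraction):
    """write_stall_fraction = fraction of write stalls NOT removed by the write buffer."""
    stalls = ACCESSES * T_MM * (READS * MISS_RATE + write_stall_fraction * WRITES)
    return CPI_BASE + stalls

print(cpi_write_through(1.00))   # (a) no write buffer       -> 9.44 (slide rounds to 9.43)
print(cpi_write_through(0.00))   # (b) perfect write buffer  -> ~1.96
print(cpi_write_through(0.15))   # (c) 85% of stalls removed -> ~3.08

# Example 2: write back, write allocate, 10% of replaced blocks dirty.
stalls_per_access = MISS_RATE * (T_MM * 0.90 + 2 * T_MM * 0.10)   # 0.825 cycles
print(CPI_BASE + ACCESSES * stalls_per_access)                    # ~2.17
```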
END OF LECTURE 31
Lecture 32: IMPROVING CACHE PERFORMANCE
Introduction
• We shall discuss various techniques with which the performance of cache memory can be improved.
• We consider the following expression for the average memory access time (AMAT):
      AMAT = Hit time + Miss rate x Miss penalty
• When we talk about improving the performance of cache memory systems, we can try to reduce one or more of the three parameters: Hit time, Miss rate, Miss penalty.
• We can categorize the techniques into three categories based on the parameter that is being optimized:
  – Reducing the miss rate: we can use larger block size, larger cache size, and higher associativity.
  – Reducing the miss penalty: we can use multi-level caches and give priority to reads over writes.
  – Reducing the cache hit time: we can avoid the address translation when indexing the cache.

(a) Larger Block Size
• Increasing the block size helps in reducing the miss rate.
  – See plot on the next slide.
• Larger blocks also reduce compulsory misses.
  – Since larger blocks can take better advantage of spatial locality.
• Drawbacks:
  – The miss penalty increases, as it is required to transfer larger blocks.
  – Since the number of cache blocks decreases, the number of conflict misses and even capacity misses can increase.
  – The overheads may outweigh the gain.
• The local miss rate is large for the L2 cache because the L1 cache takes out a major fraction of the total memory accesses.
• For this purpose, the global miss rate is a more useful measure.
  – Fraction of the memory accesses generated by the processor that go all the way to main memory.
• A useful measure:
      Average Memory Stalls per Instr. = Misses-per-instr(L1) x HitTime(L2) + Misses-per-instr(L2) x MissPenalty(L2)

Example 1
• Suppose that in 1000 memory references there are 60 misses in the L1 cache and 15 misses in the L2 cache. What are the various miss rates?
• Assume that MissPenalty(L2) is 180 clock cycles, HitTime(L1) is 1 clock cycle, and HitTime(L2) is 12 clock cycles. What will be the average memory access time? Ignore the impact of writes.
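The slides' worked answer is not included in this extract; a sketch of the computation using the standard definitions of local and global miss rates and the two-level AMAT formula (my own calculation):

```python
refs      = 1000
misses_L1 = 60
misses_L2 = 15          # references that also miss in L2 and go to main memory

hit_time_L1     = 1     # cycles
hit_time_L2     = 12    # cycles
miss_penalty_L2 = 180   # cycles

local_miss_L1  = misses_L1 / refs          # 0.06  -> 6%
local_miss_L2  = misses_L2 / misses_L1     # 0.25  -> 25% of L1 misses also miss in L2
global_miss_L2 = misses_L2 / refs          # 0.015 -> 1.5% of all references reach memory

# AMAT = HitTime(L1) + MissRate(L1) x (HitTime(L2) + LocalMissRate(L2) x MissPenalty(L2))
amat = hit_time_L1 + local_miss_L1 * (hit_time_L2 + local_miss_L2 * miss_penalty_L2)
print(local_miss_L1, local_miss_L2, global_miss_L2, amat)   # 0.06 0.25 0.015 4.42
```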
(e) Giving Priority to Reads over Writes
• The simplest solution is to make the read miss wait until the write buffer is empty.
  – As an alternative, check the contents of the write buffer for any conflict; if there is none, the read miss can continue → reduces the read miss penalty.
  – Most desktops and servers follow this approach, giving priority to reads over writes.

• Example: Assume a direct-mapped write-through cache that maps the words at addresses 512 and 1024 to the same block, and a 4-word write buffer that is not checked on a read miss. Will the values of $t1 and $t3 always be equal?
  – The data in $t1 is stored in the write buffer after the SW.
  – Without proper precautions, the second LW may load the wrong value, and thus $t1 and $t3 may be unequal.
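The load/store sequence the slide refers to is not reproduced in this extract. As a purely illustrative toy model (my own sketch, not the slide's code), the following Python shows why a read miss that bypasses an unchecked write buffer can observe a stale value:

```python
# Toy model: write-through cache with a write buffer that reads do NOT check.
memory       = {512: 0, 1024: 0}
write_buffer = []                        # pending (address, value) writes

def store(addr, value):
    write_buffer.append((addr, value))   # the write goes to the buffer first

def load_unchecked(addr):
    return memory[addr]                  # read miss served from memory, buffer ignored

def drain_buffer():
    while write_buffer:
        addr, value = write_buffer.pop(0)
        memory[addr] = value

t1 = 99
store(512, t1)                # SW: $t1 -> address 512 (sits in the write buffer)
_  = load_unchecked(1024)     # LW: miss on 1024 (same cache block as 512 in the example)
t3 = load_unchecked(512)      # LW: miss on 512 served from memory BEFORE the buffer drains
print(t1 == t3)               # False -- the stale value was read

drain_buffer()                # checking or draining the buffer first avoids the problem
```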
(f) Avoiding Address Translation during Cache Indexing
• Even a small and simple cache must cope with the translation of a virtual address to a physical address to access memory.
• An idea to make the common case fast:
  – We use virtual addresses for the cache, since hits are much more common than misses.
  – Such caches are termed virtual caches.
• Drawbacks:
  – Page level protection is not possible.
  – Context switching and I/O (which use physical addresses) further complicate the design.

Some Additional Cache Optimizations
1. Use small and simple first-level caches to reduce hit time
2. Way prediction to reduce hit time
3. Pipelined cache access to increase cache bandwidth
4. Multi-banked caches to increase cache bandwidth
5. Critical Word First and Early Restart to reduce miss penalty
6. Compiler optimizations to reduce miss rate
7. Prefetching of instructions and data to reduce miss penalty or miss rate
END OF LECTURE 32