Week6 Slides

The document discusses the memory hierarchy design, which organizes memory into multiple levels from fastest and smallest (registers and cache) to slowest and largest (magnetic disk). It explains that the memory hierarchy exploits temporal and spatial locality to keep frequently accessed data in faster memory closer to the CPU. The memory hierarchy aims to reduce the performance gap between fast processors and relatively slow main memory.



Lecture 28: MEMORY HIERARCHY DESIGN (PART 1)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
• Programmers want an unlimited amount of memory with very low latency.
• Fast memory technology is more expensive per bit than slower memory.
  – SRAM is more expensive than DRAM; DRAM is more expensive than disk.
• Possible solution?
  – Organize the memory system in several levels, called a memory hierarchy.
  – Exploit the temporal and spatial locality of computer programs.
  – Try to keep the commonly accessed segments of program / data in the faster memories.
  – Results in faster access times on the average.

Quick Review of Memory Technology

• Static RAM:
  – Very fast but expensive memory technology (requires 6 transistors / bit).
  – Packing density is limited.
• Dynamic RAM:
  – Significantly slower than SRAM, but much less expensive (1 transistor / bit).
  – Requires periodic refreshing.
• Flash memory:
  – Non-volatile memory technology that uses floating-gate MOS transistors.
  – Slower than DRAM, but higher packing density and lower cost per bit.
• Magnetic disk:
  – Provides a large amount of storage, with very low cost per bit.
  – Much slower than DRAM, and also than flash memory.
  – Requires mechanical moving parts, and uses magnetic recording technology.

Memory Hierarchy
• The memory system is organized in several levels, using progressively faster technologies as we move towards the processor.
  – The entire addressable memory space is available in the largest (but slowest) memory (typically, magnetic disk or flash storage).
  – We incrementally add smaller (but faster) memories, each containing a subset of the data stored in the memory below it.
  – We proceed in steps towards the processor.
• Typical hierarchy (starting with closest to the processor):
  1. Processor registers
  2. Level-1 cache (typically divided into separate instruction and data cache)
  3. Level-2 cache
  4. Level-3 cache
  5. Main memory
  6. Secondary memory (magnetic disk / flash drive)
• As we move away from the processor:
  – Size increases
  – Cost decreases
  – Speed decreases


[Figure: The levels of the memory hierarchy – processor registers, Level-1 cache (split into instruction cache and data cache), Level-2 cache, Level-3 cache, main memory, and magnetic disk / flash drive – with size increasing, and cost per bit and speed decreasing, as we move away from the processor.]

A Comparison

    Level            Typical Access Time   Typical Capacity   Other Features
    Register         300-500 ps            500-1000 B         On-chip
    Level-1 cache    1-2 ns                16-64 KB           On-chip
    Level-2 cache    5-20 ns               256 KB – 2 MB      On-chip
    Level-3 cache    20-50 ns              1-32 MB            On or off chip
    Main memory      50-100 ns             1-16 GB
    Magnetic disk    5-50 ms               100 GB – 16 TB

Major Obstacle in Memory System Design
• Processor is much faster than main memory.
  – Has to spend much of the time waiting while instructions and data are being fetched from main memory.
  – Memory speed cannot be increased beyond a certain point.

Impact of Processor / Memory Performance Gap

    Year   CPU Clock   Clock Cycle   Memory Access   Minimum CPU Stall Cycles
    1986   8 MHz       125 ns        190 ns          190 / 125 – 1 = 0.5
    1989   33 MHz      30 ns         165 ns          165 / 30 – 1 = 4.5
    1992   60 MHz      16.6 ns       120 ns          120 / 16.6 – 1 = 6.2
    1996   200 MHz     5 ns          110 ns          110 / 5 – 1 = 21.0
    1998   300 MHz     3.33 ns       100 ns          100 / 3.33 – 1 = 29.0
    2000   1 GHz       1 ns          90 ns           90 / 1 – 1 = 89.0
    2002   2 GHz       0.5 ns        80 ns           80 / 0.5 – 1 = 159.0
    2004   3 GHz       0.33 ns       60 ns           60 / 0.33 – 1 = 179.0

    Ideal memory access time = 1 CPU cycle; real memory access time >> 1 CPU cycle.

• Memory Latency Reduction Techniques:
  – Faster DRAM cells (depends on VLSI technology)
  – Wider memory bus width (fewer memory accesses needed)
  – Multiple memory banks
  – Integration of the memory controller with the processor
  – New emerging RAM technologies
• Memory Latency Hiding Techniques:
  – Memory hierarchy (using SRAM-based cache memories)
  – Pre-fetching instructions and/or data from memory before they are actually needed (used to hide long memory access latency)


Locality of Reference
• Programs tend to reuse data and instructions they have used recently.
  – Rule of thumb: 90% of the total execution time of a program is spent in only 10% of the code (also called the 90/10 rule).
  – Reason: nested loops in a program, few procedures calling each other repeatedly, arrays of data items being accessed sequentially, etc.
• Basic idea to exploit this rule:
  – Based on a program's recent past, we can predict with a reasonable accuracy what instructions and data will be accessed in the near future.

• The 90/10 rule has two dimensions:
  a) Temporal locality (locality in time)
     • If an item is referenced in memory, it will tend to be referenced again soon.
  b) Spatial locality (locality in space)
     • If an item is referenced in memory, nearby items will tend to be referenced soon.

(a) Temporal Locality
• Recently executed instructions are likely to be executed again very soon.
• Example: computing the factorial of a number.

      fact = 1;
      for k = 1 to N
          fact = fact * k;

            ADDI $t1,$zero,1       # $t1 = fact = 1
            ADDI $t2,$zero,N       # $t2 = N
            ADDI $t3,$zero,1       # $t3 = k = 1
      Loop: MUL  $t1,$t1,$t3       # fact = fact * k
            ADDI $t3,$t3,1         # k = k + 1
            SGT  $t4,$t3,$t2       # $t4 = (k > N)
            BEQZ $t4,Loop          # repeat while k <= N

• The four instructions in the loop are executed more frequently than the others.

(b) Spatial Locality
• Instructions residing close to a recently executing instruction are likely to be executed soon.
• Example: accessing the elements of an array.

      sum = 0;
      for k = 1 to N
          sum = sum + A[k];

            SUB  $t1,$t1,$t1       # $t1 = sum = 0
            ADDI $t2,$zero,N       # $t2 = N
            ADDI $t3,$zero,1       # $t3 = k = 1
            ADDI $t5,$zero,A       # $t5 = starting address of A
      Loop: LW   $t8,0($t5)        # $t8 = A[k]
            ADD  $t1,$t1,$t8       # sum = sum + A[k]
            ADDI $t5,$t5,4         # advance to the next element of A
            ADDI $t3,$t3,1         # k = k + 1
            SGT  $t4,$t3,$t2       # $t4 = (k > N)
            BEQZ $t4,Loop          # repeat while k <= N

• Performance can be improved by copying the array into cache memory.

Performance of Memory Hierarchy
• We first consider a 2-level hierarchy consisting of two levels of memory, say, M1 and M2.
  [Figure: CPU – M1 – M2]
• Cost:
  – Let ci denote the cost per bit of memory Mi, and Si denote the storage capacity in bits of Mi.
  – The average cost per bit of the memory hierarchy is given by:
        c = (c1.S1 + c2.S2) / (S1 + S2)
  – In order to have c → c2, we must ensure that S1 << S2.


• Hit Ratio / Hit Rate:
  – The hit ratio H is defined as the probability that a logical address generated by the CPU refers to information stored in M1.
  – We can determine H experimentally as follows:
    • A set of representative programs is executed or simulated.
    • The number of references to M1 and M2, denoted by N1 and N2 respectively, are recorded.
          H = N1 / (N1 + N2)
  – The quantity (1 – H) is called the miss ratio.

• Access Time:
  – Let tA1 and tA2 denote the access times of M1 and M2 respectively, relative to the CPU.
  – The average time required by the CPU to access a word in memory can be expressed as:
          tA = H.tA1 + (1 – H).tMISS
    where tMISS denotes the time required to handle the miss, called the miss penalty.

• The miss penalty tMISS can be estimated in various ways:
  a) The simplest approach is to set tMISS = tA2, that is, when there is a miss the data is accessed directly from M2.
  b) A request for a word not in M1 typically causes a block containing the requested word to be transferred from M2 to M1. After completion of the block transfer, the word can be accessed in M1.
     If tB denotes the block transfer time, we can write
          tMISS = tB + tA1        [since tB >> tA1, tA2 ≈ tB]
     Thus, tA = H.tA1 + (1 – H).(tB + tA1)
  c) If tHIT denotes the time required to check whether there is a hit, we can write
          tMISS = tHIT + tB + tA1

• Efficiency:
  – Let r = tA2 / tA1 denote the access time ratio of the two levels of memory.
  – We define the access efficiency as e = tA1 / tA, which is the factor by which tA differs from its minimum possible value.
          e = tA1 / [H.tA1 + (1 – H).tA2] = 1 / [H + (1 – H).r]

• Speedup:
  – The speedup gained by using the memory hierarchy is defined as S = tA2 / tA.
  – We can write:
          S = tA2 / [H.tA1 + (1 – H).tA2] = 1 / [H / r + (1 – H)]
  – The same result follows from Amdahl's law.

Some Common Terminologies Used
• Block: The smallest unit of information transferred between two levels.
• Hit Rate: The fraction of memory accesses found in the upper level.
• Hit Time: Time to access the upper level.
  – Upper level access time + Time to determine hit/miss
• Miss: The data item needs to be retrieved from a block in the lower level.
• Miss Rate: The fraction of memory accesses not found in the upper level.
• Miss Penalty: Overhead incurred whenever a miss occurs.
  – Time to replace a block in the upper level + Time to transfer the missed block
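To make the two-level formulas above concrete, here is a small sketch (Python, not part of the original slides) that computes the average cost per bit, average access time, access efficiency and speedup; the parameter values at the bottom are illustrative assumptions.

```python
# Sketch: metrics of a 2-level hierarchy M1 (fast, small) / M2 (slow, large).
def hierarchy_metrics(H, tA1, tA2, c1=None, c2=None, S1=None, S2=None):
    r = tA2 / tA1                        # access time ratio of the two levels
    tA = H * tA1 + (1 - H) * tA2         # average access time, taking tMISS = tA2
    metrics = {
        "tA": tA,
        "efficiency": tA1 / tA,          # e = 1 / (H + (1 - H).r)
        "speedup": tA2 / tA,             # S = 1 / (H / r + (1 - H))
    }
    if None not in (c1, c2, S1, S2):     # average cost per bit, if costs and sizes given
        metrics["cost_per_bit"] = (c1 * S1 + c2 * S2) / (S1 + S2)
    return metrics

# Assumed values: 5 ns cache, 50 ns main memory, hit ratio 0.95.
print(hierarchy_metrics(H=0.95, tA1=5, tA2=50))
```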


END OF LECTURE 28

Lecture 29: MEMORY HIERARCHY DESIGN (PART 2)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Example 1
• Consider a 2-level memory hierarchy consisting of a cache memory M1 and the main memory M2. Suppose that the cache is 6 times faster than the main memory, and the cache can be used 90% of the time. How much speedup do we gain by using the cache?
  – Here, r = 6 and H = 0.90
  – Thus, S = 1 / [H / r + (1 – H)] = 1 / (0.90 / 6 + 0.10) = 1 / 0.25 = 4

Example 2
• Consider a 2-level memory hierarchy with separate instruction and data caches in level 1, and main memory in level 2.
  [Figure: CPU connected to an I-Cache and a D-Cache, both backed by Main Memory]

• The following parameters are given:
  – The clock cycle time is 2 ns.
  – The miss penalty is 15 clock cycles (for both read and write).
  – 1% of instructions are not found in the I-cache.
  – 8% of data references are not found in the D-cache.
  – 20% of the total memory accesses are for data.
  – Cache access time (including hit detection) is 1 clock cycle.

  [Figure: All memory accesses split into 0.80 instruction accesses (0.99 I-cache hit, 0.01 miss) and 0.20 data accesses (0.92 D-cache hit, 0.08 miss).]

  tMISS = 1 + 15 = 16 cycles
  Average number of cycles per access:
      0.80 x (0.99 x 1 + 0.01 x 16) + 0.20 x (0.92 x 1 + 0.08 x 16) = 0.92 + 0.44 = 1.36
  Thus, average access time tA = 1.36 x 2 ns = 2.72 ns
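The same computation, written out as a short Python check (not from the slides), using the numbers given above:

```python
# Sketch: average access time with split I- and D-caches (values from Example 2 above).
hit_time, miss_time = 1, 16                 # cycles; tMISS = 1 + 15
f_instr, f_data = 0.80, 0.20                # fractions of all memory accesses
hit_i, hit_d = 0.99, 0.92                   # I-cache and D-cache hit rates

cycles_per_access = (f_instr * (hit_i * hit_time + (1 - hit_i) * miss_time) +
                     f_data  * (hit_d * hit_time + (1 - hit_d) * miss_time))
print(cycles_per_access)                    # ≈ 1.36 cycles
print(cycles_per_access * 2, "ns")          # ≈ 2.72 ns with a 2 ns clock cycle
```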


Performance Calculation for Multi-Level Hierarchy
• Most of the practical memory systems use more than 2 levels of hierarchy.
  [Figure: CPU – L1 Cache (M1) – L2 Cache (M2) – Main Memory (M3)]
  [Figure: CPU – L1 Cache (M1) – L2 Cache (M2) – L3 Cache (M3) – Main Memory (M4) – Magnetic Disk (M5); M1 to M4 are managed by hardware, M4 to M5 by the operating system]
• Consider a 3-level hierarchy consisting of L1-cache, L2-cache and main memory.
      tL1 : access time of M1
      tL2 : access time of M2
      HL1 : hit ratio of M1
      HL2 : hit ratio of M2 with respect to the residual accesses that try to access M2
• Whenever there is a miss in L1, we go to L2.
• The average access time can be calculated as:
      tA = HL1.tL1 + (1 – HL1).[HL2.tL2 + (1 – HL2).tMISS2]
• Here, tMISS2 is the miss penalty when the requested data is found neither in M1 nor in M2.
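As a sketch of how the 3-level formula above is applied (Python, not part of the slides; the timing and hit-ratio values are assumptions):

```python
# Sketch: average access time for a 3-level hierarchy (L1, L2, main memory).
def avg_access_time(h_l1, t_l1, h_l2, t_l2, t_miss2):
    # h_l2 is the hit ratio of L2 measured on the accesses that miss in L1.
    return h_l1 * t_l1 + (1 - h_l1) * (h_l2 * t_l2 + (1 - h_l2) * t_miss2)

# Assumed values: 1-cycle L1, 10-cycle L2, 100-cycle miss penalty to main memory.
print(avg_access_time(h_l1=0.95, t_l1=1, h_l2=0.80, t_l2=10, t_miss2=100))
```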

Implications of a Memory Hierarchy to the CPU
• Processors designed without a memory hierarchy are simpler because all memory accesses take the same amount of time.
  – Misses in a memory hierarchy imply variable memory access times as seen by the CPU.
• Some mechanism is required to determine whether or not the requested information is present in the top level of the memory hierarchy.
  – The check happens on every memory access and affects hit time.
  – Implemented in hardware to provide acceptable performance.
• Some mechanism is required to transfer blocks between consecutive levels.
  – If the block transfer requires 10's of clock cycles (like in the cache / main memory hierarchy), it is controlled by hardware.
  – If the block transfer requires 1000's of clock cycles (like in the main memory / secondary memory hierarchy), it can be controlled by software.
• Four main questions:
  1. Block Placement: Where to place a block in the upper level?
  2. Block Identification: How is a block found if present in the upper level?
  3. Block Replacement: Which block is to be replaced on a miss?
  4. Write Strategy: What happens on a write?

Common Memory Hierarchies
• In a typical computer system, the memory system is managed as two different hierarchies.
  – The Cache / Main Memory hierarchy, which consists of 2 to 4 levels and is managed by hardware.
    • Main objective: provide fast average memory access.
  – The Main Memory / Secondary Memory hierarchy, which consists of 2 levels and is managed by software (the operating system).
    • Main objective: provide large memory space for users (virtual memory).

Intel Core-i7 Cache Hierarchy
• L1 i-cache and d-cache: 32 KB, 8-way set associative, access time 4 cycles.
• L2 unified cache: 256 KB, 8-way set associative, access time 11 cycles.
• L3 unified cache (shared by all cores): 8 MB, 16-way set associative, access time 30-40 cycles.
• Block size: 64 bytes for all caches.

[Figure: Processor package with cores 0 to 3, each having its own registers, L1 d-cache, L1 i-cache and L2 unified cache; a single L3 unified cache shared by all cores; main memory below. Adapted from CS@VT Computer Organization II, © 2005-2013 CS:APP & McQuain.]


Core-i7 Sandybridge
• Every core has its own L1 and L2 caches.
• The L3 cache is shared by all the cores and is also on-chip.

[Figure: Intel Core-i7 Sandybridge – per-core L1 and L2 caches, shared on-chip L3 cache. Adapted from CS@VT Computer Organization II, © 2005-2013 CS:APP & McQuain.]

END OF LECTURE 29

Lecture 30: CACHE MEMORY (PART 1)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
• Let us consider a single-level cache, and that part of the memory hierarchy consisting of cache memory and main memory.
  [Figure: CPU – Cache Memory – Main Memory]

• Cache memory is logically divided into blocks or lines, where every block (line) typically contains 8 to 256 bytes.
• When the CPU wants to access a word in memory, special hardware first checks whether it is present in the cache memory.
  – If so (called a cache hit), the word is directly accessed from the cache memory.
  – If not, the block containing the requested word is brought from main memory to the cache.
  – For writes, sometimes the CPU can also directly write to main memory.
• The objective is to keep the commonly used blocks in the cache memory.
  – Will result in significantly improved performance due to the property of locality of reference.

Q1. Where can a block be placed in the cache?
• This is determined by some mapping algorithm.
  – Specifies which main memory blocks can reside in which cache memory blocks.
  – At any given time, only a small subset of the main memory blocks can be held in the cache.
• Three common block mapping techniques are used:
  a) Direct Mapping
  b) Associative Mapping
  c) (N-way) Set Associative Mapping
• The algorithms shall be explained with the help of an example.


Example: A 2-level memory hierarchy
• Consider a 2-level cache memory / main memory hierarchy.
  – The cache memory consists of 256 blocks (lines) of 32 words each.
    • Total cache size is 8192 (8K) words.
  – Main memory is addressable by a 24-bit address.
    • Total size of the main memory is 2^24 = 16 M words.
    • Number of 32-word blocks in main memory = 16 M / 32 = 512K.

(a) Direct Mapping
• Each main memory block can be placed in only one block in the cache.
• The mapping function is:
      Cache Block = (Main Memory Block) % (Number of Cache Blocks)
• For the example,
      Cache Block = (Main Memory Block) % 256
• Some example mappings:
      0 → 0, 1 → 1, 255 → 255, 256 → 0, 257 → 1, 512 → 0, 513 → 1, etc.

[Figure: Direct mapping – the cache memory holds 256 blocks, each with an associated TAG register; main memory holds blocks 0 to 512K–1; the memory address is split into TAG (11 bits), BLOCK (8 bits) and WORD (5 bits).]

• The MM address is divided into three fields: TAG, BLOCK and WORD.
  – When a new block is loaded into the cache, the 8-bit BLOCK field determines the cache block where it is to be stored.
  – The high-order 11 bits are stored in a TAG register associated with the cache block.
  – When accessing a memory word, the corresponding TAG fields are compared.
    • Match implies HIT.
• The block replacement algorithm is trivial, as there is no choice.
• More than one MM block is mapped onto the same cache block.
  – May lead to contention even if the cache is not full.
  – The new block will replace the old block.
  – May lead to poor performance if both the blocks are frequently used.
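A small sketch (Python, not from the slides) of how the 24-bit address of the running example is split for direct mapping; the field widths follow the TAG / BLOCK / WORD layout shown above.

```python
# Sketch: direct-mapped address decomposition for the example cache
# (256 blocks of 32 words; TAG = 11 bits, BLOCK = 8 bits, WORD = 5 bits).
def direct_map_fields(addr):
    word  = addr & 0x1F            # lower 5 bits: word within the block
    block = (addr >> 5) & 0xFF     # next 8 bits: cache block index
    tag   = addr >> 13             # upper 11 bits: value kept in the TAG register
    return tag, block, word

addr = 257 * 32                    # first word of main memory block 257
print(direct_map_fields(addr))     # (1, 1, 0): block 257 maps to cache block 257 % 256 = 1
```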

(b) Associative Mapping
• Here, a MM block can potentially reside in any cache block position.
• The memory address is divided into two fields: TAG and WORD.
  – When a block is loaded into the cache from MM, the higher-order 19 bits of the address are stored into the TAG register corresponding to the cache block.
  – When accessing memory, the 19-bit TAG field of the address is compared with all the TAG registers corresponding to all the cache blocks.
• Requires associative memory for storing the TAG values.
  – High cost / lack of scalability.
• Because of complete freedom in block positioning, a wide range of replacement algorithms is possible.

[Figure: Fully associative mapping – the cache memory holds 256 blocks, each with a TAG register; main memory holds blocks 0 to 512K–1; the memory address is split into TAG (19 bits) and WORD (5 bits).]


(c) N-way Set Associative Mapping
• A group of N consecutive blocks in the cache is called a set.
• This algorithm is a balance of direct mapping and associative mapping.
  – Like direct mapping, a MM block is mapped to a set:
        Set Number = (MM Block Number) % (Number of Sets in Cache)
  – The block can be placed anywhere within the set (there are N choices).
• The value of N is a design parameter:
  – N = 1 :: same as direct mapping.
  – N = number of cache blocks :: same as associative mapping.
  – Typical values of N used in practice are 2, 4 or 8.

[Figure: 4-way set associative mapping – the 256 cache blocks are organized as 64 sets (Set 0 to Set 63) of 4 blocks each; the memory address is split into TAG (13 bits), SET (6 bits) and WORD (5 bits).]
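The following sketch (Python, not from the slides) applies the set-mapping rule above to the 4-way example (64 sets; TAG = 13 bits, SET = 6 bits, WORD = 5 bits):

```python
# Sketch: N-way set-associative address decomposition for the 4-way example above.
def set_assoc_fields(addr, n_sets=64, words_per_block=32):
    word = addr % words_per_block
    mm_block = addr // words_per_block
    set_no = mm_block % n_sets        # Set Number = (MM Block Number) % (Number of Sets)
    tag = mm_block // n_sets          # remaining high-order bits of the block number
    return tag, set_no, word

# Main memory blocks 0, 64 and 128 all map to set 0 (any of its 4 ways).
for b in (0, 64, 128):
    print(b, "->", set_assoc_fields(b * 32))
```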

• Illustration for N = 4:
  – Number of sets in cache memory = 64.
  – Memory blocks are mapped to a set using a modulo-64 operation.
  – Example: MM blocks 0, 64, 128, etc. all map to set 0, where they can occupy any of the four available positions.
• The MM address is divided into three fields: TAG, SET and WORD.
  – The TAG field of the address must be associatively compared to the TAG fields of the 4 blocks of the selected set.
  – Thus, instead of requiring a single large associative memory, we need a number of very small associative memories, only one of which will be used at a time.

Q2. How is a block found if present in cache?
• Caches include a TAG associated with each cache block.
  – The TAG of every cache block where the block being requested may be present needs to be compared with the TAG field of the MM address.
  – All the possible tags are compared in parallel, as speed is important.
• Mapping algorithms?
  – Direct mapping requires a single comparison.
  – Associative mapping requires a full associative search over all the TAGs corresponding to all cache blocks.
  – Set associative mapping requires a limited associative search over the TAGs of only the selected set.

• Use of valid bit:
  – There must be a way to know whether a cache block contains valid or garbage information.
  – A valid bit can be added to the TAG, which indicates whether the block contains valid data.
  – If the valid bit is not set, there is no need to match the corresponding TAG.

Q3. Which block should be replaced on a cache miss?
• With fully associative or set associative mapping, there can be several blocks to choose from for replacement when a miss occurs.
• Two primary strategies are used:
  a) Random: The candidate block is selected randomly for replacement. This simple strategy tends to spread allocation uniformly.
  b) Least Recently Used (LRU): The block replaced is the one that has not been used for the longest period of time.
     • Makes use of a corollary of temporal locality:
       "If recently used blocks are likely to be used again, then the best candidate for replacement is the least recently used block."


• To implement the LRU algorithm, the cache controller must track the LRU block as the computation proceeds.
• Example: Consider a 4-way set associative cache.
  – For tracking the LRU block within a set, we use a 2-bit counter with every block.
  – When a hit occurs:
    • The counter of the referenced block is reset to 0.
    • Counters with values originally lower than the referenced one are incremented by 1, and all others remain unchanged.
  – When a miss occurs:
    • If the set is not full, the counter associated with the newly loaded block is set to 0, and all other counters are incremented by 1.
    • If the set is full, the block with counter value 3 is removed, the new block is put in its place, and its counter is set to 0. The other three counters are incremented by 1.
• It may be verified that the counter values of occupied blocks are all distinct.
• An example (counter values of Blocks 0-3 of one set after each access; x = empty):

      Access           Block 0   Block 1   Block 2   Block 3
      Initial             x         x         x         x
      Miss: Block 2       x         x         0         x
      Miss: Block 0       0         x         1         x
      Miss: Block 3       1         x         2         0
      Miss: Block 1       2         0         3         1
      Hit:  Block 0       0         1         3         2
      Miss: Block 2       1         2         0         3
      Hit:  Block 3       2         3         1         0
      Hit:  Block 2       2         3         0         1
      Hit:  Block 0       0         3         1         2
      Miss: Block 1       1         0         2         3
      Hit:  Block 1       1         0         2         3
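A short simulation of the 2-bit-counter scheme described above (Python, not from the slides). It tracks which memory block occupies each of the 4 ways of one set, whereas the table above labels the ways themselves; the reference sequence is made up.

```python
# Sketch: 2-bit-counter LRU for one 4-way set.
# tags[i] = memory block held in way i (None = empty); ctr[i] = its counter (0 = MRU, 3 = LRU).
def access(tags, ctr, block):
    if block in tags:                                         # hit
        way = tags.index(block)
        for i in range(4):                                    # bump entries more recent than it
            if ctr[i] is not None and ctr[i] < ctr[way]:
                ctr[i] += 1
        ctr[way] = 0
        return "hit"
    way = tags.index(None) if None in tags else ctr.index(3)  # free way, else evict the LRU block
    for i in range(4):
        if ctr[i] is not None and i != way:
            ctr[i] += 1
    tags[way], ctr[way] = block, 0
    return "miss"

tags, ctr = [None] * 4, [None] * 4
for b in [2, 0, 3, 1, 0, 2, 7, 3]:                            # made-up reference sequence
    print(b, access(tags, ctr, b), tags, ctr)
```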

Q4. What happens on a write?
• To be discussed next.

END OF LECTURE 30

Lecture 31: CACHE MEMORY (PART 2)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Types of Cache Misses
1. Compulsory Miss
   – On the first access to a block, the block must be brought into the cache.
   – Also known as cold start misses, or first reference misses.
   – Can be reduced by increasing the cache block size or by prefetching cache blocks.
2. Capacity Miss
   – Blocks may be replaced from the cache because the cache cannot hold all the blocks needed by a program.
   – Can be reduced by increasing the total cache size.


Q4. What happens on a write?


3. Conflict Miss • StaHsHcal data suggests that read operaHons (including instrucHon fetches)
– In case of direct mapping or N-way set associaHve mapping, several blocks dominate processor cache accesses.
may be mapped to the same block or set in the cache. – All instrucHon fetch operaHons are read.
– May result in block replacements and hence access misses, even though all – Most instrucHons do not write to memory.
the cache blocks may not be occupied.. • Making the common case fast:
– Can be reduced by increasing the value of N (cache associaHvity). – OpHmize cache accesses for reads.
– But Amadahl’s law reminds that for high performance designs we cannot ignore
the speed of write operaHons.

61 2 62 2

• The common case (read operations) is relatively easy to make faster.
  – A block can be read at the same time while the TAG is being compared with the block address.
  – If the read is a HIT, the data can be passed to the CPU; if it is a MISS, ignore it.
• Problems with write operations:
  – The CPU specifies the size of the write (between 1 and 8 bytes), and only that portion of a block has to be changed.
    • Implies a read-modify-write sequence of operations on the block.
    • Also, the process of modifying the block cannot begin until the TAG is checked to see if it is a hit.
  – Thus, cache write operations take more time than cache read operations.

Cache Write Strategies
• Cache designs can be classified based on the write and memory update strategy being used.
  1. Write Through / Store Through
  2. Write Back / Copy Back
  [Figure: CPU – Cache Memory – Main Memory]

(a) Write Through Strategy
• Information is written to both the cache block and the main memory block.
• Features:
  – Easier to implement.
  – Read misses do not result in writes to the lower level (i.e. MM).
  – The lower level (i.e. MM) has the most updated version of the data – important for I/O operations and multiprocessor systems.
  – A write buffer is often used to reduce the CPU write stall time while data is written to main memory.
  [Figure: CPU – Cache Memory – Main Memory, with a Write Buffer between the cache and main memory]

• Perfect Write Buffer:
  – All writes are handled by the write buffer; no stalling for write operations.
  – For a unified L1 cache:
        Stall Cycles / Memory Access = % Reads x (1 – HL1) x tMM
• Realistic Write Buffer:
  – A percentage of write stalls are not eliminated when the write buffer is full.
  – For a unified L1 cache:
        Stall Cycles / Memory Access = (% Reads x (1 – HL1) + % write stalls not eliminated) x tMM


(b) Write Back Strategy
• Information is written only to the cache block.
• A modified cache block is written to MM only when it is replaced.
• Features:
  – Writes occur at the speed of the cache memory.
  – Multiple writes to a cache block require only one write to MM.
  – Uses less memory bandwidth, which makes it attractive for multiprocessors.
• Write-back cache blocks can be clean or dirty.
  – A status bit called the dirty bit or modified bit is associated with each cache block, which indicates whether the block was modified in the cache (0: clean, 1: dirty).
  – If the status is clean, the block is not written back to MM while being replaced.
  [Figure: CPU – Cache Memory – Main Memory; several writes go to a block in the cache, and a single write goes to MM when the block is replaced]

Cache Write Miss Policy
• Since the information is usually not needed immediately on a write miss, two options are possible on a cache write miss:
  a) Write Allocate
     • The missed block is loaded into the cache on a write miss, followed by the write hit actions.
     • Requires a cache block to be allocated for the block to be written into.
  b) No-Write Allocate
     • The block is modified only in the lower level (i.e. MM), and not loaded into the cache.
     • A cache block is not allocated for the block to be written into.
• Typical usage:
  a) Write-back cache with write-allocate
     • In order to capture subsequent writes to the block in the cache.
  b) Write-through cache with no-write-allocate
     • Since subsequent writes still have to go to MM.

Estimation of Miss Penalties
• Write-Through Cache
  – Write Hit Operation:
    • Without a write buffer, miss penalty = tMM
    • With a perfect write buffer, miss penalty = 0
• Write-Back Cache (with Write Allocate)
  – Write Hit Operation:
    • Miss penalty = 0
  – Read or Write Miss Operation:
    • If the replaced block is clean, miss penalty = tMM
      – No need to write the block back to MM.
      – New block to be brought in from MM (tMM).
    • If the replaced block is dirty, miss penalty = 2 tMM
      – Write the block being replaced back to MM (tMM).
      – New block to be brought in from MM (tMM).
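A sketch (Python, not from the slides) that turns the clean / dirty miss penalties above into an average number of stall cycles per access for a write-back, write-allocate cache; the parameter values are assumptions, chosen to match Example 2 later in this lecture.

```python
# Sketch: average stall cycles per access for a write-back, write-allocate cache.
def wb_stalls_per_access(miss_rate, t_mm, dirty_fraction):
    # Clean victim: one MM access to fetch the new block; dirty victim: write-back + fetch.
    avg_miss_penalty = (1 - dirty_fraction) * t_mm + dirty_fraction * 2 * t_mm
    return miss_rate * avg_miss_penalty

print(wb_stalls_per_access(miss_rate=0.015, t_mm=50, dirty_fraction=0.10))   # 0.825 cycles
```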


Choice of Block Size in Cache
• Larger block sizes reduce compulsory misses.
• Larger block sizes also reduce the number of blocks in the cache, increasing conflict misses.
• Typical block size: 16 to 32 bytes.

Instruction-only and Data-only Caches
• Caches are sometimes divided into instruction-only and data-only caches.
  – The CPU knows whether it is issuing an instruction address or a data address.
  – There are two separate ports, thereby doubling the bandwidth between the CPU and the cache.
  – Typical L1 caches are separated into L1 i-cache and L1 d-cache.
• Separate caches also offer the opportunity of optimizing each cache separately.
  – Instruction and data reference patterns are different.
  – Different capacities, block sizes, and associativity (i.e. N).

[Figure: CPU – L1 i-cache and L1 d-cache (M1) – L2 cache (M2) – L3 cache (M3) – Main Memory (M4). For the Intel Core-i7 Sandybridge: M1 and M2 are within the core, M3 is within the chip, M4 is outside the chip.]

Example 1
• Consider a CPU with average CPI of 1.1.
  – Assume an instruction mix: ALU – 50%, LOAD – 15%, STORE – 15%, BRANCH – 20%
  – Assume a cache miss rate of 1.5%, and a miss penalty of 50 cycles (= tMM).
  – Calculate the effective CPI for a unified L1 cache, using write through and no write allocate, with:
    a) No write buffer
    b) Perfect write buffer
    c) Realistic write buffer that eliminates 85% of write stalls.

  Number of memory accesses per instruction = 1 + 15% + 15% = 1.3
  % Reads = (1 + 0.15) / 1.3 = 88.5%        % Writes = 0.15 / 1.3 = 11.5%

• Solution:
  a) With no write buffer (i.e. stall on all writes)
     • Memory stalls / instr. = 1.3 x 50 x (88.5% x 1.5% + 11.5%) = 8.33 cycles
     • CPI = CPIavg + Memory stalls / instr. = 1.1 + 8.33 = 9.43
  b) With perfect write buffer (i.e. all write stalls are eliminated)
     • Memory stalls / instr. = 1.3 x 50 x (88.5% x 1.5%) = 0.86 cycles
     • CPI = 1.1 + 0.86 = 1.96
  c) With realistic write buffer (85% of write stalls are eliminated)
     • Memory stalls / instr. = 1.3 x 50 x (88.5% x 1.5% + 15% x 11.5%) = 1.98 cycles
     • CPI = 1.1 + 1.98 = 3.08

Example 2
• Consider a CPU with average CPI of 1.1.
  – Assume the instruction mix: ALU – 50%, LOAD – 15%, STORE – 15%, BRANCH – 20%
  – Assume a cache miss rate of 1.5%, and a miss penalty of 50 cycles (= tMM).
  – Calculate the effective CPI for a unified L1 cache, using write back and write allocate, with the probability of a cache block being dirty being 10%.

  Number of memory accesses per instruction = 1 + 15% + 15% = 1.3


• Solution:
  – Memory accesses per instruction = 1.3
  – Stalls / access = (1 – HL1) x (tMM x % clean + 2tMM x % dirty)
                    = 1.5% x (50 x 90% + 100 x 10%) = 0.825 cycles
  – Average memory access time = 1 + stalls / access = 1 + 0.825 = 1.825 cycles
  – Memory stalls / instr. = 1.3 x 0.825 = 1.07 cycles
  – Thus, effective CPI = 1.1 + 1.07 = 2.17
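The two example calculations can be reproduced with a short Python check (not part of the slides). The slides round the intermediate stall counts, so case (a) comes out as 9.43 there and 9.44 here.

```python
# Sketch verifying the effective-CPI figures of Examples 1 and 2 above.
base_cpi, acc_per_instr, miss_rate, t_mm = 1.1, 1.3, 0.015, 50
reads, writes = 0.885, 0.115                    # fractions of memory accesses

# Example 1: write-through, no-write-allocate, with different write-buffer assumptions.
for name, write_stall_frac in [("no buffer", 1.0), ("perfect buffer", 0.0), ("realistic buffer", 0.15)]:
    stalls = acc_per_instr * t_mm * (reads * miss_rate + write_stall_frac * writes)
    print(name, round(base_cpi + stalls, 2))    # ≈ 9.44, 1.96, 3.08

# Example 2: write-back, write-allocate, 10% of replaced blocks are dirty.
stalls = acc_per_instr * miss_rate * (0.9 * t_mm + 0.1 * 2 * t_mm)
print("write back", round(base_cpi + stalls, 2))   # 2.17
```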

END OF LECTURE 31

Lecture 32: IMPROVING CACHE PERFORMANCE
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA



Introduction
• We shall discuss various techniques using which the performance of cache memory can be improved.
• We consider the following expression for the average memory access time (AMAT):
      AMAT = Hit time + Miss rate x Miss penalty
• When we talk about improving the performance of cache memory systems, we can try to reduce one or more of the three parameters: Hit time, Miss rate, Miss penalty.

Basic Cache Optimization Techniques
• We can categorize the techniques into three categories based on the parameter that is being optimized:
  – Reducing the miss rate: we can use larger block size, larger cache size, and higher associativity.
  – Reducing the miss penalty: we can use multi-level caches and give priority to reads over writes.
  – Reducing the cache hit time: we can avoid the address translation when indexing the cache.

(a) Use Larger Block Size
• Increasing the block size helps in reducing the miss rate.
  – See the plot on the next slide.
• Larger blocks also reduce compulsory misses.
  – Since larger blocks can take better advantage of spatial locality.
• Drawbacks:
  – The miss penalty increases, as it is required to transfer larger blocks.
  – Since the number of cache blocks decreases, the number of conflict misses and even capacity misses can increase.
  – The overheads may outweigh the gain.

• Selection of block size:
  – The optimal selection of the block size depends on both the latency and the bandwidth of the lower-level memory.
  – High latency and high bandwidth
    • Encourages large block size, since the cache gets many more bytes per miss for a nominal increase in miss penalty.
  – Low latency and low bandwidth
    • Encourages smaller block sizes, since more time is required to transfer larger blocks.
    • A larger number of smaller blocks may also reduce conflict misses.

  [Plot: miss rate as a function of block size – from Hennessy & Patterson, "Computer Architecture: A Quantitative Approach" (4/e)]


(b) Use Larger Cache Memory
• Increasing the size of the cache is a straightforward way to reduce the capacity misses.
• Drawbacks:
  – Increases the hit time, since the number of TAGs to be searched in parallel will possibly be larger.
  – Results in higher cost and power consumption.
• Traditionally popular for off-chip caches.

  [Figure from Hennessy & Patterson, "Computer Architecture: A Quantitative Approach" (4/e)]

(c) Use Higher Associativity
• For an N-way associative cache, the miss rate reduces as we increase N.
  – Reduces conflict misses, as there are more choices to place a block in the cache.
• General rule of thumb:
  – An 8-way set associative cache is as effective as a fully associative one for practical scenarios.
  – A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2.
• Drawbacks:
  – Increases the hit time, as we have to search a larger associative memory.
  – Increases power consumption due to the higher complexity of associative memory.

(d) Use Multi-level Caches
• Here we try to reduce the miss penalty, and not the miss rate.
• The performance gap between processors and memory increases with time.
  – Use a faster cache to keep pace with the speed of the processor?
  – Make the cache larger to bridge the widening gap between processor and MM?
• We can use both in a multi-level cache system:
  – The L1 cache can be small enough to match the clock cycle time of the fast processor.
  – The L2 cache can be large enough to capture many accesses that would go to MM, thereby reducing the miss penalty.

• Consider a 2-level cache system, consisting of an L1 cache and an L2 cache.
• The average memory access time can be computed as:
      AMAT = HitTimeL1 + MissRateL1 x MissPenaltyL1
  where MissPenaltyL1 = HitTimeL2 + MissRateL2 x MissPenaltyL2
• Thus,
      AMAT = HitTimeL1 + MissRateL1 x (HitTimeL2 + MissRateL2 x MissPenaltyL2)
• The second-level miss rate MissRateL2 is measured on the leftovers from the first-level cache.

• We define the following for a 2-level cache system:
  – Local Miss Rate
    • This is defined as the number of misses in a cache divided by the total number of accesses to this cache.
    • For the first level, this is MissRateL1.
    • For the second level, this is MissRateL2.
  – Global Miss Rate
    • This is defined as the number of misses in a cache divided by the total number of memory accesses generated by the processor.
    • For the first level, this is MissRateL1.
    • For the second level, this is MissRateL1 x MissRateL2.


• The local miss rate is large for the L2 cache because the L1 cache takes out a major fraction of the total memory accesses.
• For this purpose, the global miss rate is a more useful measure.
  – Fraction of memory accesses generated by the processor that goes all the way to main memory.
• A useful measure:
      Average Memory Stalls per Instr. = Misses-per-instrL1 x HitTimeL2 + Misses-per-instrL2 x MissPenaltyL2

Example 1
• Suppose that in 1000 memory references there are 60 misses in the L1-cache and 15 misses in the L2-cache. What are the various miss rates?
  Assume that MissPenaltyL2 is 180 clock cycles, HitTimeL1 is 1 clock cycle, and HitTimeL2 is 12 clock cycles.
  What will be the average memory access time? Ignore the impact of writes.

• Solution:
  – MissRateL1 = 60 / 1000 = 6% (both local and global)
  – LocalMissRateL2 = 15 / 60 = 25%
  – GlobalMissRateL2 = 15 / 1000 = 1.5%

    AMAT = HitTimeL1 + MissRateL1 x (HitTimeL2 + MissRateL2 x MissPenaltyL2)
         = 1 + 6% x (12 + 25% x 180)
         = 1 + 6% x 57 = 4.42 clock cycles

• Multi-level inclusion versus multi-level exclusion
  – Multi-level inclusion requires that L1 data are always present in L2.
    • Desirable because consistency between I/O and caches can be determined just by checking the L2 cache.
  – Multi-level exclusion requires that L1 data is never found in L2.
    • Typically, a cache miss in L1 results in a swap of blocks between L1 and L2 rather than a replacement of an L1 block with an L2 block.
    • This policy prevents wasting space in the L2 cache.
    • May make sense if the designer can only afford an L2 cache that is slightly bigger than the L1 cache.
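A short Python check of the Example 1 figures above (not part of the slides):

```python
# Sketch: local / global miss rates and AMAT for the 2-level cache of Example 1.
refs, l1_misses, l2_misses = 1000, 60, 15
hit_time_l1, hit_time_l2, miss_penalty_l2 = 1, 12, 180

miss_rate_l1        = l1_misses / refs         # 0.06  (local and global)
local_miss_rate_l2  = l2_misses / l1_misses    # 0.25
global_miss_rate_l2 = l2_misses / refs         # 0.015

amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2)
print(miss_rate_l1, local_miss_rate_l2, global_miss_rate_l2, round(amat, 2))   # ... 4.42
```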

(e) Giving Priority to Read Misses Over Writes
• The presence of write buffers can complicate memory accesses.
  – The buffer may be holding the updated value of a location needed on a read miss.
• The simplest solution is to make the read miss wait until the write buffer is empty.
  – As an alternative, check the contents of the write buffer for any conflict; if there is none, the read miss can continue → reduces the read miss penalty.
  – Most desktops and servers follow this approach, giving priority to reads over writes.

Example 2
• Consider the code sequence:
      SW $t1, 512($zero)
      LW $t2, 1024($zero)
      LW $t3, 512($zero)
• Assume a direct-mapped write-through cache that maps both the words at addresses 512 and 1024 to the same block, and a 4-word write buffer that is not checked on a read miss. Will the values of $t1 and $t3 always be equal?
  – The data in $t1 is stored in the write buffer after the SW.
  – Without proper precautions, the second LW may load the wrong value, and thus $t1 and $t3 may be unequal.


(f) Avoiding Address Translation during Cache Indexing
• Even a small and simple cache must cope with the translation of a virtual address to a physical address to access memory.
• An idea to make the common case fast:
  – We use virtual addresses for the cache, since hits are much more common than misses.
  – Such caches are termed virtual caches.
• Drawbacks:
  – Page-level protection is not possible.
  – Context switching and I/O (which use physical addresses) further complicate the design.

Some Additional Cache Optimizations
1. Use small and simple first-level caches to reduce hit time
2. Way prediction to reduce hit time
3. Pipelined cache access to increase cache bandwidth
4. Multi-banked caches to increase cache bandwidth
5. Critical Word First and Early Restart to reduce miss penalty
6. Compiler optimizations to reduce miss rate
7. Prefetching of instructions and data to reduce miss penalty or miss rate

END OF LECTURE 32
