Lecture 13- Introduction to Cache
Cache
Rose Gomar
Department of Systems and Computer Engineering
Textbook/Copyright
• Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 6th edition, 2017, Chapter 2.
• Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 6th edition, 2017, Appendix B.
• Patterson, David A., and John L. Hennessy. Computer Organization and Design: RISC-V Edition. Chapter 5.
• Part of the slides are provided by Elsevier (Copyright © 2019, Elsevier
Inc. All rights reserved)
What do we learn in this lecture?
• Motivation for cache
• Direct-mapped caches
Principle of Locality
• The principle of locality is applied in memory system design
• Programs access a small proportion of their address space at
any time
• Temporal locality
• Items accessed recently are likely to be accessed
again soon
• E.g., instructions in a loop, induction variables
• Spatial locality
• Items near those accessed recently are likely to be
accessed soon
• E.g., sequential instruction access, array data
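The two kinds of locality can be seen in ordinary array code. The sketch below (an illustrative example, not from the slides) compares row-major and column-major traversal of the same matrix: row-major order visits adjacent elements in memory one after another (spatial locality), while column-major order jumps a full row between accesses.

```python
# Sketch: same computation, different access patterns.
# Python lists stand in for a row-major array in memory.
N = 4
matrix = [[r * N + c for c in range(N)] for r in range(N)]

# Row-major order: consecutive addresses -> good spatial locality,
# since one cache block holds several adjacent elements.
row_major_order = [(r, c) for r in range(N) for c in range(N)]
# Column-major order: each access jumps by a whole row -> poor locality.
col_major_order = [(r, c) for c in range(N) for r in range(N)]

row_sum = sum(matrix[r][c] for r, c in row_major_order)
col_sum = sum(matrix[r][c] for r, c in col_major_order)
assert row_sum == col_sum  # same result; only the access pattern differs
```

On real hardware the two orders give the same answer but can differ widely in cache miss rate, which is exactly the effect this lecture quantifies.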
Principle of Locality
[Figure: memory address vs. time, showing accesses clustered in small regions of the address space; from Donald J. Hatfield and Jeanette Gerald, "Program Restructuring for Virtual Memory," IBM Systems Journal 10(3): 168-192 (1971)]
Taking Advantage of Locality
• Memory hierarchy
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to
smaller DRAM memory
• Main memory
• Copy more recently accessed (and nearby) items from
DRAM to smaller SRAM memory
• Cache memory attached to CPU
Memory Hierarchy Levels
• Block: unit of copying
• May be multiple words
• #Blocks is a power of 2
• Use low-order address bits to index the cache
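Because the number of blocks is a power of 2, taking the block address modulo the number of blocks is the same as keeping its low-order bits. A minimal sketch (parameters chosen for illustration):

```python
# Direct-mapped placement: index = low-order bits of the block address.
NUM_BLOCKS = 8                              # must be a power of 2
INDEX_BITS = NUM_BLOCKS.bit_length() - 1    # log2(8) = 3 index bits

def cache_index(block_address):
    # Masking with NUM_BLOCKS - 1 keeps the low-order INDEX_BITS,
    # equivalent to block_address % NUM_BLOCKS.
    return block_address & (NUM_BLOCKS - 1)

# Block addresses 1, 9, and 17 all map to the same line of an
# 8-block cache, so they conflict with each other.
assert INDEX_BITS == 3
assert cache_index(1) == cache_index(9) == cache_index(17) == 1
```

Blocks that share the same low-order bits compete for one cache line; the tag (next slide) distinguishes which of them is currently stored there.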
Tags and Valid Bits
• How do we know which particular block is stored in a cache location?
• Store block address as well as the data
• Actually, only need the high-order bits
• Called the tag
• What if there is no data in a location?
• Valid bit: 1 = present, 0 = not present; initially 0
Assume:
• 64-bit address
• A direct-mapped cache
• Cache size = 2^n blocks
• Block size = 2^m words (2^(m+2) bytes), assuming word size = 4 bytes
Then:
• Tag size = 64 − (n + m + 2)
• Total number of bits in a direct-mapped cache
= 2^n × (data bits + tag bits + valid bit)
= 2^n × (2^m × 32 + (64 − n − m − 2) + 1)
• On a write miss with write-back, usually fetch the block into the cache before writing (write allocate)
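The sizing formulas above can be checked numerically. The sketch below uses an assumed example configuration (2^10 blocks of 4 words each with 64-bit addresses) to evaluate the tag size and total storage:

```python
def direct_mapped_cache_bits(n, m, addr_bits=64, word_bits=32):
    """Total storage bits for a direct-mapped cache with 2^n blocks of
    2^m words (4-byte words), including tag and valid bit per block."""
    tag_bits = addr_bits - (n + m + 2)   # 2 low bits = byte offset in a word
    data_bits = (2 ** m) * word_bits     # data bits per block
    total = (2 ** n) * (data_bits + tag_bits + 1)  # +1 for the valid bit
    return total, tag_bits

# Assumed example: n = 10 (1024 blocks), m = 2 (4-word blocks):
total, tag = direct_mapped_cache_bits(n=10, m=2)
assert tag == 64 - (10 + 2 + 2)          # 50 tag bits
assert total == 1024 * (128 + 50 + 1)    # 183,296 bits for 16 KiB of data
```

Note the overhead: the cache stores 16 KiB (131,072 bits) of data but needs 183,296 bits in total, because every block carries its tag and valid bit alongside the data.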
Example: Intrinsity FastMATH
• Embedded MIPS processor
• 12-stage pipeline
• Instruction and data access on each cycle
• For writes, the Intrinsity FastMATH offers both write-through and write-back,
leaving it up to the operating system to decide which strategy to use for an
application.
• Performance
• Instruction miss rate: 0.4%
• Data miss rate: 11.4%
• Combined miss rate: 3.2%
Measuring Cache Performance
• Components of CPU time
• Program execution cycles
• Includes cache hit time
• Memory stall cycles
• Mainly from cache misses
• With simplifying assumptions:
Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
= (Instructions / Program) × (Misses / Instruction) × Miss penalty
• Exercise: calculate miss cycles for the I-cache and D-cache and find the actual CPI.
Example: Cache Performance
• Given
• I-cache miss rate = 2%
• D-cache miss rate = 4%
• Miss penalty = 100 cycles
• Base CPI (ideal cache) = 2
• Loads & stores are 36% of instructions
• Miss cycles per instruction
• I-cache: 0.02 × 100 = 2 (2% of instructions with 100 cycles miss
penalty)
• D-cache: 0.36 × 0.04 × 100 = 1.44 (36% of the instructions are
load/store with miss rate of 4% and miss penalty of 100 cycles)
• Actual CPI = 2 + 2 + 1.44 = 5.44
• CPI for a perfect cache =2
• What happens if the processor is made faster but the memory
system is not?
Discussion: Cache Performance
• In the previous example, assume the base CPI is improved to 1. How does the performance compare against the previous example?
• Which portion of the execution time is spent on memory stalls?
• Previous example (base CPI = 2): CPI = 2 + 3.44 = 5.44
• Execution time spent on memory stalls = 3.44 / 5.44 = 63%
• Improved base CPI = 1: CPI = 1 + 3.44 = 4.44
• CPI with stalls / CPI with perfect cache = 4.44 / 1 = 4.44
• Execution time spent on memory stalls = 3.44 / 4.44 = 77%
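The point of the discussion is that the stall cycles stay fixed while the base CPI shrinks, so stalls eat a growing share of execution time. A quick numeric check:

```python
# Stall cycles per instruction are unchanged by the faster pipeline:
stalls = 2 + 1.44            # I-cache + D-cache stall cycles/instruction

cpi_old = 2 + stalls         # base CPI = 2 -> 5.44
cpi_new = 1 + stalls         # base CPI = 1 -> 4.44

# Fraction of execution time lost to memory stalls in each case:
assert round(stalls / cpi_old, 2) == 0.63   # 63% with base CPI = 2
assert round(stalls / cpi_new, 2) == 0.77   # 77% with base CPI = 1
```

Halving the base CPI improves overall CPI only from 5.44 to 4.44 (about 1.23x), because the memory system now dominates; this is Amdahl's law at work.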
Average Access Time
• Hit time is also important for performance
• Average memory access time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty
• Example
• CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache
miss rate = 5%
• AMAT = 1 + 0.05 × 20 = 2ns
• 2 cycles per memory access (with a 1 ns clock)
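The AMAT formula maps directly to a one-line function; this sketch reproduces the example above (1-cycle hit time, 5% miss rate, 20-cycle miss penalty):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time: every access pays the hit time,
    # and the fraction that miss also pays the miss penalty.
    return hit_time + miss_rate * miss_penalty

cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
assert abs(cycles - 2.0) < 1e-9   # 2 cycles = 2 ns with a 1 ns clock
```

Note that AMAT counts both hit time and miss penalty, so reducing hit time matters even when the miss rate is already low.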
Performance Summary
• When CPU performance increases
• Miss penalty becomes more significant