Lec-8 Memory-3 CompArch

This document summarizes a lecture on memory caches. It reviews direct mapped caches and discusses how they work. It covers the size of direct mapped caches and how the number of bits required depends on the cache size, address size, block size, and word size. Examples are provided to show how to calculate the number of bits required for different cache configurations. The document also discusses mapping memory addresses to cache blocks and the steps involved in cache read and write hits and misses.


Lecture-8

Memory
Part-3
CSE-2823
Computer Architecture
Dr. Md. Waliur Rahman Miah
Associate Professor, CSE, DUET
Today’s Topic

Memory
Part-3

Ref:
Hennessy-Patterson 5e-Ch-5; 4e-Ch-7
Stallings 8e-Ch-4-5-6

Dr. Md. Waliur Rahman Miah Dept of CSE, DUET 2


Direct Mapped Cache (review)

• Addressing scheme in a direct mapped cache:

– cache block address = memory block address modulo (number of blocks in the cache) – a unique location for each memory block
– if the cache has 2^m blocks, the cache index = the lower m bits of the n-bit memory block address
– the remaining upper n − m bits are kept as tag bits at each cache block
– a valid bit is also needed to recognize a valid entry
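As a concrete sketch of this addressing scheme (an illustration, not code from the lecture), the following splits a 32-bit byte address for a cache with 1024 one-word blocks, matching the figure on the next slide; the field widths are assumptions for that configuration:

```python
# Split a 32-bit byte address into tag, index, and byte offset for a
# direct mapped cache with 1024 one-word (4-byte) blocks:
# 2 offset bits, 10 index bits, 20 tag bits.

OFFSET_BITS = 2   # 4 bytes per one-word block
INDEX_BITS = 10   # 2^10 = 1024 blocks

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # byte within the word
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # cache block index
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # upper bits kept as tag
    return tag, index, offset

print(split_address(0x1234))  # (1, 141, 0)
```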



Direct Mapped Cache (review)
[Figure: 32-bit address (bit positions 31–0) split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the index selects one of 1024 entries (valid bit, tag, data); comparing the stored tag with the address tag produces the hit signal, and the data output is 32 bits]

Cache with 1024 one-word blocks: the byte offset (the 2 least significant bits) is ignored and the next 10 bits are used to index into the cache.



Size of Direct Mapped Cache-1
• The total number of bits needed for a cache is a function of:
• the cache size and
• the address size,
because the cache includes storage for both
• the data and the tags
• For the following situation:
– 32-bit addresses
– A direct-mapped cache
– The cache size is 2^n blocks, so n bits are used for the index
– The block size is 2^m words (2^(m+2) bytes), so
• m bits are used for the word within the block, and
• two bits are used for the byte part of the address
Size of Direct Mapped Cache-2
• … Continued from previous slide
• The size of the tag field is:
32 − (n + m + 2)
• The total number of bits in a direct-mapped cache is:
2^n × (block size + tag size + valid field size)
• Since the block size is 2^m words (2^m × 32 bits), and we need 1 bit for the valid field, the number of bits in such a cache is:
= 2^n × ((2^m × 32) + (32 − (n + m + 2)) + 1)
= 2^n × ((2^m × 32) + 31 − n − m).
• Although this is the actual size in bits, the naming convention is to exclude the size of the tag and valid bit field and to count only the size of the data.
• [Thus, the cache in Figure 5.10 of the textbook is called a 4 KiB cache]
EXAMPLE (Bits in a Cache)
How many total bits are required for a direct-mapped cache with 16 KiB of data
and 4-word blocks, assuming a 32-bit address?
Soln.:
1 word = 32 bits = 4 × 8 bits = 4 bytes;
16 KiB = 16/4 Ki words = 4 Ki words = 4 × 2^10 words = 2^12 words
With a block size of 4 words (2^m = 2^2), there are 1024 (2^n = 2^10) blocks.
Thus n = 10, m = 2.
Each block has 4 × 32 = 128 bits of data, plus a tag of
32 − (n + m + 2) = 32 − 10 − 2 − 2 = 18 bits, plus a valid bit.
Thus, the total cache size is:
= 2^n × ((2^m × 32) + (32 − (n + m + 2)) + 1)
= 2^10 × (4 × 32 + (32 − 10 − 2 − 2) + 1) = 2^10 × 147 = 147 Kibits = 147/8 KiB
≈ 18.4 KiB for a 16 KiB cache.
For this cache, the total number of bits is about 1.15 times as many as needed
just for the storage of the data.
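The arithmetic above can be double-checked with a short script (a sketch; the helper name `cache_total_bits` is ours, not the textbook's):

```python
# Total bits in a direct-mapped cache using the formula from the previous
# slide: 2^n blocks, 2^m words per block, 32-bit addresses, 1 valid bit.

def cache_total_bits(n, m, addr_bits=32):
    data_bits = (2 ** m) * 32             # 2^m words of 32 bits each
    tag_bits = addr_bits - (n + m + 2)    # remaining upper address bits
    return (2 ** n) * (data_bits + tag_bits + 1)

total = cache_total_bits(10, 2)   # 16 KiB of data, 4-word blocks
print(total)                      # 150528 bits = 147 Kibits
print(total / (8 * 1024))         # 18.375 KiB, i.e. about 18.4 KiB
```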
Example Problem 1
• How many total bits are required for a direct-mapped cache with 128
KB of data and a 1-word block size, assuming a 32-bit address?

• Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^15 blocks [1-word block size]
• Cache entry size = block data bits + tag bits + valid bit
= 32 + (32 − 15 − 2) + 1 = 48 bits
• Therefore, total cache size
= 2^15 × 48 bits
= 2^15 × (1.5 × 32) bits
= 1.5 × 2^20 bits
= 1.5 Mbits
– data bits in cache = 128 KB × 8 = 1 Mbit
– total cache size / actual cache data = 1.5



Example Problem 2
• How many total bits are required for a direct-mapped cache with 128
KB of data and a 4-word block size, assuming a 32-bit address?

• Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^13 blocks [4-word block size]
• Cache entry size = block data bits + tag bits + valid bit
= 128 + (32 − 13 − 2 − 2) + 1 = 144 bits
• Therefore, total cache size
= 2^13 × 144 bits
= 2^13 × (1.125 × 128) bits
= 1.125 × 2^20 bits
= 1.125 Mbits
– data bits in cache = 128 KB × 8 = 1 Mbit
– total cache size / actual cache data = 1.125
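Both example problems can be checked the same way (a sketch; `entry_bits` is a hypothetical helper). Note that 2^13 × 144 bits works out to 1.125 Mbits:

```python
# Per-entry and total sizes for Example Problems 1 and 2:
# 128 KB of data, 32-bit addresses, direct mapped.

def entry_bits(index_bits, words_per_block, addr_bits=32):
    word_bits = words_per_block.bit_length() - 1   # log2 of words per block
    data = words_per_block * 32
    tag = addr_bits - index_bits - word_bits - 2   # minus index, word, byte bits
    return data + tag + 1                          # plus the valid bit

# Problem 1: 2^15 one-word blocks
print(entry_bits(15, 1), 2**15 * entry_bits(15, 1) / 2**20)   # 48, 1.5 Mbits
# Problem 2: 2^13 four-word blocks
print(entry_bits(13, 4), 2**13 * entry_bits(13, 4) / 2**20)   # 144, 1.125 Mbits
```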



EXAMPLE
(Mapping an Address to a Multiword Cache Block-1)
[Figure: a byte-addressed memory shown alongside the same memory organized into 4-byte blocks (this is main memory organized in blocks, not the cache)]
• the address of the block is:
⌊Byte address / Bytes per block⌋ = ⌊12 / 4⌋ = 3
• this block contains all byte addresses between:
⌊Byte address / Bytes per block⌋ × Bytes per block = 12
and
⌊Byte address / Bytes per block⌋ × Bytes per block + (Bytes per block − 1) = 12 + 3 = 15
EXAMPLE
(Mapping an Address to a Multiword Cache Block-2)
Consider a cache with 64 blocks and a block size of 16 bytes. To what
block number does byte address 1200 map?
Soln:
The block is given by [ref: page 384 of the textbook]:
(Block address) modulo (Number of blocks in the cache)
• where the address of the block is
⌊Byte address / Bytes per block⌋ = ⌊1200 / 16⌋ = 75
• which maps to cache block number (75 modulo 64) = 11.
• this block contains all byte addresses between:
1200 and 1200 + 16 − 1 = 1215
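The same mapping can be computed directly (a minimal sketch of the steps above):

```python
# Map byte address 1200 to a cache block: 64 blocks, 16 bytes per block.

BYTES_PER_BLOCK = 16
NUM_BLOCKS = 64

addr = 1200
block_addr = addr // BYTES_PER_BLOCK         # memory block address: 75
cache_block = block_addr % NUM_BLOCKS        # direct mapped cache block: 11
first = block_addr * BYTES_PER_BLOCK         # first byte in the block: 1200
last = first + BYTES_PER_BLOCK - 1           # last byte in the block: 1215
print(block_addr, cache_block, first, last)  # 75 11 1200 1215
```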



Cache Read Hit/Miss
• Steps to be taken on a cache hit/miss:
• Cache read hit: no action needed
• Instruction cache read miss:
1. Send the original PC value (current PC − 4, as the PC has already
been incremented in the first step of the instruction cycle) to
memory [PC = Program Counter]
2. Instruct main memory to perform a read and wait for memory
to complete the access – stall on read
3. After the read completes, write the cache entry
4. Restart instruction execution at the first step to refetch the
instruction
• Data cache read miss:
– Similar to an instruction cache miss
– To reduce the data miss penalty, allow the processor to execute
instructions while waiting for the read to complete, until the
missed word is actually required – stall on use (why won't this work
for instruction misses?)
Cache Write Hit/Miss
• Write-through scheme
– on write hit: replace the data in both cache and memory with every
write hit, to avoid inconsistency
– on write miss: write the word into both cache and memory –
obviously the missed word's block must first be fetched from memory!
– Write-through is slow because a memory write is always required
• performance is improved with a write buffer, where words are
stored while waiting to be written to memory – the processor can
continue execution until the write buffer is full
• when a word in the write buffer completes writing into main memory,
that buffer slot is freed and becomes available for future writes

• Write-back scheme
– write the data block only into the cache, and write the block
back to main memory only when it is replaced in the cache
– more efficient than write-through, but more complex to
implement
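A toy contrast of the two policies on a write hit (a sketch under simplifying assumptions: dict-backed `cache` and `memory` keyed by block address; misses and the replacement policy are omitted):

```python
# Write-through updates memory on every write; write-back marks the block
# dirty and updates memory only when the block is evicted from the cache.

memory = {0: "old", 1: "old"}
cache = {0: "old", 1: "old"}
dirty = set()   # write-back blocks that main memory does not yet reflect

def write_through(block, value):
    cache[block] = value
    memory[block] = value      # memory is updated on every write

def write_back(block, value):
    cache[block] = value
    dirty.add(block)           # memory is updated only when the block is replaced

def evict(block):
    if block in dirty:         # write the dirty block back before dropping it
        memory[block] = cache[block]
        dirty.discard(block)
    del cache[block]

write_through(1, "wt")
print(memory[1])   # "wt" immediately

write_back(0, "wb")
print(memory[0])   # still "old" until eviction
evict(0)
print(memory[0])   # "wb"
```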
Direct Mapped Cache: Taking
Advantage of Spatial Locality
• Taking advantage of spatial locality with larger blocks:
[Figure: 32-bit address (bit positions 31–0) split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a 2-bit byte offset; the index selects one of 4K entries (valid bit, tag, 128-bit data); a multiplexer driven by the block offset picks one of the four 32-bit words, and a tag compare produces the hit signal]

Cache with 4K 4-word blocks: the byte offset (the 2 least significant bits) is ignored,
the next 2 bits are the block offset, and the next 12 bits are used to index into the cache.
Direct Mapped Cache: Taking Advantage of
Spatial Locality

• Cache replacement in large (multiword) blocks:


– word read miss: read entire block from main memory
– word write miss: cannot simply write word and tag!
– writing in a write-through cache:
• if write hit, i.e., tag of requested address and cache entry are equal,
continue as for 1-word blocks by replacing word and writing block to both
cache and memory
• if write miss, i.e., tags are unequal, fetch block from memory, replace word
that caused miss, and write block to both cache and memory
• therefore, unlike case of 1-word blocks, a write miss with a multiword
block causes a memory read



Direct Mapped Cache: Taking Advantage of
Spatial Locality
• Miss rate falls at first with increasing block size, as expected, but as
block size becomes a large fraction of total cache size, miss rate may
go up because:
– there are few blocks
– competition for blocks increases
– blocks get ejected before most of their words are accessed
(thrashing in the cache)
[Figure: miss rate (0%–40%) vs. block size (4, 16, 64, 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]


Cache Performance
• Simplified model assuming equal read and write miss
penalties:
– CPU time = (execution cycles + memory stall cycles) × cycle time
– memory stall cycles = memory accesses × miss rate × miss penalty
[memory accesses for reading and/or writing]
• Therefore, two ways to improve cache performance:
– decrease the miss rate
– decrease the miss penalty
– what happens if we increase the block size?



Example Problems
• Assume for a given machine and program:
– instruction cache miss rate 2%
– data cache miss rate 4%
– miss penalty always 40 cycles
– CPI of 2 without memory stalls
– frequency of load/stores 36% of instructions

1. How much faster is a machine with a perfect cache that never misses?
2. What happens if we speed up the machine by reducing its CPI to 1
without changing the clock rate?
3. What happens if we speed up the machine by doubling its clock rate,
but the absolute time for a miss penalty remains the same?

[CPI = clock cycles per instruction ]


Solution
1. How much faster is a machine with a perfect cache that never misses?
[Given: instruction cache miss rate 2%; data cache miss rate 4%; miss penalty
always 40 cycles; CPI of 2 without memory stalls; loads/stores are 36% of instructions]

• Assume instruction count = I
• Instruction miss cycles = I × 2% × 40 = 0.8 × I
• Data miss cycles = I × 36% × 4% × 40 = 0.576 × I
• So, total memory-stall cycles = 0.8 × I + 0.576 × I = 1.376 × I
– in other words, 1.376 stall cycles per instruction
• Therefore, CPI with memory stalls = 2 + 1.376 = 3.376
• Assuming instruction count and clock rate remain the same for a perfect
cache and a cache that misses:
CPU time with stalls / CPU time with perfect cache
= (I × CPIstall × clock cycle) / (I × CPIperfect × clock cycle)
= 3.376 / 2
= 1.688
• Performance with a perfect cache is better by a factor of 1.688
• The fraction of execution time spent on memory stalls
= 1.376 / 3.376 = 40.75%
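The arithmetic for part 1 can be reproduced in a few lines (a sketch; the parameter names are ours, the values come from the problem statement):

```python
# Stall-cycle arithmetic for part 1 of the solution.

I_MISS_RATE = 0.02      # instruction cache miss rate
D_MISS_RATE = 0.04      # data cache miss rate
MISS_PENALTY = 40       # cycles
LOADSTORE_FRAC = 0.36   # fraction of instructions that are loads/stores
BASE_CPI = 2.0

inst_stalls = I_MISS_RATE * MISS_PENALTY                   # 0.8 per instruction
data_stalls = LOADSTORE_FRAC * D_MISS_RATE * MISS_PENALTY  # 0.576 per instruction
stall_cpi = inst_stalls + data_stalls                      # 1.376
cpi = BASE_CPI + stall_cpi                                 # 3.376
print(cpi, cpi / BASE_CPI)   # 3.376 and a 1.688x slowdown vs. a perfect cache
print(stall_cpi / cpi)       # ~0.4075 of execution time is memory stalls
```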
Solution (cont.)
2. What happens if we speed up the machine by reducing its CPI to 1
without changing the clock rate?
[Given: miss rates, miss penalty, and load/store frequency as before;
the CPI of 2 without memory stalls is changed to 1]

• CPI without stalls = 1
• CPI with stalls = 1 + 1.376 = 2.376 (the clock has not changed, so
stall cycles per instruction remain the same)
• CPU time with stalls / CPU time with perfect cache
= CPI with stalls / CPI without stalls
= 2.376 / 1
= 2.376
• Performance with a perfect cache is better by a factor of 2.376
• The fraction of execution time spent on memory stalls
= 1.376 / 2.376 = 57.91% (increased from 40.75%)

• Conclusion: with a lower base CPI, cache misses "hurt more" than with a
higher base CPI – the relative impact of stall cycles increases


Solution (cont.)
3. What happens if we speed up the machine by doubling its clock rate,
but the absolute time for a miss penalty remains the same?
[Given: miss rates, miss penalty time, base CPI of 2, and load/store
frequency as before]

• With the doubled clock rate, the miss penalty = 2 × 40 = 80 clock cycles
• Stall cycles per instruction = (I × 2% × 80) + (I × 36% × 4% × 80)
= 2.752 × I
• So, the faster machine with cache misses has CPI = 2 + 2.752 = 4.752
• CPU time with stalls / CPU time with perfect cache
= CPI with stalls / CPI without stalls
= 4.752 / 2 = 2.376
• Performance with a perfect cache is better by a factor of 2.376
• Conclusion: with a higher clock rate, cache misses "hurt more" than with a lower
clock rate
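Part 3's numbers follow the same pattern, with the penalty doubled to 80 cycles at the faster clock (a sketch):

```python
# Part 3: the clock rate doubles while the miss penalty keeps the same
# absolute time, so the penalty in cycles doubles to 80.

MISS_PENALTY = 80   # cycles at the doubled clock rate
stall_cpi = 0.02 * MISS_PENALTY + 0.36 * 0.04 * MISS_PENALTY  # 2.752
cpi = 2.0 + stall_cpi                                         # 4.752
print(cpi, cpi / 2.0)   # 4.752 and a 2.376x slowdown vs. a perfect cache
```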


Average memory access time (AMAT)

• Average memory access time (AMAT) is the average time to
access memory, considering both hits and misses and the
frequency of different accesses;
• it is equal to the following:
AMAT = Time for a hit + Miss rate × Miss penalty



AMAT Example
• Find the AMAT for a processor with:
• clock cycle time = 1 ns
• miss penalty = 20 clock cycles
• miss rate = 0.05 misses per instruction, and
• cache access time = 1 clock cycle (including hit detection)
• Assume that the read and write miss penalties are the same
and ignore other write stalls.
• Soln:
AMAT (per instruction)
= Time for a hit + Miss rate × Miss penalty
= 1 + 0.05 × 20
= 2 clock cycles, or 2 ns.
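The same computation as a short script (a sketch; constant names are ours):

```python
# AMAT for the parameters above: 1-cycle hit time, 0.05 miss rate,
# 20-cycle miss penalty, 1 ns clock cycle.

HIT_TIME = 1         # clock cycles, including hit detection
MISS_RATE = 0.05
MISS_PENALTY = 20    # clock cycles
CYCLE_NS = 1.0

amat_cycles = HIT_TIME + MISS_RATE * MISS_PENALTY
print(amat_cycles, amat_cycles * CYCLE_NS)   # 2.0 clock cycles, 2.0 ns
```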
Reference
[1] Patterson, D. A., & Hennessy, J. L. (2014). Computer Organization and
Design: The Hardware/Software Interface (5th ed.). Burlington, MA:
Morgan Kaufmann Publishers.

[2] Stallings, W. (2010). Computer Organization and Architecture (8th ed.).
Upper Saddle River, NJ: Prentice Hall.

[3] Hamacher, C., Vranesic, Z., Zaky, S., & Manjikian, N. (2012). Computer
Organization and Embedded Systems (6th ed.). New York, NY: McGraw-Hill.