Lec 4b
The Hardware/Software Interface, 5th Edition
Chapter 5
Large and Fast: Exploiting Memory Hierarchy (cont.)
Caching Example

Block size = 16 bytes, 4 blocks in cache.
Requests are byte addresses in decimal. Example: 164 = 001010,0100 in binary (block address 001010, byte offset 0100).
Miss type: CM = compulsory miss, CF = conflict miss, - = hit.

Direct-mapped cache

Request   Block addr.   Block addr.   Index   Cache   Miss
(byte)    (binary)      (decimal)             set 0   type
0         000000         0            00      0000    CM
164       001010        10            10      0000    CM
83        000101         5            01      0000    CM
192       001100        12            00      0011    CM
10        000000         0            00      0000    CF
90        000101         5            01      0000    -
175       001010        10            10      0000    -
673       101010        42            10      0000    CM
168       001010        10            10      0000    CF
59        000011         3            11      0000    CM

2-way cache

Request   Block addr.   Block addr.   Index   Cache set 0      Hit/   Miss
(byte)    (binary)      (decimal)                              Miss   type
0         000000         0            0       00000   -        M      CM
164       001010        10            0       00101   00000    M      CM
83        000101         5            1       00101   00000    M      CM
192       001100        12            0       00110   00101    M      CM
10        000000         0            0       00000   00110    M      CF
90        000101         5            1       00000   00110    H      -
175       001010        10            0       00101   00000    M      CF
673       101010        42            0       10101   00101    M      CM
168       001010        10            0       00101   10101    H      -
59        000011         3            1       00101   10101    M      CM
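The hit/miss pattern in the caching example can be reproduced with a small simulation. This is a minimal sketch, not part of the original slides: the function name `simulate` is made up for illustration, and LRU replacement is an assumption (the slide does not name a policy, but the order of the set 0 contents in the 2-way table is consistent with it).

```python
from collections import OrderedDict

BLOCK_SIZE = 16   # bytes per block (from the example)
NUM_BLOCKS = 4    # total blocks in the cache (from the example)

def simulate(addresses, ways):
    """Return 'H' or 'M' for each byte address (LRU replacement assumed)."""
    num_sets = NUM_BLOCKS // ways
    # One OrderedDict per set; insertion order tracks recency (last = newest).
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for addr in addresses:
        block = addr // BLOCK_SIZE      # block address
        index = block % num_sets        # which set the block maps to
        tag = block // num_sets         # remaining high-order bits
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            results.append("H")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict least recently used
            lines[tag] = True
            results.append("M")
    return results

requests = [0, 164, 83, 192, 10, 90, 175, 673, 168, 59]
direct = simulate(requests, ways=1)    # direct-mapped: 4 sets of 1 block
two_way = simulate(requests, ways=2)   # 2-way: 2 sets of 2 blocks
print("direct-mapped:", " ".join(direct))
print("2-way:        ", " ".join(two_way))
```

With `ways=1` the same function models the direct-mapped cache, so both organizations in the example can be compared on the same request sequence.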
[Figure: Radix sort vs. Quicksort, three panels with x-axis ticks from 1000 to 1E+07: instructions/key; time (clocks/key) shown with instructions/key; cache misses. Annotations: clock cycles/item, cache miss/item, more misses.]

memory access patterns
  Algorithm behavior
  Compiler optimization for memory access
Blocked Matrix Multiply
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where
b = n / N is called the block size.
for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
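The blocked loop nest can be written out concretely as follows. This is a minimal sketch in pure Python, assuming square matrices stored as lists of lists and a block size b that divides n evenly; the function names are hypothetical. Python lists do not model fast and slow memory, so the sketch shows only the loop order, not the performance benefit.

```python
def matmul_naive(A, B):
    """Reference n-by-n multiply, used only to check the blocked version."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_blocked(A, B, b):
    """Multiply n-by-n matrices by iterating over b-by-b subblocks."""
    n = len(A)
    assert n % b == 0, "this sketch assumes b divides n"
    N = n // b                      # number of blocks per dimension
    C = [[0] * n for _ in range(n)]
    for bi in range(N):             # block row of C
        for bj in range(N):         # block column of C
            for bk in range(N):     # C(bi,bj) += A(bi,bk) * B(bk,bj)
                for i in range(bi * b, (bi + 1) * b):
                    for j in range(bj * b, (bj + 1) * b):
                        acc = C[i][j]
                        for k in range(bk * b, (bk + 1) * b):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

# Small integer test matrices (arbitrary values chosen for illustration).
n, b = 4, 2
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
assert matmul_blocked(A, B, b) == matmul_naive(A, B)
```

The three outer loops correspond to i, j, k in the pseudocode; the three inner loops are the "matrix multiply on blocks" step.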
Blocked Matrix Multiply
m is the amount of memory traffic between slow and fast memory
The matrix has n x n elements and N x N blocks, each of size b x b
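One way to see how m depends on n and b is to count words moved while walking the loop structure of the blocked algorithm. This is a rough sketch under stated assumptions: every block read or write moves b*b words, nothing is reused across iterations beyond what the pseudocode shows, and the function name `blocked_traffic` is made up for illustration.

```python
def blocked_traffic(n, b):
    """Count words moved between slow and fast memory by the blocked loops."""
    assert n % b == 0, "this sketch assumes b divides n"
    N = n // b                   # blocks per dimension
    words_per_block = b * b
    m = 0
    for i in range(N):
        for j in range(N):
            m += words_per_block         # read block C(i,j)
            for k in range(N):
                m += words_per_block     # read block A(i,k)
                m += words_per_block     # read block B(k,j)
            m += words_per_block         # write block C(i,j) back
    return m

# Example: n = 8, b = 2 gives N = 4.
n, b = 8, 2
N = n // b
print(blocked_traffic(n, b))
# The loop counts add up to (2N + 2) * n^2 words.
assert blocked_traffic(n, b) == (2 * N + 2) * n * n
```

Under these assumptions, for a fixed n a larger block size b means a smaller N and therefore less traffic, as long as the blocks still fit in fast memory.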
Unoptimized Blocked