
Advanced Computer Architecture
Dr. Angela C. Sodan
Lecture 8
Overview
 Memory hierarchy / caches
Memory Hierarchy
 Recall: memory/processor gap is
increasing
 Technological background:
 DRAM – dynamic, must be (automatically)
refreshed; improved performance with BiCMOS
-> the technology for main memory
 SRAM – static, no refreshing, faster, more
expensive
-> the technology for caches
 In addition, access speed decreases as size grows
Memory Hierarchy
 Trade-off between cost/size and speed
(and space on chip)
 Basic solution: memory hierarchy
 Registers
 Caches (multilevel, bridging)
 Main memory
 Other measures:
 Memory interleaving
 (Multiple) register sets
Review Caches
 Which characteristics are important?
 The cache is implicit (typically transparent, with all
transfers automatic)
Unlike registers/memory, which are
explicitly accessed (load/store)
 In speed and size between registers and
memory
 Works as a “buffer” to memory (load/store)
 Why can this work?
Caches and Locality
 Caches – only – work if the program has
temporal and/or spatial locality
 This is typically the case, but
 What does this mean?
 Temporal: the same data is accessed
multiple times within short time intervals
e.g. a loop
 Spatial: spatially close data is accessed
e.g. a vector/array accessed sequentially,
and instructions if there is no branch…
Caches and Locality (cont.)
 Thus it helps to
 Keep recently used data in the cache
 Fetch more than one word and even
pipeline memory accesses
 Modern processors depend heavily on the
cache
 High hit rates (about 90%) are required for
good performance
Review Caches (cont.)
 Different approaches
 Associative or direct
 Associative: stores blocks anywhere, “searches”,
complex and expensive, works for all kinds of
accesses
 Direct: hashes the address, potential conflicts, can
work poorly with certain access sequences
 In practice, the hybrid set-associative approach is
used: direct mapping to a set with a few associative
slots per set – still cheap and works very well,
4-way is typical
Review Caches (cont.)
 A replacement strategy is required for
associative and set-associative caches
 Least-Recently Used: applies the locality idea
 Random: easy and often applied in caches
(Not much difference for larger caches, if the
penalty for a miss is not too high)
 Ensuring data coherence:
 Reading is uncritical
 Writes need to reach memory, too
– Store through (write through): every change goes
immediately to memory
– Write back: memory is updated only when the line is
replaced; applied in modern processors
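
A minimal sketch (not from the slides) of LRU replacement within one
4-way set, using an age counter per slot; the type and field names are
hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4  /* 4-way set-associative, as on the slide */

/* One way (slot) of a cache set: valid bit, tag, and an age counter;
   age is reset on every access, so the largest age marks the LRU slot. */
typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t age;
} slot_t;

/* Look up 'tag' in one set; on a miss, fill an invalid slot or evict
   the least-recently-used one. Returns true on a hit. */
bool access_set(slot_t set[WAYS], uint32_t tag)
{
    for (int w = 0; w < WAYS; w++)
        set[w].age++;                    /* every slot ages by one access */

    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].age = 0;              /* hit: most recently used again */
            return true;
        }
    }

    int victim = 0;                      /* miss: pick invalid or oldest slot */
    for (int w = 1; w < WAYS; w++) {
        if (!set[victim].valid)
            break;
        if (!set[w].valid || set[w].age > set[victim].age)
            victim = w;
    }
    set[victim].valid = true;
    set[victim].tag   = tag;
    set[victim].age   = 0;
    return false;
}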
Review Caches (cont.)
 The unit of transfer is the cache line (multiple words)
 Exploits spatial locality
 Longer cache line: good if there is much locality
Danger: useless fetches, wasted cache space
 Shorter cache line: good if there is little locality
 Typical is 32 bytes, recently sometimes 64 bytes
 Larger cache size compensates for longer
cache lines, direct caches and simpler
replacement
The Miss Sources
 Conflict misses -> add associativity,
make the cache larger
 Capacity misses -> make the cache larger
 Initial (cold) misses -> make cache lines longer
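
An illustration (not from the slides) of how conflict misses arise:
with power-of-two sizes, a[i] and b[i] may map to the same set of a
direct-mapped cache and evict each other on every iteration; the sizes
are hypothetical:

#include <stddef.h>

/* Hypothetical sizes: two 64-KiB arrays. If the data cache is
   direct-mapped and 64 KiB, a[i] and b[i] map to the same set for
   every i (the arrays typically end up cache-size bytes apart). */
#define N (65536 / sizeof(double))

static double a[N], b[N];

/* Alternating accesses: each b[i] can evict the line holding a[i],
   so the next a[i+1] misses again; these are conflict misses, even
   though the two lines in use at any time easily fit (no capacity
   problem). */
double sum_alternating(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a[i] + b[i];
    return s;
}

Padding one array by a cache line, or adding associativity as above,
removes these misses.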
A Cost Metric
 Teff – effective access time

Teff = h1 t1 + (1-h1) h2 t2 + (1-h1) (1-h2) h3 t3 + …

with hi the hit rate at level i and ti the access time at level i

 Use it to calculate the average access time, or to
configure systems according to
requirements
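
A minimal worked example (not from the slides; the hit rates and
access times are hypothetical), evaluating the formula in C:

#include <stdio.h>

/* Effective access time over 'levels' memory levels:
   Teff = h1 t1 + (1-h1) h2 t2 + (1-h1)(1-h2) h3 t3 + ...
   h[i] is the hit rate at level i, t[i] the access time (here in ns).
   The last level is assumed to always hit (h = 1.0). */
double effective_access_time(const double h[], const double t[], int levels)
{
    double teff = 0.0;
    double miss_so_far = 1.0;   /* probability of missing all prior levels */

    for (int i = 0; i < levels; i++) {
        teff        += miss_so_far * h[i] * t[i];
        miss_so_far *= (1.0 - h[i]);
    }
    return teff;
}

int main(void)
{
    /* hypothetical: L1 cache, L2 cache, main memory */
    double h[] = { 0.90, 0.95, 1.00 };   /* hit rates          */
    double t[] = { 1.0, 10.0, 100.0 };   /* access times in ns */

    printf("Teff = %.2f ns\n", effective_access_time(h, t, 3));
    /* 0.90*1 + 0.10*0.95*10 + 0.10*0.05*1.00*100 = 2.35 ns */
    return 0;
}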
Some Comments About
Machine Configuration
 If you purchase a machine, think about your
requirements (estimates of typical
applications)
E.g. will you have large problems which should fit
into memory or even the cache? How fast does your
disk need to be (or do you want multiple disks)?
 This may be a question of how to spend
your available budget…
 More later when we talk about performance
Caches are Implicit – Are They
Really?
 Answer:
 Yes, with respect to semantic correctness –
enough for standard programming
 No, with respect to performance
– relevant for high-performance computing
 Requires proper loop organization
(by the user or the compiler)
Getting Good Cache Performance
 Temporal: partition into small loops
for (i = 0; i < n; i++)
for (l = 0; l < k; l++)
for (j = 0; j < m; j++)
c[i][l] += a[i][j] * b[j][l];
may re-use row i of a (if it fits into the cache)
 Spatial: use vectors and sequential
access if possible
Significant speedup in a bio-computing application
Cache Locality (cont.)
A simple improvement to at least
exploit spatial locality for the second matrix
for (i = 0; i < n; i++)    /* loop interchange */
for (j = 0; j < m; j++)
for (l = 0; l < k; l++)
c[i][l] += a[i][j] * b[j][l];
Note that this code now accesses lines of the
second matrix in the innermost loop!
This can increase performance by up to e.g. a factor
of 32/4 = 8 (32-byte lines, 4-byte elements) if only
one row/column of B fits into the cache at a time
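
For completeness, a self-contained C version of the interchanged loop
nest (not from the slides; row-major layout and the names are
assumptions):

#include <stddef.h>

/* Interchanged (i, j, l) matrix multiply: the innermost loop walks
   row j of B and row i of C sequentially, so whole cache lines are
   used once fetched (spatial locality).
   C is n x k, A is n x m, B is m x k, all row-major. */
void matmul_interchanged(size_t n, size_t m, size_t k,
                         const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            double aij = A[i * m + j];          /* reused k times */
            for (size_t l = 0; l < k; l++)
                C[i * k + l] += aij * B[j * k + l];
        }
}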
Cache Locality (cont.)
 Sequential access to a vector outperforms
 Arbitrary access to a vector
 Linked data structures
(unless they fit completely into the cache)
 For multiple related data items
 If access is arbitrary, put them in a single array
of structs
 If access is sequential, it does not matter

struct { double real, imag; } data[n];
vs.
double data_r[n], data_i[n];
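
An illustration (not from the slides; names and the size are
hypothetical) of the two layouts under arbitrary access:

#include <stddef.h>

#define N 1024  /* hypothetical problem size */

/* Array of structs: real and imag of one element share a cache line,
   so an arbitrary access to element i touches only one line. */
struct complex { double real, imag; };
static struct complex data_aos[N];

/* Two separate arrays: real and imag of element i live in different
   lines, so an arbitrary access to element i touches two lines. */
static double data_r[N], data_i[N];

double magnitude_sq_aos(size_t i)
{
    return data_aos[i].real * data_aos[i].real
         + data_aos[i].imag * data_aos[i].imag;   /* one cache line  */
}

double magnitude_sq_soa(size_t i)
{
    return data_r[i] * data_r[i]
         + data_i[i] * data_i[i];                 /* two cache lines */
}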
Cache Locality (cont.)
 The matrix example revisited: further
possible improvement:
 Try to keep the second matrix in the cache
to exploit temporal locality
 Since the whole matrix is most likely too
large, partition it into submatrices
Cache Locality Advanced –
Blocking

[ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
[ A21 A22 ] x [ B21 B22 ] = [ C21 C22 ]

Calculate the submatrices and combine them:

Cij = Ai1*B1j + Ai2*B2j
Cache Locality Advanced (cont.)
 Assume that each matrix is n*n,
m is the number of blocks per dimension,
and that a submatrix fits into the cache!!!
 We have to load A m times instead of
once, but B only m times instead of n times
 Which gives (2m)*n*n elements instead of
(n+1)*n*n elements to load from memory into
the cache
E.g. n=1000, m=2 gives
4 million vs. 1001 million elements
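
A sketch of the blocked multiply (not from the slides; the block size
BS is a hypothetical tuning parameter, chosen so that three BS x BS
submatrices fit into the cache):

#include <stddef.h>

#define BS 64  /* hypothetical block size */

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked matrix multiply, C += A * B, all matrices n x n, row-major.
   The three outer loops walk BS x BS blocks; the three inner loops are
   the interchanged (i, j, l) nest from the earlier slide, restricted
   to one block, so the B block stays cache-resident while reused. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t ll = 0; ll < n; ll += BS)
                for (size_t i = ii; i < min_sz(ii + BS, n); i++)
                    for (size_t j = jj; j < min_sz(jj + BS, n); j++) {
                        double aij = A[i * n + j];
                        for (size_t l = ll; l < min_sz(ll + BS, n); l++)
                            C[i * n + l] += aij * B[j * n + l];
                    }
}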
Final Comment
 Again, you do not need to change all your
programming
 But it is worth it if you write programs which
will run many times and take a lot of
runtime:
imagine you want to test 100 scenarios in a
simulation and each runs 10 hours vs. 1 hour!
