Exercises memory-caches
Contents
1 Exercises week 8
  1.1 Home Assignment 1
  1.2 Memory systems, Cache II
2 Exercises week 9
  2.1 Memory systems, Virtual Memory
3 Exercises week 10
  3.1 Storage systems, I/O
4 Exercises week 11
  4.1 Multiprocessors I
5 Exercises week 12
  5.1 Home Assignment 2 - online quiz
  5.2 Multiprocessors II
6 Exercises week 13
  6.1 Old exam 2003-12-17
7 Exercises week 14
  7.1 Questions and answers session
8 Brief answers
  8.1 Memory systems, Cache II
  8.2 Memory systems, Virtual Memory
  8.3 Storage systems, I/O
  8.4 Multiprocessors I
  8.5 Multiprocessors II
  8.6 Old exam 2003-12-17
1 Exercises week 8
1.1 Home Assignment 1
OPTIONAL!
However - An approved home assignment will give you 2 extra points on the
exam.
Select one item from the list below and describe it in as much detail as possible. You should characterize it along all relevant subjects covered by this course (such as ISA, pipeline type, stages, registers, register renaming, type of instruction issue, type of instruction commit, type of scheduling, branch prediction, superscalar, VLIW, caches, cache optimizations, bandwidth, size, number of transistors, etc.). Your report has to include all references used to find the information included.
• AMD Barcelona
• Intel Core 2
• Intel Core i7
• Intel/HP Itanium 2
• IBM Power6
• IBM Power7
• ARMv7
• ARM Cortex-A8
Home assignments are individual. You are not allowed to copy from each other.
You should aim at producing 2-5 A4 pages of text (more if many figures are used), including references.
The text can be either in English or Swedish.
You are allowed to cite 1-3 sentences with explicit reference to the source. Larger pieces
of text copied from any source will be detected by the Urkund system to which your home
assignments will be submitted, and are of course not allowed. Any detected plagiarism will
automatically generate a fail on the assignment.
The home assignment must at least contain:
• Your name
All the sources you have used must be listed in a ’References’ section at the end of your home
assignment. You can use any number of sources to find the information you need to complete
the assignment, for example:
• E-huset physical library (http://www.ehuset.lth.se/english/library/)
1.2 Memory systems, Cache II
Exercise 1.1 Which block replacement strategies are commonly used in caches, and which of the following mapping schemes need a replacement algorithm?
• Direct mapping
• Set-associative mapping
Exercise 1.2 Draw a schematic of how the following cache memory can be implemented: total size 16 kB, 4-way set-associative, block size 16 bytes, replacement algorithm LRU, copy-back (write-back).
The schematic should, among other things, show the central MUXes, comparators and connections. It should clearly indicate how a physical memory address is translated to a cache position. The greater the detail shown, the better.
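As a quick cross-check on the address breakdown implied by these parameters, the following C sketch shows how a physical address would map onto this cache (offset, index and tag fields). The 32-bit address width and the example address are assumptions for illustration, not part of the exercise.

/* Sketch: address breakdown for a 16 kB, 4-way set-associative cache with
 * 16-byte blocks (assuming 32-bit physical addresses):
 *   offset = log2(16) = 4 bits
 *   sets   = 16384 / (4 * 16) = 256  ->  index = 8 bits
 *   tag    = 32 - 8 - 4 = 20 bits, compared in parallel in all four ways
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 16u                                    /* bytes per block */
#define NUM_WAYS    4u                                    /* associativity   */
#define CACHE_SIZE (16u * 1024u)                          /* total size      */
#define NUM_SETS   (CACHE_SIZE / (NUM_WAYS * BLOCK_SIZE)) /* 256 sets        */

int main(void)
{
    uint32_t addr   = 0x12345678u;                    /* example address          */
    uint32_t offset = addr % BLOCK_SIZE;              /* byte within the block    */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS; /* selects one of 256 sets  */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS); /* matched against all ways */

    printf("addr=0x%08x  tag=0x%05x  index=%u  offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}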
Exercise 1.3 What is a write-through cache? Is it faster or slower than a write-back cache with respect to the time it takes to perform a write?
Exercise 1.5 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 5.5
Exercise 1.6 Three ways, using hardware and/or software, to decrease the time a program spends on (data) memory accesses are: non-blocking caches, hardware prefetching (e.g. with stream buffers), and software prefetching. How does each of these techniques affect:
a) miss rate
b) memory bandwidth
c) the number of executed instructions
Exercise 1.7 In systems with a write-through L1 cache backed by a write-back L2 cache instead of main memory, a merging write buffer can be simplified.
a) Explain how the merging write buffer can be simplified in this case.
b) Are there situations where having a full write buffer (instead of the simple version you have just proposed) could be helpful?
Exercise 1.9 Explain where replacement policy fits into the three C's model, and explain why this means that misses caused by a replacement policy are "ignored" – or, more precisely, cannot in general be definitively classified – by the three C's model.
2 Exercises week 9
2.1 Memory systems, Virtual Memory
Exercise 2.1 As caches increase in size, blocks often increase in size as well.
a) If a large instruction cache has larger blocks, is there still a need for pre-fetching? Explain
the interaction between pre-fetching and increased block size in instruction caches.
b) Is there a need for data pre-fetch instructions when data blocks get larger?
Exercise 2.2 Some memory systems handle TLB misses in software (as an exception), while
others use hardware for TLB misses.
a) What are the trade-offs between these methods for handling TLB misses?
b) Will TLB miss handling in software always be slower than TLB misses in hardware?
Explain!
c) Are there page table structures that would be difficult to handle in hardware, but possible
in software? Are there any such structures that would be difficult for software to handle
but easy for hardware to manage?
Exercise 2.3 The difficulty of building a memory system to keep pace with faster CPUs is underscored by the fact that the raw material for main memory is the same as that found in the cheapest computer; the performance difference comes from how that memory is arranged.
a) List the four design decisions – block placement, block identification, block replacement and write strategy – that govern the hierarchical construction of virtual memory.
b) What is the main purpose of the Translation-Lookaside Buffer within the memory hierarchy? Give an appropriate set of construction rules and explain why.
Exercise 2.4 Designing caches for out-of-order (OOO) superscalar CPUs is difficult for several
reasons. Clearly, the cache will need to be non-blocking and may need to cope with several
outstanding misses. However, the access pattern for OOO superscalar processors differs from
that generated by in-order execution.
What are the differences, and how might they affect cache design for OOO processors?
Exercise 2.5 Consider the following three hypothetical, but not atypical, processors, which we
run with the SPEC gcc benchmark
1. A simple MIPS two-issue static pipe running at a clock rate of 4 GHz and achieving
a pipeline CPI of 0.8. This processor has a cache system that yields 0.005 misses per
instruction.
2. A deeply pipelined version of a two-issue MIPS processor with slightly smaller caches and
a 5 GHz clock rate. The pipeline CPI of the processor is 1.0, and the smaller caches yield
0.0055 misses per instruction on average.
3. A speculative MIPS, superscalar with a 64-entry window. It achieves one-half of the ideal issue rate measured for this window size (9 instruction issues per cycle). This processor has the smallest caches, which leads to 0.01 misses per instruction, but it hides 25 % of the miss penalty on all misses by dynamic scheduling. This processor has a 2.5 GHz clock.
Assume that the main memory time (which sets the miss penalty) is 50 ns. Determine the
relative performance of these three processors.
Exercise 2.6 a) Give three arguments for larger pages in virtual memory, and one against.
b) Describe the concepts 'page', 'page fault', 'virtual address', 'physical address', 'TLB', and 'memory mapping', and how they are related.
c) How much memory does the page table, indexed by the virtual page number, take for a system using 32-bit virtual addresses, 4 KB pages, and 4 bytes per page table entry? The system has 512 MB of physical memory.
d) In order to save memory, inverted page tables are sometimes used. Briefly describe how they are structured. How much memory would an inverted page table take for the above system?
Exercise 2.7 A) Describe two cache memory optimization techniques that may improve hit
performance (latency and throughput). For each technique, specify how it affects hit time
and fetch bandwidth.
B) Describe two cache memory optimization techniques that may reduce miss rate, and define
the miss type (compulsory, capacity, conflict) that is primarily affected by each technique.
C) Describe two cache memory optimization techniques that may reduce miss penalty.
3 Exercises week 10
3.1 Storage systems, I/O
For Hennessy/Patterson exercises 6.8 - 6.14.
For Hennessy/Patterson exercises 6.19 - 6.22.
Exercise 3.7 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.22
4 Exercises week 11
4.1 Multiprocessors I
Exercise 4.1 There are two main varieties (classes) of hardware-based cache coherence proto-
cols. Which are they and what are the main differences, strengths and weaknesses?
Exercise 4.2 Briefly describe MIMD and SIMD computers outlining the differences. Give
examples of computers (or type of computers) from each class.
Exercise 4.3 Assume a directory-based cache coherence protocol. The directory currently has
information that indicates that processor P1 has the data in “exclusive” mode.
If the directory now gets a request for the same cache block from processor P1, what could
this mean? What should the directory controller do?
Exercise 4.4 Although it is widely believed that buses are the ideal way to interconnect small-
scale multiprocessors, this may not always be the case. For example, increases in processor
performance are lowering the processor count at which a more distributed implementation be-
comes attractive. Because a standard bus-based implementation uses the bus both for access to
memory and for inter-processor coherency traffic, it has a uniform memory access time for both.
In comparison, a distributed memory implementation may sacrifice on remote memory access,
but it can have a much better local memory access time.
Consider the design of a multi-processor with 16 processors. Each CPU is driven by a 150
MHz clock. Assume that a memory access takes 150 ns from the time the address is available
from either the local processor or a remote processor until the first word is delivered. The bus
is driven by a 50 MHz clock. Each bus transaction takes five bus clock cycles, each 20 ns in
length, to perform arbitration, resolution, address, decode and acknowledge.
The detection of the miss and the generation of the memory request by the processor consists
of three steps: detecting a miss in the primary on-chip cache; initiating a secondary (off-chip)
cache access and detecting a miss in the secondary cache; and driving the complete address
off-chip through the bus. This process takes about 40 processor clock cycles.
For the bus and memory component, the initial read request is one bus transaction of 5 bus
cycles. The latency until memory is ready to transfer is 12 bus clock cycles. The reply will then
transfer all 128 bytes of a cache block in one reply transaction, taking 5 bus clock cycles. The
total is 22 bus clock cycles, which equals 66 processor clocks. It takes 16 bus cycles to reload the cache line, while restarting the pipeline takes 10 processor cycles; together this is 58 processor cycles.
a) What is the total time, in processor clock cycles, for a memory access (a miss in the second-level cache) on this bus-based design?
b) Assume that the interconnect is a 2-D grid with links that are 16 bits wide and clocked
at 100 MHz, with a start-up time of five cycles for a message. Assume one clock cycle
between nodes in the network, and ignore overhead in the messages and contention (i.e.
assume that the network bandwidth is not the limit). Find the average remote memory
access time, assuming a uniform distribution of remote requests.
Exercise 4.5 Nearly all computer manufacturers offer today multi-core microprocessors. This
assignment focuses on concepts central to how thread-level parallelism can be exploited to offer
a higher computational performance.
a) The performance of a superscalar processor is limited by the amount of instruction-level
parallelism in the program. In particular, when a load instruction must fetch data from
memory, it can be difficult to find a sufficient number of independent instructions to
execute while the data is being fetched from memory. Multithreading is a technique to
do useful work while waiting for the data to be returned from memory. Explain how the
following concepts can keep the processor busy doing useful work:
– Fine-grain multithreading
– Coarse-grain multi-threading
– Simultaneous multithreading
c) Flynn classifies computer architectures into four categories based on their instruction and data streams. Which ones?
– How is the lock primitive in a critical section implemented using test-and-set instructions? (A sketch of one possible implementation is given after the tables below.)
b) What is cache coherence? Give an example of what can happen if cache coherence is
missing.
c) A commonly used cache coherence protocol relies on snooping and invalidations. Below you
find a list of requests that arrive to the cache coherence mechanism. Connect all requests,
A-N, with the correct cache action and explanation, 1-14. Hint: each request matches
exactly one action/explanation. Your answer should be a table listing all connections like
A-3, B-2, C-8, etc...
Request      Source      State of addressed cache block    Label
Read hit     Processor   shared or modified                A
Read miss    Processor   invalid                           B
Read miss    Processor   shared                            C
Read miss    Processor   modified                          D
Write hit    Processor   modified                          E
Write hit    Processor   shared                            F
Write miss   Processor   invalid                           G
Write miss   Processor   shared                            H
Write miss   Processor   modified                          I
Read miss    Bus         shared                            J
Read miss    Bus         modified                          K
Invalidate   Bus         shared                            L
Write miss   Bus         shared                            M
Write miss   Bus         modified                          N
    Type of cache action   Function and explanation
1   normal hit     Write data in cache.
2   coherence      Place invalidate on bus.
3   coherence      Attempt to write block that is shared; invalidate the cache block.
4   normal hit     Read data in cache.
5   replacement    Address conflict miss: write back block, then place write miss on bus.
6   normal miss    Place read miss on bus.
7   replacement    Address conflict miss: write back block, then place read miss on bus.
8   normal miss    Place write miss on bus.
9   coherence      Attempt to write shared block; invalidate the block.
10  coherence      Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid.
11  coherence      Attempt to share data: place cache block on bus and change state to shared.
12  replacement    Address conflict miss: place write miss on bus.
13  replacement    Address conflict miss: place read miss on bus.
14  no action      Allow memory to service read miss.
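For the question above on lock primitives, the following C sketch shows one common way a spin lock can be built on a test-and-set primitive. The GCC builtins __sync_lock_test_and_set and __sync_lock_release are used here only as stand-ins for the processor's atomic test-and-set instruction; the example is an illustration, not part of the exercise.

/* Sketch: spin lock built on a test-and-set primitive.
 * __sync_lock_test_and_set() atomically writes 1 to *lock and returns the
 * previous value, which is exactly what a test-and-set instruction does. */
#include <stdio.h>

typedef volatile int spinlock_t;

static void lock_acquire(spinlock_t *lock)
{
    /* Retry until the old value was 0, i.e. the lock was free. Each retry
     * generates coherence traffic on the block holding the lock. */
    while (__sync_lock_test_and_set(lock, 1) == 1)
        ;                               /* spin */
}

static void lock_release(spinlock_t *lock)
{
    __sync_lock_release(lock);          /* atomically store 0 (release) */
}

spinlock_t lock = 0;
int shared_counter = 0;

int main(void)
{
    lock_acquire(&lock);                /* enter the critical section   */
    shared_counter++;                   /* update protected by the lock */
    lock_release(&lock);                /* leave the critical section   */
    printf("counter = %d\n", shared_counter);
    return 0;
}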
5 Exercises week 12
5.1 Home Assignment 2 - online quiz
OPTIONAL!
However - An approved quiz will give you 2 extra points on the exam.
Take the quiz available for Computer Architecture EIT090 at http://courses.eit.lth.se/. It will be open during weeks 12 and 13 (weeks 5 and 6 in HT2: 2009-11-23 – 2009-12-06). You have to log in to be able to see it.
Every student has a username and password based on your official mail address. For example, for [email protected] it will be:
username: et01xy9
password: ePWt01xy9
If you have a problem, contact the course coordinator, Anders Ardö. You can take the quiz any number of times during the period mentioned above.
When you have logged in, choose 'Computer Architecture EIT090' and click on the quiz. Then you can start answering questions. After all questions are answered you can send in your answers by clicking on 'Submit all and finish'. You will get feedback saying how many correct answers you have. Both questions and numeric values in the quiz are selected randomly each time you try the quiz. Redo the quiz until you have at least 90 % correct in order to be approved.
5.2 Multiprocessors II
Exercise 5.1 Assume that we have a function for an application of the form F (i, p) which gives
the fraction of time that exactly i processors are usable given that a total of p processors are
available. This means that
    \sum_{i=1}^{p} F(i, p) = 1
Assume that when i processors are in use, the application runs i times faster. Rewrite
Amdahl’s Law so that it gives the speedup as a function of p for some application.
Exercise 5.2 One proposed solution for the problem of false sharing is to add a valid bit per word (or even per byte). This would allow the protocol to invalidate a word without removing the entire block, letting a cache keep a portion of a block while another processor writes a different portion of the block. What extra complications are introduced into the basic snooping cache coherence protocol (see figure below) if this capability is included? Remember to consider all possible protocol actions.
Exercise 5.3 Some systems do not use multiprocessing for performance. Instead they run the same program in lockstep on multiple processors. What potential benefit is possible on such multiprocessors?
Exercise 5.4 When trying to perform detailed performance evaluation of a multiprocessor sys-
tem, system designers use one of three tools: analytical models, trace-driven simulation, and
execution-driven simulation. Analytical models use mathematical expressions to model the be-
havior of programs. Trace driven simulations run the applications on a real machine and generate
a trace, typically of memory operations. These traces can then be replayed through a cache sim-
ulator or a simulator with a simple processor model to predict the performance of the system
when various parameters are changed. Execution driven simulators simulate the entire execution
including maintaining an equivalent structure for the processor state, and so on. Discuss the
accuracy/speed trade-offs between these approaches.
6 Exercises week 13
6.1 Old exam 2003-12-17
Exercise 6.1
Computer Architecture, EIT 090
Final Exam, Department of Information Technology
17 December 2003, 8–13
The exam consists of a number of problems with a total of 50 points.
Grading: 20 p ≤ grade 3 < 30 p ≤ grade 4 < 40 p ≤ grade 5
Instructions:
• You may use a pocket calculator and an English dictionary on this exam, but no other
aids
• Please start answering each problem on a new sheet – New problem =⇒ New sheet
• Write your name on each sheet of paper that you hand in – Name on each sheet
• You must motivate your answers thoroughly. If, in your opinion, there is not enough information to solve a problem, you may make any reasonable assumptions that you need in order to solve it. State these assumptions clearly!
Problem 1
Briefly (1-2 sentences) describe the following items/concepts concerning computer architecture: (10)
a) dominance
b) basic block
e) register renaming
f) data dependency
g) way prediction
h) unified cache
i) sequential consistency
Problem 2
a) Describe the concept “memory hierarchy”, and state why it is important. State the func-
tion of each part, normally used hardware components, and what problems they solve (if
any). (5)
Problem 3
a) Derive an expression for the speedup gained by pipelining, expressed in the quantities shown in the example below. (4)
Example: [Figure: an unpipelined datapath (IN – T_unpipelined – OUT) compared with a pipelined datapath in which latches, driven by a common clock, separate the stages T_pipestage1, T_pipestage2 and T_pipestage3.]
b) Derive an expression for the speedup gained by pipelining when branches are taken into account, expressed in the following quantities:
– noofstages: number of pipe stages (assume equal pipe stage execution time).
– branch freq (bf): relative frequency of branches in the program.
– branch penalty (bp): number of clock cycles lost due to a branch.
(3)
c) Give an example of a piece of assembly code that contains WAW, RAW and WAR hazards
and identify them. (Use for example the assembly instruction
ADDD Rx,Ry,Rz
which stores (Ry+Rz) in Rx) (3)
Problem 4
Consider the following three computers:
A: a Celeron 2.4 GHz processor, 128 KByte cache, 128 byte blocks, copy-back (write-back)
with an average of 30 % dirty blocks, price 650 SEK
B: a P4 2.4 GHz processor, 512 KByte cache, 128 byte blocks, copy-back (write-back) with
an average of 35 % dirty blocks, price 1495 SEK
C: a P4 3.0 GHz processor, 512 KByte cache, 128 byte blocks, copy-back (write-back) with
an average of 35 % dirty blocks, price 2595 SEK
The main application is program development, so the compiler gcc is considered to be the most used program and is therefore used as the performance indicator. Assume that the processors have the same architecture, and that the base CPI (for gcc) without effects from the above mentioned cache (but including other caches and TLB) is 1.1.
Some statistics for gcc:
Cache size Miss rate
512 KB 0.0075
256 KB 0.0116
128 KB 0.0321
64 KB 0.09
Instruction frequencies
  load           25.8 %
  store          13.4 %
  uncond branch   4.8 %
  cond branch    15.5 %
  int            40.5 %
  fp              0 %
Main memory takes 50 ns to set up, and each transfer of 128 bits from main memory to the cache takes 4 ns. Assume that the memory system can sustain these speeds and widths.
Which of the three computers (A, B, C) has the best price/performance ratio? Motivate your answer thoroughly. (10)
Problem 5
a) There are two main varieties (classes) of hardware-based cache coherence protocols. Which are they, and what are their main differences, strengths and weaknesses? (4)
b) Briefly describe MIMD and SIMD computers, outlining the differences. Give examples of
computers (or type of computers) from each class. (4)
c) Use Amdahl’s law to give a quantitative argument for keeping a computer system balanced
in terms of relative performance (for example processor speed versus I/O speed) as tech-
nological and methodological development improves various sub-systems of a computer.
(2)
7 Exercises week 14
7.1 Questions and answers session
8 Brief answers
8.1 Memory systems, Cache II
1.1 When a cache miss occurs, the controller must select a block to be replaced with the desired
data. Three primary strategies for doing this are: Random, Least-recently used (LRU), and
First-in first-out (FIFO).
A replacement algorithm is needed with set-associative and fully associative caches. For direct-mapped caches there is no choice; the block to be replaced is uniquely determined by the address.
[Figure: schematic of a 4-way set-associative cache (cf. Exercise 1.2). The index part of the address selects one block position in each of the four ways; each way stores a TAG, a dirty bit and the data. Four comparators match the address tag against the stored tags, drive a 4-to-1 MUX that selects the data, and generate the hit/miss signal.]
1.3 A cache write is called write-through when the information is written both to the block in the cache and to the block in the lower-level memory; when the information is written only to the block in the cache, it is called write-back. Write-back is the faster of the two for writes, since the write completes at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory.
1.6 a) miss rate:
– non-blocking caches: do not affect the miss rate; the main effect is that the processor does other useful work while the miss is handled.
– hardware prefetching with stream buffers: a hit in the stream buffer cancels the cache request, i.e. the memory reference is not counted as a miss, which means that the miss rate will decrease.
– software prefetching: if correctly done, the miss rate will decrease.
b) memory bandwidth:
– non-blocking: since the processor will have fewer stall cycles it will get a lower CPI
and consequently the requirements on memory bandwidth will increase
– hardware prefetch and software prefetch: prefetching is a form of speculation which
means that some of the memory traffic is unused which in turn might increase the
need for memory bandwidth
c) number of executed instructions: will be unchanged for non-blocking caches and hardware prefetching. Software prefetching adds the prefetch instructions, so the number of executed instructions will increase.
1.7 a) The merging write buffer links the CPU to the write-back L2 cache. Two CPU writes
cannot merge if they are to different sets in L2. So, for each new entry into the buffer a
quick check on only those address bits that determine the L2 set number need be performed
at first. If there is no match in this ‘screening’ test, then the new entry is not merged. If
there is a set number match, then all address bits can be checked for a definitive result.
b) As the associativity of L2 increases, the rate of false positive matches from the simplified
check will increase, reducing performance.
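A minimal sketch of the 'screening' check described in a) is given below. The L2 block size (64 bytes) and number of sets (8192) are assumptions chosen only to make the bit fields concrete; the point is that the cheap set-index comparison rejects most candidates before the full block-address comparison is made.

/* Sketch: screening test for a merging write buffer in front of a write-back
 * L2 cache. Assumed (hypothetical) L2 geometry: 64-byte blocks, 8192 sets,
 * so the set index is bits [18:6] of the address. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define L2_BLOCK_BITS 6u    /* 64-byte blocks (assumed) */
#define L2_INDEX_BITS 13u   /* 8192 sets (assumed)      */

static uint32_t l2_set(uint32_t addr)
{
    return (addr >> L2_BLOCK_BITS) & ((1u << L2_INDEX_BITS) - 1u);
}

/* May a new write to 'addr' merge with the buffered write to 'entry'? */
static bool may_merge(uint32_t entry, uint32_t addr)
{
    if (l2_set(entry) != l2_set(addr))       /* cheap check on a few bits      */
        return false;                        /* different L2 sets: never merge */
    /* Set indexes match: compare the full block address for a definitive answer. */
    return (entry >> L2_BLOCK_BITS) == (addr >> L2_BLOCK_BITS);
}

int main(void)
{
    uint32_t buffered = 0x0001A340u;                              /* entry in buffer */
    printf("same block: %d\n", may_merge(buffered, 0x0001A344u)); /* merges          */
    printf("other set : %d\n", may_merge(buffered, 0x0002B000u)); /* rejected early  */
    return 0;
}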
1.8 The three C’s model sorts the causes for cache misses into three categories:
• Compulsory – The very first access can never be in cache and is therefore bound to generate
a miss;
• Capacity – If the cache cannot contain all the blocks needed for a program, capacity misses may occur;
• Conflict – In direct-mapped or set-associative caches, conflict misses occur when too many blocks map to the same set, so a block is discarded and later retrieved.
1.9 The three C’s give insight into the cause of misses, but this simple model has its limits; it
gives you insight into average behavior but may not explain an individual miss. For example,
changing cache size changes conflict misses as well as capacity misses, since a larger cache spreads
out references to more blocks. Thus, a miss might move from a capacity miss to a conflict miss
as cache size changes. Note that the three C’s also ignore replacement policy, since it is difficult
to model and since, in general, it is less significant. In specific circumstances the replacement
policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity,
which is contradictory to the three C’s model.
8.2 Memory systems, Virtual Memory
2.1 b) Data structures often comprise lengthy sequences of memory addresses. Program access of a data structure often takes the form of a sequential sweep. Large data blocks work well with such access patterns; pre-fetching is likely still of value due to the highly sequential access patterns. The efficiency of data pre-fetch can be enhanced through a suitable grouping of the data items, taking the block limitations into account. This is especially noteworthy when the data structure exceeds the cache size. Under such circumstances, it becomes critically important to limit the number of out-of-cache block references.
2.2 a) We can expect software to be slower due to the overhead of a context switch to the
handler code, but the sophistication of the replacement algorithm can be higher for soft-
ware and a wider variety of virtual memory organizations can be readily accommodated.
Hardware should be faster, but less flexible.
b) Factors other than whether miss handling is done in software or hardware can quickly
dominate handling time. Is the page table itself paged? Can software implement a more
efficient page table search algorithm than hardware? What about hardware TLB entry
pre-fetching?
c) Page table structures that change dynamically would be difficult to handle in hardware
but possible in software.
2.3 a) – As the miss penalty tends to be severe, one usually decides on a sophisticated placement strategy; usually full associativity is chosen.
– To reduce address translation time, a cache is added to remember the most likely
translations, the Translation Lookaside Buffer.
– Almost all operating systems rely on a replacement of the least-recently used (LRU)
block indicated by a reference bit, which is logically set whenever a page is addressed.
– Since the cost of an unnecessary access to the next-lower level is high, one usually
includes a dirty bit. It allows blocks to be written to lower memory only if they have
been altered since reading.
b) The main purpose of the TLB is to accelerate the address translation for reading/writing
virtual memory. A TLB entry holds a portion of the virtual address, a physical page
frame number, a protection field, a valid bit, a use bit and a dirty bit. The latter two are not always used. The size of the page table is inversely proportional to the page size;
choosing a large page size allows larger caches with fast cache hit times with a small TLB.
A small page size conserves storage, limiting the amount of internal fragmentation. Their
combined effect can be seen in process start-up time, where a large page size lengthens
invocation time but shortens page renewal times. Hence, the balance goes for large pages
in large computers and vice-versa.
                     TLB                 1st-level cache       2nd-level cache        Virtual memory
Block size (bytes)   4-32                16-256                1-4k                   4096-65,536
Block placement      Fully associative   2/4-way set assoc.    8/16-way set assoc.    Direct mapped
Overall size         32-8,192 B          1 MB                  2-16 MB                32 MB - 1 TB
2.4 Out-of-order (OOO) execution will change both the timing of and sequence of cache access
with respect to that of in-order execution. Some specific differences and their effect on what
cache design is most desirable are explored in the following.
Because OOO reduces data hazard stalls, the pace of cache access, both to instructions and
data, will be higher than if execution were in order. Thus, the pipeline demand for available
cache bandwidth is higher with OOO. This affects cache design in areas such as block size, write
policy, and pre-fetching.
Block size has a strong effect on the delivered bandwidth between the cache and the next
lower level in the memory hierarchy. A write-through write policy requires more bandwidth
to the next lower memory level than does write back, generally, and use of a dirty bit further
reduces the bandwidth demand of a write-back policy. Pre-fetching increases the bandwidth
demand. Each of these cache design parameters – block size, write policy, and pre-fetching –
is in competition with the pipeline for cache bandwidth, and OOO increases the competition.
Cache design should adapt for this shift in bandwidth demand toward the pipeline.
Cache accesses for data – and, because of exceptions, also for instructions – occur during execution. OOO execution will change the sequence of these accesses and may also change their pacing.
A change in sequence will interact with the cache replacement policy. Thus, a particular
cache and replacement policy that performs well on a chosen application when execution of the
superscalar pipeline is in order may perform differently – even quite differently – when execution
is OOO.
If there are multiple functional units for memory access, then OOO execution may allow bunching multiple accesses into the same clock cycle. Thus, the instantaneous or peak memory
access bandwidth from the execution portion of the superscalar can be higher with OOO.
Imprecise exceptions are another cause of change in the sequence of memory accesses from
that of in-order execution. With OOO some instructions from earlier in the program order may
not have made their memory accesses, if any, at the time of the exception. Such accesses may
become interleaved with instruction and data accesses of the exception-handling code. This
increases the opportunity for capacity and conflict misses. So a cache design with size and/or
associativity to deliver lower numbers of capacity and conflict misses may be needed to meet
the demands of OOO.
2.5 First, we use the miss penalty and miss rate information to compute the contribution to CPI from cache misses for each configuration. We do this with the formula

    Cache CPI = Misses per instruction × Miss penalty (in clock cycles)

where the miss penalty in clock cycles is the 50 ns main memory time multiplied by the clock rate (200, 250 and 125 cycles for processors 1, 2 and 3, respectively; processor 3 hides 25 % of this penalty).
We know the pipeline CPI contribution for everything but processor 3; its pipeline CPI is given by

    Pipeline CPI = 1 / Issue rate = 1 / (9 × 0.5) = 1 / 4.5 = 0.22

Now we find the CPI for each processor by adding the pipeline and cache CPI contributions:

    1: 0.8 + 0.005 × 200 = 1.8
    2: 1.0 + 0.0055 × 250 ≈ 2.4
    3: 0.22 + 0.01 × 125 × 0.75 ≈ 1.16

Since this is the same architecture, we can compare instruction execution rates in millions of instructions per second (MIPS) to determine relative performance as clock rate / CPI:

    1: 4000 MHz / 1.8 = 2222 MIPS
    2: 5000 MHz / 2.4 = 2083 MIPS
    3: 2500 MHz / 1.16 = 2155 MIPS
In this example, the simple two-issue static superscalar looks best. In practice, performance
depends on both the CPI and clock rate assumption.
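The arithmetic above can be reproduced with a few lines of C. The 50 ns memory time is converted to cycles using each processor's clock rate, and processor 3 is assumed to hide 25 % of the penalty through dynamic scheduling, as stated in the exercise; note that the text above rounds the CPI values (1.8, 2.4, 1.16) before computing MIPS.

/* Sketch: CPI and MIPS for the three hypothetical processors in Exercise 2.5. */
#include <stdio.h>

int main(void)
{
    const double mem_ns  = 50.0;                          /* main memory time            */
    const double clk[3]  = {4.0, 5.0, 2.5};               /* clock rate in GHz           */
    const double pipe[3] = {0.8, 1.0, 1.0 / (9.0 * 0.5)}; /* pipeline CPI                */
    const double mpi[3]  = {0.005, 0.0055, 0.01};         /* misses per instruction      */
    const double expo[3] = {1.0, 1.0, 0.75};              /* fraction of penalty exposed */

    for (int i = 0; i < 3; i++) {
        double penalty = mem_ns * clk[i];                 /* 200, 250, 125 cycles        */
        double cpi  = pipe[i] + mpi[i] * penalty * expo[i];
        double mips = clk[i] * 1000.0 / cpi;              /* clock in MHz divided by CPI */
        printf("Processor %d: CPI = %.2f, about %.0f MIPS\n", i + 1, cpi, mips);
    }
    return 0;
}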
2.6 a) For: a larger page size means a smaller page table (and a TLB that maps more memory, so fewer TLB misses); it allows larger caches with fast cache hit times; and transferring larger pages to or from disk is more efficient than transferring many small ones.
Against: larger pages lead to more wasted storage due to internal fragmentation.
b) In a virtual memory system, the virtual address is a logical address space for a process. It is translated by a combination of hardware and software into a physical address, which accesses main memory. This process is called memory mapping. The virtual address space is divided into pages (blocks of memory). A page fault is an access to a page which is not in physical memory. The TLB, Translation Lookaside Buffer, is a cache of address translations.
c) The page table takes (2^32 / 2^12) × 4 bytes = 2^22 bytes = 4 MB.
d) An inverted page table is like a fully associative cache where each page table entry contains the physical address and, as tag, the virtual address. With 512 MB = 2^29 bytes of physical memory it takes (2^29 / 2^12) × (4 + 4) bytes = 2^20 bytes = 1 MB.
8.3 Storage systems, I/O
3.1 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900
8.4 Multiprocessors I
4.1 • Snooping
– Status for a block is stored in every cache that has a copy of the block.
– Sends all requests to all processors (broadcast).
– Caches monitor (snoop) the shared memory bus to update status and take actions.
– Popular with single shared memory.
• Directory based
  – Status for a block is stored in one location (the directory).
  – Messages are used to update status.
  – Scales better than snooping.
  – Popular with distributed shared memory.
4.3 The problem illustrates the complexity of cache coherence protocols. In this case, this
could mean that the processor P1 evicted that cache block from its cache and immediately
requested the block in subsequent instructions. Given that the write-back message is longer
than the request message, with networks that allow out-of-order requests, the new request can
arrive before the write-back arrives at the directory. One solution to this problem would be
to have the directory wait for the write-back and then respond to the request. Alternatively,
the directory can send out a negative acknowledge (NACK). Note that these solutions need to
be thought out very carefully since they have the potential to lead to deadlocks based on the
particular implementation details of the system. Formal methods are often used to check for
races and deadlocks.
4.4 The question is to consider a design that is based on a mesh interconnect rather than on a
bus. The idea behind such a design is that local accesses will be faster than a pure shared-memory
approach since access to local memory does not need to go across a shared bus. Additionally,
the cost of a remote access will be a function of start-up time and number of hops across the
network rather than the time to acquire the bus.
a) The cost of local references is easy to compute. Local references require 40 clocks to detect the L2 miss, 66 clocks to deliver the data, and 58 clocks to reload the caches, yielding a total of 164 clocks.
b) For this part of the problem, we are asked to factor in the cost of references that need to
travel across the mesh network and to compute the average remote memory access time
(ARMAT). Since network clocks and processor clocks take different amounts of time, we
refer to network clocks as ‘nclk’ and to processor clocks as ‘pclk’. From the above we
already know that the cost on a shared-memory design is 164 pclks.
For the case of the distributed memory, the result depends on how one measures the average number of hops that a reference must make in the network, so we make the simple assumption that each remote reference travels on average 1.5 hops in the X-direction and 1.5 hops in the Y-direction to reach its target node. Hence the total average distance is 3 hops.
For the complete request, the following times must be added together: the time for the
L2 miss to be recognized locally, the time for the address request to go across the network
(assume only 32 bits are needed for this message), the time for the remote memory to
respond (150 ns), the time for the data to return over the network (a 128-byte cache line),
and the time for the caches to be reloaded. The time across the network is based on the
number of hops and the size of the message. We are given that 2 bytes can be sent every
nclk, and so the time through a switch is
Time To Send = (Number of Bytes ) / 2 bytes per nclk.
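The answer breaks off here in the original. The sketch below combines the listed components under the stated assumptions: 3 hops on average, a 32-bit request, a 128-byte reply, and the same 40-pclk miss detection and 58-pclk reload as in part a). Solutions differ in exactly how start-up time and hops are counted, so the resulting number should be read as approximate.

/* Sketch: average remote memory access time on the 2-D grid.
 * Assumptions: message time = 5 nclk start-up + 1 nclk per hop + bytes/2 nclk,
 * with 3 hops on average; miss detection and cache reload cost the same as in
 * the bus-based case (40 and 58 processor clocks). */
#include <stdio.h>

int main(void)
{
    const double pclk_ns = 1000.0 / 150.0;    /* processor clock, ~6.67 ns */
    const double nclk_ns = 1000.0 / 100.0;    /* network clock, 10 ns      */
    const double hops    = 3.0;               /* 1.5 in X + 1.5 in Y       */

    double req_nclk   = 5.0 + hops + 4.0 / 2.0;    /* 32-bit address request */
    double reply_nclk = 5.0 + hops + 128.0 / 2.0;  /* 128-byte cache block   */

    double total_ns = 40.0 * pclk_ns         /* detect L2 miss, drive address */
                    + req_nclk * nclk_ns     /* request crosses the mesh      */
                    + 150.0                  /* memory access time            */
                    + reply_nclk * nclk_ns   /* reply crosses the mesh        */
                    + 58.0 * pclk_ns;        /* reload cache, restart pipe    */

    printf("average remote access ~ %.0f ns = %.0f processor clocks\n",
           total_ns, total_ns / pclk_ns);
    return 0;
}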
4.6 a) To realize SMT, we need a per-thread renaming table, separate PC registers, and the capability for instructions from multiple threads to commit.
b) Informally, cache coherence means that a value read from the memory system should reflect the latest write to that same memory location. For an example of what happens when cache coherence is missing, refer to the book, Figure 4.3 (page 206).
c) The correct connections:
Request   Action
A         4
B         6
C         13
D         7
E         1
F         2
G         8
H         12
I         5
J         14
K         11
L         9 or 3
M         3 or 9
N         10
Note: 3 and 9 are equivalent.
8.5 Multiprocessors II
5.1 The general form of Amdahl's Law is shown on the inside front cover of the text; all that needs to be done to compute the formula for speedup in this multiprocessor case is to derive the new execution time. The exercise states that the fraction of the original execution time that can use i processors is given by F(i, p). If we let Execution time_old be 1, then the relative time for the application on p processors is given by summing the times required for each portion of the execution time that can be sped up using i processors, where i is between 1 and p. This yields

    Execution time_new = \sum_{i=1}^{p} \frac{F(i, p)}{i}

Substituting this value for Execution time_new into the speedup equation makes Amdahl's Law a function of the available processors p:

    Speedup(p) = \frac{1}{\sum_{i=1}^{p} F(i, p) / i}
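As a concrete illustration, the sketch below evaluates the rewritten law for a hypothetical application in which 60 % of the original execution time can use all p processors and 40 % can use only one; this distribution F(i, p) is invented purely for the example.

/* Sketch: Speedup(p) = 1 / sum_{i=1..p} F(i,p)/i for a made-up F(i,p). */
#include <stdio.h>

static double F(int i, int p)
{
    if (p == 1) return (i == 1) ? 1.0 : 0.0;
    if (i == p) return 0.6;    /* fraction that can use all p processors (assumed) */
    if (i == 1) return 0.4;    /* fraction limited to one processor (assumed)      */
    return 0.0;
}

int main(void)
{
    for (int p = 1; p <= 64; p *= 2) {
        double t_new = 0.0;
        for (int i = 1; i <= p; i++)
            t_new += F(i, p) / i;      /* time of the portion that can use i CPUs */
        printf("p = %2d  speedup = %.2f\n", p, 1.0 / t_new);
    }
    return 0;
}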
5.2 An obvious complication introduced by providing a valid bit per word is the need to match
not only the tag of the block but also the offset within the block when snooping the bus. This
is easy, involving just looking at a few more bits. In addition, however, the cache must be
changed to support write-back of partial cache blocks. When writing back a block, only those
words that are valid should be written to memory because the contents of invalid words are not
necessarily coherent with the system. Finally, given that the state machine of Figure 6.12 is
applied at each cache block, there must be a way to allow this diagram to apply when state can
be different from word to word within a block. The easiest way to do this would be to provide
the state information of the figure for each word in the block. Doing so would require much
more than one valid bit per word, though. Without replication of state information the only
solution is to change the coherence protocol slightly.
5.3 Executing the identical program on more than one processor improves system ability to
tolerate faults. The multiple processors can compare results and identify a faulty unit by its
mismatching results. Overall system availability is increased.
5.4 Analytical models can be used to derive high-level insight on the behavior of the system in a
very short time. Typically, the biggest challenge is in determining the values of the parameters.
In addition, while the results from an analytical model can give a good approximation of the
relative trends to expect, there may be significant errors in the absolute predictions.
Trace-driven simulations typically have better accuracy than analytical models, but need
greater time to produce results. The advantages are that this approach can be fairly accurate
when focusing on specific components of the system (e.g., cache system, memory system, etc.).
However, this method does not model the impact of aggressive processors (mispredicted path)
and may not model the actual order of accesses with reordering. Traces can also be very large,
often taking gigabytes of storage, and determining sufficient trace length for trustworthy results
is important. It is also hard to generate representative traces from one class of machines that will
be valid for all the classes of simulated machines. It is also harder to model synchronization on
these systems without abstracting the synchronization in the traces to their high-level primitives.
Execution-driven simulation models all the system components in detail and is consequently
the most accurate of the three approaches. However, its speed of simulation is much slower than
that of the other models. In some cases, the extra detail may not be necessary for the particular
design parameter of interest.
8.6 Old exam 2003-12-17
Problem 1
b) Straight line code sequence with no branches in except at entry and no branches out except
at the exit.
c) A page table that uses hashing techniques to reduce the size of the page table so that the
length is equal to the number of physical pages in memory.
e) A set of physical registers holds both the architecturally visible registers and temporary data. During instruction issue, architectural registers are mapped to physical registers. Register renaming is used to get rid of WAR and WAW hazards.
g) An attempt to predict which way (block within the set) the next cache access will go to. It allows the multiplexer that selects the cache block to be set up early.
i) Sequential consistency requires that the result of any execution be the same as if the
memory accesses executed by each processor were kept in program order.
j) A GPR (general-purpose register) architecture has only explicit operands, either memory locations or registers, as opposed to implicit operands like a stack top or an accumulator.
Problem 2
a) In real life, bigger memory is slower and faster memory is more expensive. We want to simultaneously increase the speed and decrease the cost. Speed is important because of the widening performance gap between CPU and memory. Size is important since applications and data sets are growing bigger. The solution is to use several types of memory with varying speeds, arranged in a hierarchy that is optimized with respect to the use of memory. Mapping functions provide address translations between the levels.
Problem 3
a) Speedup = T_unpipelined / (max(T_pipestage) + T_latch)
b) Speedup = noofstages / (1 + bf × bp)
c) 1: MOV  R3, R7
   2: LD   R8, (R3)
   3: ADDI R3, R3, 4
   4: LD   R9, (R3)
   5: BNE  R8, R9, Loop
   WAW: 1,3
   RAW: 1,2; 1,3; 2,5; 3,4; 4,5
   WAR: 2,3
Problem 4
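No worked solution is given for Problem 4 here. The sketch below shows one way the comparison could be set up; it assumes that the single cache is unified (instruction fetches and data accesses both go through it) and that a miss to a dirty block first writes the old block back at the same cost as a block fetch (50 ns + 8 x 4 ns = 82 ns). Those assumptions are not stated in the problem, so the absolute numbers should be treated with care.

/* Sketch (not the official solution): price/performance for computers A-C. */
#include <stdio.h>

int main(void)
{
    const double base_cpi      = 1.1;
    const double acc_per_instr = 1.0 + 0.258 + 0.134;          /* fetch + load + store */
    const double block_ns      = 50.0 + (128.0 / 16.0) * 4.0;  /* 82 ns per block move */

    const char  *name[3]  = {"A", "B", "C"};
    const double ghz[3]   = {2.4, 2.4, 3.0};
    const double miss[3]  = {0.0321, 0.0075, 0.0075};          /* 128 KB vs 512 KB     */
    const double dirty[3] = {0.30, 0.35, 0.35};                /* dirty-block fraction */
    const double sek[3]   = {650.0, 1495.0, 2595.0};           /* price                */

    for (int i = 0; i < 3; i++) {
        double penalty = (1.0 + dirty[i]) * block_ns * ghz[i]; /* cycles per miss      */
        double cpi     = base_cpi + acc_per_instr * miss[i] * penalty;
        double mips    = ghz[i] * 1000.0 / cpi;
        printf("%s: CPI = %5.2f, %6.1f MIPS, %.3f MIPS/SEK\n",
               name[i], cpi, mips, mips / sek[i]);
    }
    return 0;
}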
Problem 5
a) – Snooping
∗ Status for a block is stored in every cache that has a copy of the block.
∗ Send all requests to all processors (broadcast)
∗ Caches monitor (snoop) the shared memory bus to update status and take ac-
tions.
∗ Popular with single shared memory.
– Directory based
∗ Status for a block is stored in one location (the directory).
∗ Messages used to update status.
∗ Scales better than snooping
∗ Popular with distributed shared memory.
b) SIMD (Single Instruction stream, Multiple Data stream): e.g. vector processors.
MIMD (Multiple Instruction stream, Multiple Data stream): e.g. multiprocessors such as Symmetric shared-memory Multiprocessors (SMP) with Uniform Memory Access time (UMA) and a bus interconnect.