M116C 1 EE116C-Midterm2-w15 Solution
M116C
Computer Systems Architecture
Prof. Lei He
Solution
Problem 1. (10 points) Explain the terms or answer the short questions. For example:
the program counter (PC) is the register containing the address of the instruction in
the program being executed. (Hint: if you do not know how to explain a term
precisely, you may use examples to explain it.)
(1) Explain the concept of delayed load AND give an example piece of code
The loaded data is available one clock cycle after the load instruction. For
example in the code:
lw  $1, 0($2)
add $3, $1, $4
The loaded data $1 from the first instruction is not available right after the
instruction. For the add instruction, we need to add nop before it or stall one
clock cycle to wait for $1 to be available.
(2) Explain the concept of loop unrolling and why we perform loop unrolling
Unroll the loop body n times and rename registers to remove name dependences,
producing one larger loop body. Loop unrolling exposes more instruction-level
parallelism and reduces loop overhead (branches and index updates), and therefore
improves performance.
(3) Name three techniques (in either software or hardware) to resolve branch
hazards.
Stalling the pipeline, branch prediction, and delayed branches (filling the branch
delay slot).
(4) Explain how a piece of hardware can be shared in a multi-cycle
implementation.
A piece of hardware can be shared if it is used by the same instruction in
different clock cycles. For instance, the PC increment and R-type execution both
make use of the same ALU, but they do so in different cycles.
(5) A single-cycle implementation may be divided into five stages for
pipelining. Compare the average CPI between single-cycle and ideal pipelining
implementations and explain why pipelining may improve performance.
The average CPI of both the single-cycle and the ideal pipelined implementation is 1.
But the critical path (and hence the clock cycle) of the single-cycle implementation
is much longer (usually N times longer, where N is the number of pipeline stages;
N = 5 in this problem) than that of the ideal pipeline. Therefore, pipelining can
improve performance.
Problem 2. (10 points) In this exercise, we examine how data dependences affect execution in the
basic 5-stage pipeline described in the textbook. Problems in this exercise refer to the following sequence
of instructions:
lw $5, -16($5)
sw $5, -16($5)
add $5, $5, $5
Also, assume the following cycle times for each of the options related to forwarding:
Without forwarding:           220ps
With full forwarding:         240ps
With ALU-ALU forwarding only: 230ps
I1: lw $5,-16($5)
I2: sw $5,-16($5)
I3: add $5,$5,$5
2) Assume there is no forwarding in this pipelined processor. Indicate hazards and add NOP
instructions to eliminate them.
Without forwarding, I2 and I3 each have a RAW hazard on $5 with I1, whose result is not
written back until its WB stage. Assuming the register file is written in the first half of
a cycle and read in the second half, two nops after I1 eliminate the hazards:
lw $5,-16($5)
nop
nop
sw $5,-16($5)
add $5,$5,$5
3) Assume there is full forwarding. Indicate hazards and add NOP instructions to eliminate them.
With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction
without a hazard. However, a load cannot forward to the EX stage of the next instruction (but it can
to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:
lw $5,-16($5)
nop            (delay I2 to avoid the RAW hazard on $5 from I1)
sw $5,-16($5)  (value for $5 is forwarded from I1 now)
add $5,$5,$5   (note: no RAW hazard on $5 from I1 now)
4) What is the total execution time of this instruction sequence WITHOUT forwarding and WITH full
forwarding? What is the speedup achieved by adding full forwarding to a pipeline that had no
forwarding?
The total execution time is the clock cycle time times the number of cycles. Without any stalls, a
three-instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per
instruction). Execution without forwarding must add a stall cycle for every nop we had in part 2),
and execution with full forwarding must add a stall cycle for every nop we had in part 3). Overall,
we get:
Without forwarding:   (7 + 2) × 220ps = 1980ps
With full forwarding: (7 + 1) × 240ps = 1920ps
Speedup: 1980ps / 1920ps = 1.03
5) Add NOP instructions to this code to eliminate hazards if there is ALU-ALU forwarding only.
With ALU-ALU forwarding only, the result of the load in I1 cannot be forwarded, so I2 and I3
must again wait for I1's WB stage; two nops after I1 eliminate the hazards:
lw $5,-16($5)
nop
nop
sw $5,-16($5)
add $5,$5,$5
6) What is the total execution time of this instruction sequence with only ALU-ALU forwarding?
With ALU-ALU forwarding only: (7 + 2) × 230ps = 2070ps
Problem 3. (10 points): Assume that we have a five-stage machine, the same as the one in the
textbook. For the following code,
(a) sub $2, $5, $4
(b) add $4, $2, $5
(c) lw  $2, 100($4)
(d) add $5, $2, $4
(1) Identify the data dependences in the code above.
In (b), $2 depends on (a); in (c), $4 depends on (b); in (d), $2 depends on (c) and $4 depends
on (b).
(2) Which data hazards can be resolved by renaming? Write down the code after renaming and
with minimal data hazards.
Write-after-write hazards can be resolved by renaming. After renaming, the code will look like
the following ($2 in (c) and (d) is renamed to $6):
(a) sub $2, $5, $4
(b) add $4, $2, $5
(c) lw  $6, 100($4)
(d) add $5, $6, $4
(3) After renaming, which data hazards can be resolved via forwarding? Illustrate all the
forwarding using 5-stage pipelining figures similar to those in the textbook.
[5-stage pipeline figures omitted: they illustrate the forwarding paths for $2, $4, and $6
after renaming.]
Problem 5. (15 points) Media applications that play audio or video files are part of a
class of workloads called streaming workloads; i.e., they bring in large amounts of
data but do not reuse much of it. Consider a video streaming workload that accesses a
512 KB working set sequentially with the following address stream:
a. Assume a 64 KB direct-mapped cache with a 32-byte line. What is the miss rate for
the address stream above? How is this miss rate sensitive to the size of the cache or
the working set? How would you categorize the misses this workload is
experiencing, explain what causes the misses?
6.25% miss rate. The miss rate is insensitive to the size of the cache or of the working set.
These are cold (compulsory) misses: each block is being brought in from memory for the first
time and is never reused.
b. Re-compute the miss rate when the cache line (block) size is 16 bytes, 64 bytes, and
128 bytes. What kind of locality is this workload exploiting?
12.5%, 3.125%, and 1.5625% miss rates (i.e., 1/8, 1/32, and 1/64) for 16-byte, 64-byte, and
128-byte blocks, respectively. The workload is exploiting spatial locality.
c. Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI
machine with an average of 1.35 references (both instruction and data) per
instruction, help find the optimal block size given the following miss rates for
various block sizes.
Size (bytes):  8    16   32   64    128
Miss rate:     4%   3%   2%   1.5%  1%
d. What is the optimal block size for a miss latency of 20 × B cycles?
8-byte. The memory stall time per instruction is 1.35 × miss rate × (20 × B) cycles,
which is smallest for B = 8 (1.35 × 4% × 160 = 8.64 cycles per instruction, versus
1.35 × 3% × 320 = 12.96 for B = 16, and larger still for bigger blocks).