q3.fa24
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
6.191 Computation Structures
Fall 2024
Quiz #3
Problem 2: /18
Problem 3: /18
Problem 4: /16
Problem 5: /19
Problem 6: /13
Recitation section
o WF 10, 34-301 (Varun) o WF 2, 34-303 (Pleng) o WF 12, 34-303 (Ezra)
o WF 11, 34-301 (Varun) o WF 3, 34-303 (Pleng) o WF 1, 34-303 (Ezra)
o WF 12, 34-302 (Keshav) o WF 10, 34-302 (Hilary) o WF 2, 34-302 (Jessica)
o WF 1, 34-302 (Keshav) o WF 11, 34-302 (Hilary) o WF 3, 34-302 (Jessica)
o opt-out
Please enter your name, Athena login name, and recitation section above. Enter your answers
in the spaces provided below. Show your work for potential partial credit. You can use the extra
white space and the back of each page for scratch work.
Consider the following two processes, A and B, running on a standard RISC-V processor. Code
listings use virtual addresses.
Process A:
. = 0x200
loopA:
    li a0, 0x10
    li a7, 0x1B
    ecall
    li a0, 0x150
    li a7, 0x1B
    ecall
    j loopA

Process B:
. = 0x600
loopB:
    li a0, 0x0
    li a7, 0x1B
    ecall
    li a0, 0x200
    li a7, 0x1B
    ecall
    li t0, 0x790
    sw a0, 0(t0)
    j loopB
These processes run on a custom OS that supports segmentation-based (base and bound) virtual
memory, timer interrupts for scheduling processes, and a print_string system call for printing
strings.
Processes invoke syscalls with the ecall instruction. The print_string system call takes the
address of a string to print as the argument in register a0, and syscall number 0x1B in register a7.
It returns the length of the string that was printed. Note that the length of all of the strings
is 13.
Assume virtual addresses are translated with the following base and bound registers:
Process A: base register = 0x100, bound register = 0x180
Process B: base register = 0x700, bound register = 0x1000
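As a rough illustration of how base-and-bound translation works, the following Python sketch assumes the common convention that the physical address is the base plus the virtual address, and that an access faults when the virtual address is not below the bound. The function name `translate` and the register values in the example are hypothetical and deliberately differ from the ones in this problem.

```python
def translate(vaddr, base, bound):
    """Translate a virtual address under base-and-bound segmentation.

    Assumed convention (not spelled out in the quiz text):
    physical = base + virtual, and any virtual address >= bound
    is a segmentation fault.
    """
    if vaddr < 0 or vaddr >= bound:
        raise MemoryError(f"segmentation fault at virtual address {vaddr:#x}")
    return base + vaddr


# Hypothetical register values, chosen for illustration only.
print(hex(translate(0x20, base=0x4000, bound=0x100)))  # prints 0x4020
```

Under this convention, an access at or beyond the bound never reaches memory; the sketch signals that case with an exception where real hardware would raise a fault for the OS to handle.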
(A) (2 points) What is the physical address of the start of the string located at 0x10 in Process A
and the start of the string located at 0x200 in Process B?
(C) (2 points) Assume that all segmentation faults have been fixed. You decide to test only
Process A first to see if it behaves as expected, and get the following incorrect output, where
“Process A: 1” is printed repeatedly.
Process A: 1
Process A: 1
Process A: 1
…
You isolate the issue to the handling of the exception specified by the mcause register. What
bug in the code that handles this exception is causing the incorrect behavior?
(D) (3 points) Assume that all issues have been fixed and the processes behave as expected. Say
that Process A was scheduled first, and runs until the first ecall instruction completes and
returns from the common handler. What are the values in the a0, a7, and pc (in virtual
address) registers? Write CAN’T TELL if you can’t tell a value from the information given.
a0: ________________________
a7: ________________________
pc: ________________________
Process A: 1
Process B: 1
Process B: 2
Process A: 2
You pause the program immediately after the last line finishes printing and returns from the
common handler. What are the values in the a0, t0, and pc (in virtual address) registers?
Write CAN’T TELL if you can’t tell a value from the information given.
a0: ________________________
t0: ________________________
pc: ________________________
(F) (3 points) Still assume that all issues have been fixed. Process A and B are now run until 4
lines are printed. For the following outputs, specify if that output could have been produced
by our programs or not.
Outputs:
Output 1        Output 2        Output 3
Process B: 1    Process A: 1    Process A: 1
Process A: 1    Process A: 2    Process B: 2
Process B: 2    Process A: 1    Process B: 1
Process B: 1    Process B: 1    Process A: 2
Consider a RISC-V processor that has 16-bit virtual addresses, 2!" bytes of physical memory,
and uses a page size of 2^12 bytes.
(A) (2 points) Calculate the following parameters relating to the size of the page table assuming a
single-level (flat) page table. Each page table entry contains a physical page number, a dirty
bit, and a resident bit. Your final answer can be a product or exponent.
(B) (1 point) Instead of using a page size of 2^12 bytes, say we decide to use a page size of 2$
bytes. What is the ratio of the page table sizes with the new page size of 2$ bytes, compared
to the old page size of 2^12 bytes? Your final answer can use fractions, products, and
exponents.
For the rest of the problem, keep the page size as 2^12 bytes, and assume a hierarchical page
table structure of 2 levels, with address mapping as shown below:
It is given to you that the number of bits in the 1st and 2nd level indices are equal.
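To make the address split concrete, here is a small Python sketch of how a virtual address divides into a first-level index, a second-level index, and a page offset when the two indices have equal width. The helper name `split_vaddr` and the 32-bit example address are hypothetical; the quiz's own address width is different.

```python
def split_vaddr(vaddr, va_bits, offset_bits):
    """Split a virtual address into (l1_index, l2_index, offset).

    Assumes the VPN bits divide evenly between the two levels,
    with the first-level index in the most significant bits.
    """
    index_bits = (va_bits - offset_bits) // 2
    offset = vaddr & ((1 << offset_bits) - 1)
    vpn = vaddr >> offset_bits
    l2_index = vpn & ((1 << index_bits) - 1)   # low half of the VPN
    l1_index = vpn >> index_bits               # high half of the VPN
    return l1_index, l2_index, offset


# Hypothetical 32-bit address with 2^12-byte pages (10-bit indices).
l1, l2, off = split_vaddr(0x12345678, va_bits=32, offset_bits=12)
print(hex(l1), hex(l2), hex(off))  # prints 0x48 0x345 0x678
```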
(C) (2 points) Calculate the following parameters relating to the size of each second-level page
table. Each second-level page table entry contains a physical page number, a dirty bit, and a
resident bit. Your final answer can be a product or exponent.
(D) (8 points) You now run a test program on this processor. Execution of this test program is
halted just before executing the following two instructions. The state of the hierarchical page
table is shown below; the least recently used page (“LRU”) and next least recently used page
(“next LRU”) are indicated where necessary. x1 has been set to 0x9000. Assume all
physical pages are in use. Execution resumes and the following two instructions are
executed:
. = 0xBFFC
lw x12, 0xF(x0)
sw x12, 0x0(x1) // x1 = 0x9000
(E) (2 points) Also, specify which PPN(s) were evicted, and which were written back to memory
during execution of the two instructions from part (D). If there are no pages to list, then enter
NONE.
Evicted PPN(s) (hex): ____________________
. = 0xB2A0
sw x3, 4(x4) // x4 = 0x9204
The contents of the TLB and the hierarchical page table are shown below. Assume that all
physical pages are currently in use. Assume that we use an LRU replacement policy on the
2nd level of Page Tables.
TLB
        VPN   V   R   D   PPN
LRU →   0x1   1   1   1   0x0
        0x9   1   1   0   0xB
        0x8   1   1   0   0x4
        0x0   0   0   0   0x99
Fill out the updated state of the TLB after these operations. You may mark a row as “NO
CHANGE” if it remains unchanged. Please write all numerical values in hexadecimal.
TLB
VPN V R D PPN
We are trying to run the following piece of code. Unfortunately, our processor does not
implement the divide instruction. Instead, we choose to emulate the instruction within the
operating system.
. = 0x000
    addi a0, x0, 0x1
    addi a1, x0, 0x1
    bnez a1, second
first:
    div a2, a1, a0
    addi a3, x0, 0x1
second:
    div a2, a1, a0
    add x0, x0, x0
    add x0, x0, x0
    add x0, x0, x0
    …
// Kernel space
common_handler:
    csrw mscratch, x1
    lw x1, curProc
    sw x2, 0x8(x1)
    sw x3, 0xc(x1)
    sw x4, 0x10(x1)
    sw x5, 0x14(x1)
    sw x6, 0x18(x1)
    …
Note that the division instruction exception is detected in the decode stage.
(A) (4 points) Fill in all the white boxes in the 5-stage pipeline diagram for the execution of this
code. Assume branches are resolved in the execute stage, and there is full bypassing.
Assume that exceptions are handled lazily (at the commit point). You do not need to
include bypassing arrows.
Cycle 0 1 2 3 4 5 6 7 8 9
IF
DEC
EXE
MEM
WB
Cycle 0 1 2 3 4 5 6 7 8 9
IF
DEC
EXE
MEM
WB
(C) (2 points) Suppose Alice is writing a program. Alice forgets about the ecall instruction and
instead uses a jump instruction to call the common handler directly.
example_program:
    li a0, 32
    li a1, 10
    li a7, SYS_SEMINIT
    j common_handler
common_handler:
    // Save all the registers into the curProc data structure
    csrw mscratch, x1
    ...
    // Setup the necessary registers to call the dispatcher
    // Call the dispatcher
    // Load all the registers from the curProc data structure
    // Return to the calling process
    mret
Assume:
• The page table initially has allocated one page for VPN 0x0.
• Each page is 2^12 bytes
• Page faults are handled by the OS
• None of these instructions cause a segmentation fault.
• If the process encounters a division by zero, it is immediately killed by the OS.
• The division instruction is implemented in hardware.
PC Instruction
0x00 lui a1, 1 // a1 = 0x1000
0x04 lui a2, 5 // a2 = 0x5000
0x08 lw a4, 0x40(a1)
0x0c lw a5, 0x0(a1)
0x10 lw a6, 0x0(a2)
0x14 li a0, 0
0x18 li a7, SYS_PUTCHAR
0x1c ecall
0x20 add a0, x0, x0
0x24 div a2, a2, a0
Page fault
Timer interrupt
System call
Division by zero
Illegal opcode
The Earth Space Research Organization (ESRO) has decided to launch a rocket to provide
supplies to the astronauts living in the World Space Station (WSS). ESRO has equipped the
rocket with five boosters.
All boosters run the same code. Each booster must complete a set of pre-launch checks. Each
booster can independently run the pre-launch checks. Finally, the boosters must ignite only
after all five boosters have completed the pre-launch checks. Each booster must call ignite
for itself.
Engineers at ESRO have written the following code to help ignite the boosters:
Shared Memory:
int num_ready_boosters = 0;
booster_code:
    prelaunch_check()
    num_ready_boosters = num_ready_boosters + 1
    if (num_ready_boosters == 5) {
        ignite()
    }
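Because the counter lives in shared memory, it can help to think of the increment as separate load and store steps. The Python sketch below replays two hand-written interleavings of those steps for two boosters. Splitting the increment this way is an assumption about how the statement executes (the quiz does not show the compiled code), and the function name `run` and the schedules are hypothetical.

```python
def run(schedule):
    """Replay an interleaving of load/store steps on a shared counter.

    schedule is a list of (booster_id, step) pairs, where step is
    "load" or "store". Each booster performs one increment.
    """
    shared = 0
    local = {}
    for booster, step in schedule:
        if step == "load":
            local[booster] = shared        # read the shared counter into a register
        else:
            shared = local[booster] + 1    # write back register + 1
    return shared


# Two boosters, each incrementing once.
serial = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]
interleaved = [(0, "load"), (1, "load"), (0, "store"), (1, "store")]
print(run(serial), run(interleaved))  # prints 2 1
```

In the second schedule both boosters read the counter before either writes it back, so one update is lost.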
(A) (4 points) Using the booster code given above, answer if the following conditions are
possible:
2. All boosters ignite after all five boosters have completed the pre-launch checks, and
none of the boosters ignite before all five boosters have completed the pre-launch
checks.
3. A booster ignites before all five boosters have completed the pre-launch checks.
Complete the code below. Notice that the ignite() function is now outside the if condition.
Hint: Think carefully about how the if statement affects the semaphore values.
Shared Memory:
int num_ready_boosters = 0;
___________________________
___________________________
booster_code:
prelaunch_check()
num_ready_boosters = num_ready_boosters + 1
if( num_ready_boosters == 5) {
ignite()
Ben Bitdiddle has a four-core processor system, where each core has its own cache. Ben has the
option to use either a snoopy-based, write invalidate MSI or a snoopy-based, write invalidate
MESI protocol, and is trying to decide which is better for optimizing the following code where
S1 and S2 are semaphores initialized to 0. Assume that X and Y map to different lines of the
cache.
Initial state X: I Y: I X: I Y: I
A: lw a1, X X: Y: X: Y:
A: sw a1, Y X: Y: X: Y:
B: lw a1, Y X: Y: X: Y:
B: sw a1, Y X: Y: X: Y:
A: lw a1, Y X: Y: X: Y:
A: sw a1, X X: Y: X: Y:
(A) (4 points) For each protocol, how many of each of the following bus requests occur for the
series of accesses listed above?
MSI
MESI
After observing how his code performs on MSI and MESI with just 2 cores, he thinks he has
enough information to decide which is better for his 4 core system.
(C) (2 points) Ben’s MESI protocol takes 10 ns longer than his MSI protocol per cache access.
Suppose each bus transaction (BusRd, BusRdX, BusWB) takes an additional 80 ns when it
occurs. For Ben’s system, which has a total of 10 data accesses (1 lw X, 1 sw X, 4 lw Y, 4 sw
Y), how many bus transactions need to be saved for the MESI protocol to take less total time
than the MSI protocol?
(D) (1 point) What is the maximum number of bus transactions that would be saved if Ben uses
the MESI protocol over the MSI protocol? (Hint: pay close attention to what the semaphores
guarantee about data access order)
(E) (1 point) Should Ben use his MSI or MESI protocol (circle one)?
MSI MESI
Fill in the following table using the protocol you selected in part E (this table WILL be
graded). Include all shared bus transactions. Include the address associated with each bus
transaction (e.g., BusRdX(Y)). If no bus transactions occur, write N/A in the corresponding
box. Cache states left blank will be assumed to be Invalid.
Initial state X: I Y: I X: I Y: I X: I Y: I X: I Y: I
A: lw a1, X X: Y: X: Y: X: Y: X: Y:
A: sw a1, Y X: Y: X: Y: X: Y: X: Y:
C: lw a1, Y X: Y: X: Y: X: Y: X: Y:
B: lw a1, Y X: Y: X: Y: X: Y: X: Y:
B: sw a1, Y X: Y: X: Y: X: Y: X: Y:
D: lw a1, Y X: Y: X: Y: X: Y: X: Y:
C: sw a1, Y X: Y: X: Y: X: Y: X: Y:
D: sw a1, Y X: Y: X: Y: X: Y: X: Y:
A: lw a1, Y X: Y: X: Y: X: Y: X: Y:
A: sw a1, X X: Y: X: Y: X: Y: X: Y:
Oh no, Didit is broken! The 6.004 TAs are scrambling to put together a grading script that
calculates student grades so that they can get class grades submitted before the deadline.
The student grades are stored in an array int G[S][N] such that S is the number of students in
the class, N is the total number of assignments in the class and G[s][n] corresponds to the grade
obtained by student number s on assignment number n.
Additionally, there’s an array called int W[N] which stores the per-assignment weights. The
TAs want to calculate the total score of the class in a register variable called sum. All the arrays
are stored in row-major configuration.
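As a reminder of what row-major storage implies for G, the sketch below computes the byte offset of G[s][n] from the start of the array. The function name `row_major_offset` and the 4-byte word size are assumptions for illustration; the quiz text does not state the size of an int.

```python
def row_major_offset(s, n, N, word_bytes=4):
    """Byte offset of G[s][n] in a row-major int G[S][N] array.

    Elements of one row (fixed s, varying n) are adjacent in memory;
    consecutive rows are N elements apart.
    """
    return (s * N + n) * word_bytes


# With N = 16 assignments: stepping n moves 4 bytes, stepping s moves 64 bytes.
print(row_major_offset(0, 1, 16), row_major_offset(1, 0, 16))  # prints 4 64
```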
(A) (2 points) A couple of TAs suggest the following two versions of the script:
Version A:
void grade() {
    int sum = 0;
    for (int n = 0; n < N; n++)
        for (int s = 0; s < S; s++)
            sum += W[n] * G[s][n];
}

Version B:
void grade() {
    int sum = 0;
    for (int s = 0; s < S; s++)
        for (int n = 0; n < N; n++)
            sum += W[n] * G[s][n];
}
Explanation:
(B) (3 points) Why does increasing the cache size slightly have such a large improvement in the
number of misses on the W array? Briefly explain.
Explanation:
(C) (4 points) Unfortunately, the TAs will be unable to get more cache in time for the deadline.
Help the TAs tile Version B of the code with a tile size of T. Assume that S and N are
divisible by T. Your goal is to minimize the number of misses on W, and you can only tile
either S or N. Which one should you tile on? Complete the code below showing a tiled
implementation of the code.
Version C:
void grade() {
int sum = 0;
for ( )
for ( )
for ( )
sum +=
}
(D) (4 points) Calculate the total number of data-misses on the W array using your tiled code from
part C. Assume the tiling factor T = 4 and that you are using the 12 word (3 x 4) cache.
Cache diagrams and code are available on the following pages for assistance and scratch
work.
Version B:
void grade() { // S = 64, N = 16
    int sum = 0;
    for (int s = 0; s < S; s++)
        for (int n = 0; n < N; n++)
            sum += W[n] * G[s][n];
}
Extra Copies:
Version C:
void grade() { // S = 64, N = 16, T = 4
int sum = 0;
for ( )
for ( )
for ( )
sum +=
}
Extra Copies:
END OF QUIZ 3!