HW5S24 Sol

Uploaded by

ruweiyan

University of Southern California

Department of Electrical Engineering


EE557 Spring 2024
Instructor: Murali Annavaram
Homework #5, Due: 11:59PM, Tuesday, April 27th
TOTAL SCORE: 110 Points
Problem 1 [Synchronization and Consistency] (20 pts.)
Consider the following program:
Lock L1
Read A
Read B
Write B
Write D
Read E
Unlock L1
Read F
Assume that there is a fence instruction that helps the programmer order the instructions.
The fence instruction ensures that younger memory operations are not issued until all older
memory operations have globally performed.
(a) Assume a machine that does not guarantee any ordering. Insert fences between the
instructions to guarantee SC, SC with store-to-load relaxation, and WO.
(b) Assume a machine that guarantees SC with store-to-load relaxation. Insert fences
between the instructions to guarantee SC and WO.

(a)
Sol.>
SC:          SC with store-to-load relaxation:   WO:
Lock L1      Lock L1                             Lock L1
Fence        Fence                               Fence
Read A       Read A                              Read A
Fence        Fence                               Read B
Read B       Read B                              Write B
Write B      Write B                             Write D
Fence        Fence                               Read E
Write D      Write D                             Fence
Fence        Read E                              Unlock L1
Read E       Fence                               Fence
Fence        Unlock L1                           Read F
Unlock L1    Read F
Fence
Read F

(b)
Sol.>
SC:          WO:
Lock L1      Lock L1
Read A       Read A
Read B       Read B
Write B      Write B
Write D      Write D
Fence        Read E
Read E       Unlock L1
Unlock L1    Fence
Fence        Read F
Read F
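
The fence placements in both tables follow a simple pattern, which the Python sketch below tries to capture. It is only one reading of the solution, not an official rule set. The assumptions are: Unlock is treated as a store, Lock is not treated as a store, two adjacent accesses to the same address are taken to stay in program order without a fence, and only adjacent pairs of operations are examined. The names `PROGRAM`, `required`, `guaranteed` and `place_fences` are hypothetical.

```python
# Hypothetical encoding of the Problem 1 program as (kind, address) pairs.
PROGRAM = [
    ("lock", "L1"), ("read", "A"), ("read", "B"), ("write", "B"),
    ("write", "D"), ("read", "E"), ("unlock", "L1"), ("read", "F"),
]


def is_store(kind):
    # Assumption: Unlock behaves like an ordinary store to the lock variable.
    return kind in ("write", "unlock")


def is_sync(kind):
    return kind in ("lock", "unlock")


def store_then_load(prev, nxt):
    return is_store(prev[0]) and nxt[0] == "read"


def required(prev, nxt, target):
    """Does the target model require prev to perform before nxt?"""
    if target == "WO":
        return is_sync(prev[0]) or is_sync(nxt[0])
    if target == "SC_RELAXED":
        return not store_then_load(prev, nxt)
    return True  # target == "SC": every pair is ordered


def guaranteed(prev, nxt, machine):
    """Does the machine already enforce this order without a fence?"""
    if prev[1] == nxt[1]:
        return True  # assumption: same-address accesses stay in order
    if machine == "SC_RELAXED":
        return not store_then_load(prev, nxt)
    return False  # machine == "NONE": nothing is ordered


def place_fences(program, machine, target):
    out = [program[0]]
    for prev, nxt in zip(program, program[1:]):
        if required(prev, nxt, target) and not guaranteed(prev, nxt, machine):
            out.append(("fence", None))
        out.append(nxt)
    return [kind if addr is None else f"{kind} {addr}" for kind, addr in out]
```

For example, `place_fences(PROGRAM, "SC_RELAXED", "SC")` yields fences only before Read E and before Read F, the two store-to-load pairs, which matches the SC column of part (b).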

Problem 2 [Lock] (15 pts.)


Someone uses the fetch&add atomic primitive to implement a barrier synchronization suitable for
a shared memory multiprocessor. To use the barrier, a processor must execute BARRIER (BAR,
N), where BAR is the barrier name and N is the number of processes that need to arrive at the
barrier before any of them can proceed. Assume that N has the same value in each use of barrier
BAR. The barrier should be capable of supporting the following code:
while(condition) {
    Compute for a while;
    BARRIER (BAR,N)
}
A proposed solution for implementing the barrier is the following:
BARRIER (B: Barvariable, N: Integer)
{
if (F&A(B,1) = N-1)
{
B:=0;
}
while (B != 0) do {};
}

1. What is the problem with this code? Write the code for BARRIER in a way that avoids
the problem.
2. Assume that the processor does not have a Fetch&add instruction. However, it has LL (load
locked) and SC (store conditional) instructions. Show how you would implement
Fetch&add using LL and SC.

Solution:

1. The problem with this code is that a process can be delayed in the while loop after
executing the fetch&add. During this delay, all other processes may pass the barrier
(the last arrival resets B to 0), and one of them may reach the barrier again and
increment B before the delayed process has observed B = 0. The delayed process then
remains stuck in the earlier barrier instance while the others have moved on to the next
one, an unintended outcome. A number of solutions are possible. One is to use a toggle
flag to differentiate between consecutive iterations.

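global boolean flag := true
BARRIER (B: Barvariable, N: Integer)
{
    boolean local_flag := not flag;    // value the flag takes once this barrier completes
    if (F&A(B,1) = N-1)                // last process to arrive
    {
        B := 0;                        // reset the counter for the next use
        flag := local_flag;            // toggle the flag to release the waiting processes
    }
    while (flag != local_flag) do {};  // wait for the toggle
}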
Here again a process could get delayed in the while loop. However, the global flag cannot be
toggled again until all N processes, including the delayed one, have arrived at the next use
of the barrier, so the delayed process always observes the flag value that releases it.
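
As a rough illustration, the toggle-flag barrier can be exercised with Python threads. This is a sketch under stated assumptions, not the course's reference code: fetch&add is emulated with a `threading.Lock` because Python has no such primitive, and the names `SenseBarrier` and `run_demo` are hypothetical.

```python
import threading
import time


class SenseBarrier:
    """Toggle-flag barrier for n threads (fetch&add emulated with a lock)."""

    def __init__(self, n):
        self.n = n
        self.count = 0          # plays the role of the barrier variable B
        self.flag = True        # global toggle flag
        self._lock = threading.Lock()

    def _fetch_and_add(self, inc):
        # Emulated atomic fetch&add: returns the old value of the counter.
        with self._lock:
            old = self.count
            self.count = old + inc
            return old

    def wait(self):
        local_flag = not self.flag
        if self._fetch_and_add(1) == self.n - 1:
            self.count = 0              # reset before releasing the others
            self.flag = local_flag      # toggle: releases the waiting threads
        while self.flag != local_flag:
            time.sleep(0)               # yield so other threads can run


def run_demo(n=4, iterations=50):
    """Return the iterations in which a thread left the barrier too early."""
    barrier = SenseBarrier(n)
    arrivals = [0] * iterations
    arrivals_lock = threading.Lock()
    errors = []

    def worker():
        for k in range(iterations):
            with arrivals_lock:
                arrivals[k] += 1        # "compute for a while"
            barrier.wait()
            if arrivals[k] != n:        # every thread must have arrived
                errors.append(k)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

Calling `run_demo()` returns an empty list when no thread ever leaves a barrier before all threads of that iteration have arrived.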

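2.
F&A(A,R1):
LOOP: LL   A,R2       ; R2 := Mem[A] and set the link
      ADD  R3,R2,R1   ; R3 := old value + increment
      SC   A,R3       ; try Mem[A] := R3; R3 is nonzero on success, 0 on failure
      BEQZ R3,LOOP    ; SC failed: retry
      RET             ; the old value is left in R2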
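
The retry behaviour of this loop can be illustrated with a small single-threaded Python model. It is a simplified sketch, not real hardware behaviour: the class `LLSCMemory` and its reservation set are hypothetical stand-ins for one memory word and the per-processor link bits.

```python
class LLSCMemory:
    """Toy model of one memory word with load-locked / store-conditional."""

    def __init__(self, value=0):
        self.value = value
        self.links = set()   # ids of processors holding a valid reservation

    def load_linked(self, pid):
        self.links.add(pid)
        return self.value

    def store_conditional(self, pid, new_value):
        if pid not in self.links:
            return False     # reservation lost: the store is not performed
        self.value = new_value
        self.links.clear()   # a successful store breaks every reservation
        return True


def fetch_and_add(mem, pid, inc):
    """Retry LL/SC until the store succeeds, then return the old value."""
    while True:
        old = mem.load_linked(pid)
        if mem.store_conditional(pid, old + inc):
            return old
```

If processor 0 executes its LL, processor 1 then completes a whole `fetch_and_add`, and processor 0 finally attempts its SC, the SC fails and processor 0 must retry, so no increment is lost.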

Problem 3 [Memory Consistency] (20 pts.)


Exercise 7.1 from the textbook

Solution:
(a)
-NOT coherent.
All possible outcomes are coherent because all accesses to each memory variable are made
by different threads. So interleaving of accesses to the same memory variable is arbitrary and
all outcomes are coherent.
-NOT sequentially consistent
The two stores are in different threads, thus they are not ordered. However P2 and P3 cannot
observe the two stores in different orders. According to SC, if P2 reads A=1 and P3 reads
B=1 then P3 cannot read A=0. So (R1, R2, R3)=(1,1,0) is not sequentially consistent. In this
outcome the load-to-store order has been violated since B:=1 is allowed to perform while the
preceding load of A is not globally performed.
-NOT TSO
The difference between IBM370 and TSO is that in TSO values can be returned from the
store buffer. There is no such occurrence here since no thread reads one of its own stores.
Thus the execution behaves similarly on IBM370 and TSO and the only outcome that is not
TSO is (1,1,0).
-NOT weakly ordered
Since WO systems do not order regular loads and stores, all outcomes are possible.

(b)
This is the sequence in Dekker’s algorithm.
-NOT coherent
All possible outcomes are coherent because all accesses to each memory variable are made
by different processors. So interleaving is arbitrary.
-NOT sequentially consistent

This is well-known. All outcomes are SC except for (R1,R2)=(0,0). In this outcome the store-
to-load order is violated.
-NOT TSO
All outcomes are possible under TSO. Same reason as for IBM370.
-NOT weakly ordered
Since WO systems do not order regular loads and stores, all outcomes are possible.
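
The SC claim in (b) can be checked by enumerating every interleaving. The exercise text is not reproduced in this handout, so the two threads below are an assumption: the usual Dekker-style pair, where T1 stores A and then loads B into R1, and T2 stores B and then loads A into R2, with A and B initially 0.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving atomically, one operation at a time (SC)."""
    mem = {"A": 0, "B": 0}
    regs = {}
    for op in schedule:
        if op[0] == "st":            # ("st", address, value)
            mem[op[1]] = op[2]
        else:                        # ("ld", address, register)
            regs[op[2]] = mem[op[1]]
    return regs["R1"], regs["R2"]


# Assumed Dekker-style litmus test (not taken from the exercise statement).
T1 = [("st", "A", 1), ("ld", "B", "R1")]
T2 = [("st", "B", 1), ("ld", "A", "R2")]

sc_outcomes = {run(s) for s in interleavings(T1, T2)}
```

Under these assumptions `sc_outcomes` contains (0,1), (1,0) and (1,1) but never (0,0), which is the outcome that needs a store-to-load reordering.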

(c)
To be at all correct, this sequence must return R1=1 and R3=1 (because of intra-process
dependencies). So all outcomes in which R1 or R3 are 0 are incorrect under any model. Thus
the only possible outcomes under ANY model are:
(R1,R2,R3,R4)=(1,0,1,0),(1,0,1,1),(1,1,1,0) or (1,1,1,1). In the following we restrict the
discussion to these four outcomes.
-NOT coherent
All four outcomes are coherent
-NOT sequentially consistent.
Both R2 and R4 cannot be 0. So the only non-SC outcome is (1,0,1,0)
Since all subsequent models are relaxed models, they must accept all SC outcomes. Therefore
we focus on this one non-SC outcome only in the following.
-NOT TSO
In TSO, a Load returning a value from same thread’s Store is not ordered with the Store,
although it returns its value. Thus the two loads in each thread could be performed before the
first Store and (1,0,1,0) is possible.
-NOT W.O.
In WO, no order is imposed on regular Loads and Stores and thus all outcomes are possible.

(d)
This problem is more complex because up to 3 values can be returned by Loads. R1 and R3
could potentially be 0, 1, or 2 and R2 or R4 could be 0 or 1. The total number of possibilities is
3x3x2x2=36. However, any outcome that returns R1=0 or R3=0 is incorrect: Because of
intra-thread dependencies, which are enforced in all cases, R1 and R3 must be equal to 1 or 2.
So (0,x,x,x) and (x,x,0,x) are incorrect in all cases. The only possible values for R1 and R3 in
all cases are either 1 or 2. This leaves us with 2x2x2x2=16 possibilities.
-NOT coherent.
The only memory location that could cause coherence problems is C, because accesses to A
and B are in different threads, so are not ordered. If we look at all the possible orderings of
the 4 accesses to C, the following must be enforced: if R1=2 then R3 must be 2 and if R3=1
then R1 must be 1. So (2,x,1,x) is not coherent. We are left with 3x4 = 12 possibilities.
- NOT sequentially consistent.
SC must be at least coherent so (2,x,1,x) is not SC. If we look at the other accesses (on A and
B), they are the same as in Dekker’s algorithm. Thus we cannot have (R2,R4) =(0,0). We are
left with 3x3 = 9 possibilities.
-NOT TSO
TSO is coherent (forwarding store buffer), so (2,x,1,x) cannot be TSO. However, in TSO, the
store of C followed by the load of C are not ordered. The processor load may return its value
from the store buffer before the store has been globally performed. Thus, (R2,R4) = (0,0) is a
possible outcome under TSO.
-NOT WO
Since WO systems do not order regular loads and stores, all outcomes are possible.
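
The counts quoted in (d) (16, then 12, then 9) can be verified by brute force. The sketch below only applies the filters stated in the text to tuples (R1,R2,R3,R4); it does not simulate the program, so the 9 remaining tuples are candidates that pass these filters rather than a proof that each one is reachable.

```python
from itertools import product

# (R1, R2, R3, R4) with R1, R3 in {1, 2} and R2, R4 in {0, 1}.
all_outcomes = list(product((1, 2), (0, 1), (1, 2), (0, 1)))

# Coherence filter from the text: (2, x, 1, x) is not allowed.
coherent = [o for o in all_outcomes if not (o[0] == 2 and o[2] == 1)]

# SC filter from the text: additionally, (R2, R4) = (0, 0) is not allowed.
sc_candidates = [o for o in coherent if (o[1], o[3]) != (0, 0)]

counts = (len(all_outcomes), len(coherent), len(sc_candidates))
```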

Problem 4 [SIMD] (20 points)


SIMD utilization of an application/program running on a GPU is the fraction of SIMD lanes that
are kept busy with active threads during the execution of an application.
Consider the following code segment running on a GPU. Each thread executes a SINGLE
ITERATION of the loop shown. The data values in arrays A, B, C and D are already in vector
registers i.e. there is no load or store required. There are 64 SIMD lanes in the GPU and a warp
consists of 64 threads.
(Note: there are 7 instructions in each thread and assume that each instruction takes the same time
to execute)

for (i = 0; i < 16*1024*1024; i++) {
    B[i] = A[i] - C[i];       // Instruction 1
    D[i] = A[i] + C[i];       // Instruction 2
    if (A[i] > 0) {           // Instruction 3
        A[i] = A[i] * C[i];   // Instruction 4
        B[i] = A[i] + B[i];   // Instruction 5
        C[i] = B[i] + 1;      // Instruction 6
        D[i] = B[i] - 1;      // Instruction 7
    }
}

a) How many warps does it take to execute the program?

2^18 warps.
Warps = (number of threads)/(number of threads per warp)
Number of threads = 2^24 (one thread per iteration)
Number of threads per warp = 64 = 2^6
Number of warps = 2^24 / 2^6 = 2^18

b) Is it possible for the SIMD utilization to be 50%? If yes what condition needs to be true
for arrays A, B, C and D?
Yes
ARRAY A:
In every warp (64 consecutive elements of A), exactly 8 elements are positive.
When A[i] is positive the thread runs all 7 instructions, but when A[i] is not positive the
thread executes only 3 of the 7 instructions, i.e. 3/7 utilization.
So if 1/8 of the threads run at full utilization and 7/8 run at 3/7 utilization, the total
SIMD utilization becomes 1/8 + 7/8*3/7 = 1/8 + 3/8 = 50%.

ARRAY B:
No requirement

ARRAY C:
No requirement

ARRAY D:
No requirement

If it is not possible for the SIMD utilization to be 50%, please explain why.
N/A

c) Is it possible for the SIMD utilization to be 100%? If yes what condition needs to be true
for arrays A, B, C and D? [6 points]
Yes
ARRAY A:

Within every warp (64 consecutive elements of A), the elements are either all positive or all non-positive.

ARRAY B:
No requirement

ARRAY C:
No requirement

ARRAY D:
No requirement

If it is not possible for the SIMD utilization to be 100%, please explain why.
N/A
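
The answers to parts (a) to (c) can be checked numerically with a small Python sketch. It assumes, as the solution does, that a warp in which no element of A is positive skips instructions 4 to 7 entirely, and that a warp with at least one positive element issues all 7 instructions with the inactive lanes idle. The function name `warp_utilization` is hypothetical.

```python
from fractions import Fraction

LANES = 64            # SIMD lanes = threads per warp
COMMON = 3            # instructions 1-3, executed by every thread
GUARDED = 4           # instructions 4-7, executed only when A[i] > 0

num_threads = 16 * 1024 * 1024
num_warps = num_threads // LANES


def warp_utilization(num_positive):
    """Fraction of busy lane-slots for a warp with num_positive active threads."""
    if num_positive == 0:
        # Assumption: the whole warp skips the branch body, so only the
        # 3 common instructions are issued and every lane is busy.
        return Fraction(1)
    busy = COMMON * LANES + GUARDED * num_positive
    total = (COMMON + GUARDED) * LANES
    return Fraction(busy, total)
```

With 8 positive elements per warp the busy fraction is (3*64 + 4*8)/(7*64) = 224/448 = 1/2, and with 0 or 64 positive elements it is 1.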

Problem 5 [Synchronization and Consistency] (15 pts.)


In the following code, R1 is a register in each thread. A is a memory address. The
memory system is coherent (store atomic). I01 and I11 are loads from A; I03 and I13
are stores to A.
INIT: A=2
T0 T1
I01: R1:=A I11: R1:=A
I02: R1:=R1+2 I12: R1:=R1+1
I03: A:=R1 I13: A:=R1
1. What are all the possible values of memory location A after this program is
executed?
Answer: (A) = 5,4,3
If all instructions of T0 execute before the instructions of T1 (or vice
versa), the result is 5 in both cases.
The two loads could also execute before either store, so that both loads
return 2. Depending on which thread writes last, the final value of A is 4
or 3.
2. We add a barrier synchronization at the beginning and at the end of the code:
INIT: A=2,bar=0,bar1=0
T0 T1
Barrier(bar,2) Barrier (bar,2)
I01: R1:=A I11: R1:=A
I02: R1:=R1+2 I12: R1:=R1+1
I03: A:=R1 I13: A:=R1
Barrier(bar1,2) Barrier(bar1,2)
What are all the possible values of memory location A after this program is
executed?
Answer: (A) = 5,4,3
The barriers do not change the situation.
Possible results are as in 1.
3. We include the code in a critical section:
INIT: A=2,l=0
T0 T1
Lock(l) Lock(l)
I01: R1:=A I11: R1:=A
I02: R1:=R1+2 I12: R1:=R1+1
I03: A:=R1 I13: A:=R1
Unlock(l) Unlock(l)
What are all the possible values of memory location A after this program is
executed?
Answer: (A) = 5
Here the critical section enforces that the code of T0 and T1 do not overlap
in time. So the only possible result is 5.
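
The answers to parts 1 and 3 can be reproduced by enumerating interleavings in Python. This sketch assumes each instruction executes atomically and models the critical section simply by allowing only the two serial orders; the helper names are hypothetical.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


INCREMENT = {0: 2, 1: 1}   # T0 adds 2, T1 adds 1

T0 = [(0, "load"), (0, "add"), (0, "store")]
T1 = [(1, "load"), (1, "add"), (1, "store")]


def final_values(schedules):
    """Run each schedule from A = 2 and collect the final values of A."""
    results = set()
    for schedule in schedules:
        a = 2
        r1 = {0: 0, 1: 0}              # private R1 of each thread
        for tid, op in schedule:
            if op == "load":
                r1[tid] = a
            elif op == "add":
                r1[tid] += INCREMENT[tid]
            else:                      # store
                a = r1[tid]
        results.add(a)
    return results


unsynchronized = final_values(interleavings(T0, T1))
with_lock = final_values([T0 + T1, T1 + T0])   # critical sections never overlap
```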
Problem 6 [Coherence and Consistency] (20 pts.)
For this problem we will be using the following sequences of instructions. These are small
programs, each executed on a different processor, each with its own cache and register set. In the
following R is a register and X is a memory location. Each instruction has been named (e.g., B3)
to make it easy to write answers.

Initial value in location X is 0.

Processor A             Processor B             Processor C
A1: ST X, 1             B1: R2 := LD X          C1: ST X, 6
A2: R1 := LD X          B2: R2 := ADD R2, 1     C2: R3 := LD X
A3: R1 := ADD R1, R1    B3: ST X, R2            C3: R3 := ADD R3, R3
A4: ST X, R1            B4: R2 := LD X          C4: ST X, R3
                        B5: R2 := ADD R2, R2
                        B6: ST X, R2

(a) Can X hold the value of 4 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.

(b) Can X hold the value of 5 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.

(c) Can X hold the value of 6 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.

(d) For this particular problem, can a processor that reorders instructions but follows local
dependencies produce an answer that cannot be produced by the SC model?

Solution:

(a) Yes. C1, B1-B6, A1-A4, C2-C4

(b) No. All results must be even

(c) Yes. All of C, All of A, All of B

(d) No. All stores/loads must be done in order because they’re to the same address, so no new
results are possible.
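
Answers (a) to (c) can be cross-checked with an exhaustive search over all sequentially consistent interleavings. The Python sketch below is an illustration under the assumption that each instruction executes atomically; the operation names (`st_imm`, `ld`, `dbl`, `inc`, `st_reg`) are a hypothetical encoding of the three programs.

```python
# Each processor uses a single register, so one register per processor suffices.
PROGRAMS = (
    (("st_imm", 1), ("ld",), ("dbl",), ("st_reg",)),                    # A1-A4
    (("ld",), ("inc",), ("st_reg",), ("ld",), ("dbl",), ("st_reg",)),   # B1-B6
    (("st_imm", 6), ("ld",), ("dbl",), ("st_reg",)),                    # C1-C4
)


def step(op, x, reg):
    """Return (new X, new register) after executing one instruction."""
    kind = op[0]
    if kind == "st_imm":
        return op[1], reg
    if kind == "ld":
        return x, x
    if kind == "dbl":
        return x, reg + reg
    if kind == "inc":
        return x, reg + 1
    return reg, reg          # st_reg: store the register to X


def final_values():
    """Depth-first search over all interleavings; returns the final X values."""
    start = ((0, 0, 0), 0, (0, 0, 0))    # (program counters, X, registers)
    seen, stack, finals = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, x, regs = state
        finished = True
        for p in range(3):
            if pcs[p] < len(PROGRAMS[p]):
                finished = False
                new_x, new_reg = step(PROGRAMS[p][pcs[p]], x, regs[p])
                new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
                new_regs = regs[:p] + (new_reg,) + regs[p + 1:]
                stack.append((new_pcs, new_x, new_regs))
        if finished:
            finals.add(x)
    return finals
```

Every final value is written by A4, B6 or C4, each of which stores a doubled register, which is why the search never produces an odd result.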

Problem 7 [Consistency and Synchronization] (10 pts.)
In the following code snippet, what are the possible outcomes?
INIT: A=0;B=0;lock=1 /initial values in memory
T0 T1
ADDI R2,R0,#1 ADDI R3,R0,#1
ADDI R1,R0,#2; ADDI R1,R0,#1
SW R1,A SW R1,B
SW R0,lock while(R3!=0)T&S R3,lock;
LW R2,B LW R2,A
R1, R2 and R3 are registers in threads T0 and T1. R0 always contains value 0. A, B
and lock are shared-memory addresses. T&S is an instruction that atomically returns
the value in lock and stores 1 in lock.
To answer this question, give all the possible values of R1, R2 and R3 in T0 and T1
at the end of the execution.

ANSWER:
In Thread T0:
R1 = 2 /Local. As set by the ADDI instruction
R2 = 0 or 1 /The code of T0 could execute before or after the code in T1
In Thread T1:
R1 = 1 /Local. As set by ADDI instruction.
R2 = 2 /Synchronization. The LW into R2 of A must return the value set by T0
R3 = 0 /Because of the while loop
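
The register values above can be confirmed with a small state-space search in Python. This is a sketch that assumes sequentially consistent execution with atomic instructions; only the memory instructions are modelled, since the ADDI results (R1 in both threads, the initial R2 = 1 in T0, the initial R3 = 1 in T1) are local. The function name `explore` is hypothetical.

```python
def explore():
    """Return all final (T0.R2, T1.R2, T1.R3) triples under SC."""
    # State: (pc0, pc1, A, B, lock, t0_r2, t1_r2, t1_r3)
    start = (0, 0, 0, 0, 1, 1, None, 1)
    seen, stack, finals = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pc0, pc1, a, b, lock, t0_r2, t1_r2, t1_r3 = state
        if pc0 == 3 and pc1 == 3:
            finals.add((t0_r2, t1_r2, t1_r3))
            continue
        # Thread T0: SW R1,A ; SW R0,lock ; LW R2,B
        if pc0 == 0:
            stack.append((1, pc1, 2, b, lock, t0_r2, t1_r2, t1_r3))
        elif pc0 == 1:
            stack.append((2, pc1, a, b, 0, t0_r2, t1_r2, t1_r3))
        elif pc0 == 2:
            stack.append((3, pc1, a, b, lock, b, t1_r2, t1_r3))
        # Thread T1: SW R1,B ; while(R3!=0) T&S R3,lock ; LW R2,A
        if pc1 == 0:
            stack.append((pc0, 1, a, 1, lock, t0_r2, t1_r2, t1_r3))
        elif pc1 == 1:
            got = lock                         # T&S returns the old lock value
            next_pc = 2 if got == 0 else 1     # keep spinning until it reads 0
            stack.append((pc0, next_pc, a, b, 1, t0_r2, t1_r2, got))
        elif pc1 == 2:
            stack.append((pc0, 3, a, b, lock, t0_r2, a, t1_r3))
    return finals
```

T1 can leave its loop only after T0 has cleared the lock, which in program order follows T0's store to A, so T1 always loads A = 2.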
