HW5S24 Sol
Sol.>
SC:           SC with store-to-load relaxation:   WO:
Lock L1       Lock L1                             Lock L1
Fence         Fence                               Fence
Read A        Read A                              Read A
Fence         Fence                               Read B
Read B        Read B                              Write B
Write B       Write B                             Write D
Fence         Fence                               Read E
Write D       Write D                             Fence
Fence         Read E                              Unlock L1
Read E        Fence                               Fence
Fence         Unlock L1                           Read F
Unlock L1     Read F
Fence
Read F
(b)
Sol.>
SC:           WO:
Lock L1       Lock L1
Read A        Read A
Read B        Read B
Write B       Write B
Write D       Write D
Fence         Read E
Read E        Unlock L1
Unlock L1     Fence
Fence         Read F
Read F
1. What is the problem with this code? Write the code for BARRIER in a way that avoids
the problem.
2. Assume that the processor does not have a Fetch&add instruction. However, it has LL
(load locked) and SC (store conditional) instructions. Show how you would implement
Fetch&add using LL and SC.
Solution:
1. The problem with this code is that one process could be delayed in the while loop after
executing the fetch and add. During this delay it is possible that all other processes
would go through the barrier and one of them would reach the barrier again. As a result,
this delayed process would end up one iteration behind all other processes, an
unintended outcome. A number of solutions are possible. One is the use of a toggle flag
to differentiate between consecutive iterations.
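global boolean flag := true
BARRIER (B: Barvariable, N: Integer)
{
    boolean local_flag := not flag;      /value the flag will take when this barrier opens
    if (F&A(B,1) = N-1)                  /F&A returns the old count: last process to arrive
    {
        B := 0;                          /reset the counter for the next barrier
        flag := local_flag;              /toggle the flag to release the waiting processes
    }
    while (flag != local_flag) do {};    /spin until the last arriver toggles the flag
}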
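Here again a process could get delayed in the while loop. However, the global flag cannot be
toggled again until all N processes have arrived at the next barrier, and the delayed process
cannot arrive there before it has left the current one. It therefore always observes the
toggled flag before the flag can change again.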
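The toggle-flag barrier can be exercised with a small Python sketch. This is only an illustration under stated assumptions: a lock emulates the atomic F&A, and the names SenseBarrier and run_rounds are made up for this example. Each thread records its progress before every barrier and checks afterwards that no thread is still behind.

```python
import threading
import time


class SenseBarrier:
    """Toggle-flag barrier for n threads (illustrative names, not from the homework)."""

    def __init__(self, n):
        self.n = n
        self.count = 0                      # plays the role of B
        self.flag = True                    # global toggle flag
        self._faa_lock = threading.Lock()

    def _fetch_and_add(self, delta):
        # Assumption: a lock stands in for the hardware F&A instruction.
        with self._faa_lock:
            old = self.count
            self.count += delta
            return old

    def wait(self):
        local_flag = not self.flag
        if self._fetch_and_add(1) == self.n - 1:
            self.count = 0                  # last arriver resets the counter
            self.flag = local_flag          # and releases everyone else
        while self.flag != local_flag:
            time.sleep(0)                   # spin, yielding to other threads


def run_rounds(n_threads=4, rounds=50):
    """Return the list of (thread, round) pairs where a thread saw a straggler."""
    barrier = SenseBarrier(n_threads)
    progress = [0] * n_threads
    violations = []

    def worker(i):
        for r in range(rounds):
            progress[i] = r + 1             # finished the work of round r
            barrier.wait()
            if min(progress) < r + 1:       # someone has not finished round r
                violations.append((i, r))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

An empty list from run_rounds means every thread left each barrier only after all threads had reached it.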
2.
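F&A(A,R1):
LOOP: LL A,R2          /R2 := Mem[A]; set the link on A
      ADD R3,R2,R1     /R3 := old value + increment
      SC A,R3          /try Mem[A] := R3; R3 becomes 0 if the SC fails
      BEQZ R3,LOOP     /retry if the SC failed
      RET              /R2 holds the old value of A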
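The retry loop can also be mimicked in software. The following Python sketch is an emulation under assumptions, not real LL/SC hardware: LLSCCell is a hypothetical class in which a version counter plays the role of the link, and an internal lock makes each individual LL or SC atomic. The fetch_and_add function has the same shape as the assembly loop.

```python
import threading


class LLSCCell:
    """Hypothetical memory cell emulating load-locked / store-conditional."""

    def __init__(self, value=0):
        self._value = value
        self._version = 0                   # bumped on every successful store
        self._lock = threading.Lock()       # makes each LL or SC atomic
        self._link = threading.local()      # per-thread link register

    def load_linked(self):
        with self._lock:
            self._link.version = self._version
            return self._value

    def store_conditional(self, new_value):
        with self._lock:
            if getattr(self._link, "version", None) != self._version:
                return False                # someone wrote since our LL: fail
            self._value = new_value
            self._version += 1
            return True


def fetch_and_add(cell, delta):
    """Retry LL/SC until the SC succeeds; return the old value."""
    while True:
        old = cell.load_linked()            # LL A,R2
        if cell.store_conditional(old + delta):   # ADD, then SC A,R3
            return old                      # SC succeeded
        # SC failed: loop back, like BEQZ R3,LOOP


def hammer(n_threads=4, adds_per_thread=1000):
    """Have several threads increment one cell; return its final value."""
    cell = LLSCCell(0)

    def worker():
        for _ in range(adds_per_thread):
            fetch_and_add(cell, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load_linked()
```

If the loop is atomic, no increment is lost, so hammer(4, 1000) should end at 4000.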
Solution:
(a)
-NOT coherent.
All possible outcomes are coherent because all accesses to each memory variable are made
by different threads. So interleaving of accesses to the same memory variable is arbitrary and
all outcomes are coherent.
-NOT sequentially consistent
The two stores are in different threads, thus they are not ordered. However P2 and P3 cannot
observe the two stores in different orders. According to SC, if P2 reads A=1 and P3 reads
B=1 then P3 cannot read A=0. So (R1, R2, R3)=(1,1,0) is not sequentially consistent. In this
outcome the load-to-store order has been violated, since B:=1 is allowed to perform while the
preceding load of A is not yet globally performed.
-NOT TSO
The difference between IBM370 and TSO is that in TSO values can be returned from the
store buffer. There is no such occurrence here since no thread reads one of its own stores.
Thus the execution behaves the same way on IBM370 and TSO, and the only outcome that is not
TSO is (1,1,0).
-NOT weakly ordered
Since WO systems do not order regular loads and stores, all outcomes are possible.
(b)
This is the sequence in Dekker’s algorithm.
-NOT coherent
All possible outcomes are coherent because all accesses to each memory variable are made
by different processors. So interleaving is arbitrary.
-NOT sequentially consistent
This is well-known. All outcomes are SC except for (R1,R2)=(0,0). In this outcome the store-
to-load order is violated.
-NOT TSO
All outcomes are possible under TSO. Same reason as for IBM370.
-NOT weakly ordered
Since WO systems do not order regular loads and stores, all outcomes are possible.
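The SC claim in (b) can be checked by brute force. The sketch below assumes the standard Dekker sequence (P1: A := 1; R1 := B and P2: B := 1; R2 := A); the register names and the helper sc_outcomes are assumptions made for this example. It enumerates every interleaving that keeps each thread's program order and collects the resulting (R1,R2) pairs.

```python
from itertools import combinations

# Assumed program (standard Dekker sequence; register names are an assumption):
#   P1: A := 1 ; R1 := B        P2: B := 1 ; R2 := A
P1 = [("store", "A", 1), ("load", "B", "R1")]
P2 = [("store", "B", 1), ("load", "A", "R2")]


def sc_outcomes(t1, t2):
    """Return all (R1, R2) results over program-order-preserving interleavings."""
    total = len(t1) + len(t2)
    outcomes = set()
    # Choose which global positions are taken by t1; the rest go to t2.
    for slots in combinations(range(total), len(t1)):
        mem = {"A": 0, "B": 0}
        regs = {}
        i1 = i2 = 0
        for pos in range(total):
            if pos in slots:
                kind, addr, arg = t1[i1]
                i1 += 1
            else:
                kind, addr, arg = t2[i2]
                i2 += 1
            if kind == "store":
                mem[addr] = arg             # store the immediate value
            else:
                regs[arg] = mem[addr]       # load into the named register
        outcomes.add((regs["R1"], regs["R2"]))
    return outcomes
```

Under these assumptions sc_outcomes(P1, P2) contains every pair except (0,0), which matches the answer above.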
(c)
To be at all correct, this sequence must return R1=1 and R3=1 (because of intra-process
dependencies). So all outcomes in which R1 or R3 are 0 are incorrect under any model. Thus
the only possible outcomes under ANY model are:
(R1,R2,R3,R4)=(1,0,1,0),(1,0,1,1),(1,1,1,0) or (1,1,1,1). In the following we restrict the
discussion to these four outcomes.
-NOT coherent
All four outcomes are coherent
-NOT sequentially consistent.
Both R2 and R4 cannot be 0. So the only non-SC outcome is (1,0,1,0)
Since all subsequent models are relaxed models, they must accept all SC outcomes. Therefore
we focus only on this one non-SC outcome in the following.
-NOT TSO
In TSO, a Load returning a value from the same thread's Store is not ordered with the Store,
although it returns its value. Thus the two loads in each thread could be performed before the
first Store and (1,0,1,0) is possible.
-NOT W.O.
In WO, no order is imposed on regular Loads and Stores and thus all outcomes are possible.
(d)
This problem is more complex because up to 3 values can be returned by Loads. R1 and R3
could potentially be 0, 1, or 2, and R2 and R4 could be 0 or 1. The total number of possibilities is
3x3x2x2=36. However, any outcome that returns R1=0 or R3=0 is incorrect: Because of
intra-thread dependencies, which are enforced in all cases, R1 and R3 must be equal to 1 or 2.
So (0,x,x,x) and (x,x,0,x) are incorrect in all cases. The only possible values for R1 and R3 in
all cases are either 1 or 2. This leaves us with 2x2x2x2=16 possibilities.
-NOT coherent.
The only memory location that could cause coherence problems is C, because accesses to A
and B are in different threads, so are not ordered. If we look at all the possible orderings of
the 4 accesses to C, the following must be enforced: if R1=2 then R3 must be 2 and if R3=1
then R1 must be 1. So (2,x,1,x) is not coherent. We are left with 3x4 = 12 possibilities.
- NOT sequentially consistent.
SC must be at least coherent so (2,x,1,x) is not SC. If we look at the other accesses (on A and
B), they are the same as in Dekker’s algorithm. Thus we cannot have (R2,R4) =(0,0). We are
left with 3x3 = 9 possibilities.
-NOT TSO
TSO is coherent (forwarding store buffer), so (2,x,1,x) cannot be TSO. However, in TSO, the
store to C and the following load of C are not ordered: the load may return its value
from the store buffer before the store has been globally performed. Thus, (R2,R4) = (0,0) is a
possible outcome under TSO.
-NOT WO
Since WO systems do not order regular loads and stores, all outcomes are possible.
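The counts quoted in (d) can be reproduced by enumerating (R1,R2,R3,R4) directly. This is a small sketch that encodes only the filters stated in the text (R1 and R3 in {1,2}, no (2,x,1,x) for coherence, no (R2,R4)=(0,0) for SC); the function name count_outcomes is made up for the example.

```python
from itertools import product


def count_outcomes():
    """Return (valid, coherent, sc) outcome counts for (R1, R2, R3, R4)."""
    # Intra-thread dependencies force R1 and R3 into {1, 2}; R2, R4 are 0 or 1.
    valid = list(product((1, 2), (0, 1), (1, 2), (0, 1)))
    # Coherence on C rules out R1 = 2 together with R3 = 1.
    coherent = [o for o in valid if not (o[0] == 2 and o[2] == 1)]
    # SC additionally rules out (R2, R4) = (0, 0), as in Dekker's sequence.
    sc = [o for o in coherent if not (o[1] == 0 and o[3] == 0)]
    return len(valid), len(coherent), len(sc)
```

The three numbers returned correspond to the 16, 12, and 9 possibilities derived above.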
2^18 warps.
Warps = (number of threads)/(number of threads per warp)
Number of threads = 2^24 (one thread per iteration)
Number of threads per warp = 64 = 2^6
Number of warps = 2^24 / 2^6 = 2^18
b) Is it possible for the SIMD utilization to be 50%? If yes what condition needs to be true
for arrays A, B, C and D?
Yes
ARRAY A:
8 out of every 64 consecutive elements of A (i.e., the 64 elements handled by one warp) are positive.
When its element of A is positive, a thread runs all 7 instructions; when it is negative, the thread
executes only 3 of the 7 instructions, i.e. 3/7 utilization.
So if 1/8 of the threads run at full utilization and 7/8 run at 3/7 utilization, the total
SIMD utilization becomes 1/8 + (7/8)(3/7) = 1/8 + 3/8 = 50%.
ARRAY B:
No requirement
ARRAY C:
No requirement
ARRAY D:
No requirement
If it is not possible for the SIMD utilization to be 50%, please explain why.
N/A
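The arithmetic can be checked per warp. The sketch below assumes the model implied by the answer: 3 of the 7 instructions are executed by all 64 lanes, and the other 4 only by lanes whose element of A is positive (and are skipped entirely if no lane is positive). The function and constant names are assumptions for this example.

```python
from fractions import Fraction

WARP_SIZE = 64        # threads per warp
TOTAL_INSTR = 7       # instructions on the long (A positive) path
COMMON_INSTR = 3      # instructions every thread executes


def num_warps(num_threads):
    """Number of warps needed for num_threads threads."""
    return num_threads // WARP_SIZE


def warp_utilization(num_positive):
    """SIMD utilization of one warp with num_positive lanes taking the long path."""
    if num_positive == 0:
        # Assumption: the warp skips the 4 conditional instructions entirely.
        return Fraction(1)
    conditional = TOTAL_INSTR - COMMON_INSTR
    active_slots = COMMON_INSTR * WARP_SIZE + conditional * num_positive
    return Fraction(active_slots, TOTAL_INSTR * WARP_SIZE)
```

With 8 positive lanes the warp fills 3*64 + 4*8 = 224 of 7*64 = 448 lane slots, which is the 50% figure; with 0 or 64 positive lanes every issued slot is active.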
c) Is it possible for the SIMD utilization to be 100%? If yes what condition needs to be true
for arrays A, B, C and D? [6 points]
Yes
ARRAY A:
Within every 64 consecutive values of A (one warp), the elements are either all positive or all negative.
ARRAY B:
No requirement
ARRAY C:
No requirement
ARRAY D:
No requirement
If it is not possible for the SIMD utilization to be 100%, please explain why.
N/A
executed?
Answer: (A) = 5
Here the critical section ensures that the code of T0 and the code of T1 do not overlap
in time. So the only possible result is 5.
Problem 6 [Coherence and Consistency] (20 pts.)
For this problem we will be using the following sequences of instructions. These are small
programs, each executed on a different processor, each with its own cache and register set. In the
following R is a register and X is a memory location. Each instruction has been named (e.g., B3)
to make it easy to write answers.
B6: ST X, R2
(a) Can X hold the value of 4 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.
(b) Can X hold the value of 5 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.
(c) Can X hold the value of 6 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.
(d) For this particular problem, can a processor that reorders instructions but follows local
dependencies produce an answer that cannot be produced by the SC model?
Solution:
(d) No. All stores/loads must be done in order because they’re to the same address, so no new
results are possible.
Problem 7 [Consistency and Synchronization] (10 pts.)
In the following code snippet, what are the possible outcomes?
INIT: A=0; B=0; lock=1     /initial values in memory
T0                  T1
ADDI R2,R0,#1       ADDI R3,R0,#1
ADDI R1,R0,#2       ADDI R1,R0,#1
SW R1,A             SW R1,B
SW R0,lock          while(R3!=0) T&S R3,lock;
LW R2,B             LW R2,A
R1, R2 and R3 are registers in threads T0 and T1. R0 always contains value 0. A, B
and lock are shared-memory addresses. T&S is an instruction that atomically returns
the value in lock and stores 1 in lock.
To answer this question, give all the possible values of R1, R2 and R3 in T0 and T1
at the end of the execution.
ANSWER:
In Thread T0:
R1 = 2 /Local. As set by the ADDI instruction
R2 = 0 or 1 /The code of T0 could execute before or after the code in T1
In Thread T1:
R1 = 1 /Local. As set by ADDI instruction.
R2 = 2 /Synchronization. The LW into R2 of A must return the value set by T0
R3 = 0 /Because of the while loop
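These outcomes can be confirmed by exploring every reachable state of the two threads. The sketch below assumes a sequentially consistent interleaving of whole instructions and registers that start at 0; the state layout and function names are made up for this example. The T&S spin loop is handled by a visited set, so the search terminates.

```python
from collections import deque

# State: (pc0, pc1, t0_r1, t0_r2, t1_r1, t1_r2, t1_r3, A, B, lock)
# Assumption: registers start at 0; memory starts at A=0, B=0, lock=1.
INITIAL = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)


def step_t0(s):
    pc0, pc1, a1, a2, b1, b2, b3, A, B, lock = s
    if pc0 == 0:
        a2 = 1                      # ADDI R2,R0,#1
    elif pc0 == 1:
        a1 = 2                      # ADDI R1,R0,#2
    elif pc0 == 2:
        A = a1                      # SW R1,A
    elif pc0 == 3:
        lock = 0                    # SW R0,lock (release)
    elif pc0 == 4:
        a2 = B                      # LW R2,B
    else:
        return None                 # T0 has finished
    return (pc0 + 1, pc1, a1, a2, b1, b2, b3, A, B, lock)


def step_t1(s):
    pc0, pc1, a1, a2, b1, b2, b3, A, B, lock = s
    if pc1 == 0:
        b3, pc1 = 1, 1              # ADDI R3,R0,#1
    elif pc1 == 1:
        b1, pc1 = 1, 2              # ADDI R1,R0,#1
    elif pc1 == 2:
        B, pc1 = b1, 3              # SW R1,B
    elif pc1 == 3:
        if b3 != 0:
            b3, lock = lock, 1      # T&S R3,lock: read old lock, write 1
        else:
            pc1 = 4                 # R3 == 0: leave the while loop
    elif pc1 == 4:
        b2, pc1 = A, 5              # LW R2,A
    else:
        return None                 # T1 has finished
    return (pc0, pc1, a1, a2, b1, b2, b3, A, B, lock)


def final_outcomes():
    """Return all (T0.R1, T0.R2, T1.R1, T1.R2, T1.R3) reachable at the end."""
    seen = {INITIAL}
    queue = deque([INITIAL])
    finals = set()
    while queue:
        s = queue.popleft()
        if s[0] == 5 and s[1] == 5:
            finals.add(s[2:7])
            continue
        for step in (step_t0, step_t1):
            nxt = step(s)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return finals
```

Under these assumptions the search yields exactly two final register tuples, differing only in the value T0 loads from B.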