HW5S24 Sol

Uploaded by

ruweiyan

University of Southern California

Department of Electrical Engineering

EE557 Spring 2024
Instructor: Murali Annavaram
Homework #5, Due: 11:59PM, Tuesday, April 27th
TOTAL SCORE: 110 Points
Problem 1 [Synchronization and Consistency] (20 pts.)
Consider the following program:
Lock L1
Read A
Read B
Write B
Write D
Read E
Unlock L1
Read F
Assume that there is a fence instruction that helps the programmer to order the instructions.
The fence instruction ensures that mem ops that are younger are not issued until the older
mem ops have globally performed.
(a) Assume a machine that does not guarantee any ordering. Insert fences between the
instructions to guarantee SC, SC with store-to-load relaxation and WO.
(b) Assume a machine that guarantees SC with store-to-load relaxation. Insert fences
between the instructions to guarantee SC and WO.

(a)
Sol.>
SC:          SC with store-to-load relaxation:    WO:
Lock L1      Lock L1                              Lock L1
Fence        Fence                                Fence
Read A       Read A                               Read A
Fence        Fence                                Read B
Read B       Read B                               Write B
Write B      Write B                              Write D
Fence        Fence                                Read E
Write D      Write D                              Fence
Fence        Read E                               Unlock L1
Read E       Fence                                Fence
Fence        Unlock L1                            Read F
Unlock L1    Read F
Fence
Read F

(b)
Sol.>
SC:          WO:
Lock L1      Lock L1
Read A       Read A
Read B       Read B
Write B      Write B
Write D      Write D
Fence        Read E
Read E       Unlock L1
Unlock L1    Fence
Fence        Read F
Read F

Problem 2 [Lock] (15 pts.)

Someone uses the fetch&add atomic primitive to implement a barrier synchronization suitable for
a shared memory multiprocessor. To use the barrier, a processor must execute BARRIER (BAR,
N), where BAR is the barrier name and N is the number of processes that need to arrive at the
barrier before any of them can proceed. Assume that N has the same value in each use of barrier
BAR. The barrier should be capable of supporting the following code:
while(condition) {
Compute for a while;
BARRIER (BAR,N)
}
A proposed solution for implementing the barrier is the following:
BARRIER (B: Barvariable, N: Integer)
{
if (F&A(B,1) = N-1)
{
B:=0;
}
while (B != 0) do {};
}

1. What is the problem with this code? Write the code for BARRIER in a way that avoids
the problem.
2. Assume that the processor does not have a Fetch&add instruction. However, it has LL
(load-locked) and SC (store-conditional) instructions. Show how you would implement
Fetch&add using LL and SC.

Solution:

1. The problem with this code is that one process could be delayed before its while loop
after executing the fetch and add. During this delay it is possible that all other processes
would go through the barrier and one of them would reach the barrier again, incrementing B
before the delayed process has observed B = 0. The delayed process then keeps spinning in
the previous use of the barrier and ends up one iteration behind all other processes, an
unintended outcome. Since it never performs its fetch and add for the next use, the count
can also never reach N. A number of solutions are possible. One is the use of a toggle flag
to differentiate between consecutive iterations.

global boolean flag := true
BARRIER (B: Barvariable, N: Integer)
{
int local_flag = not flag;
if (F&A(B,1) = N-1)
{
B:=0;
flag := local_flag;
}
while (flag != local_flag) do {};
}
Here again a process could get delayed in the while loop. However, the global flag can only
be toggled again once all N processes, including the delayed one, have arrived at the next
use of the barrier. The delayed process therefore always sees the flag value it is waiting for.
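The toggle-flag barrier can be tried out with a minimal Python sketch. It is only an emulation under stated assumptions: fetch&add is made atomic with a mutex rather than a hardware primitive, and the names FetchAndAdd, ToggleBarrier and run_demo are hypothetical, not part of the assignment.

```python
import threading
import time


class FetchAndAdd:
    """Shared counter; fetch&add is emulated with a mutex (assumption)."""

    def __init__(self):
        self.value = 0
        self._mutex = threading.Lock()

    def fetch_and_add(self, inc):
        with self._mutex:
            old = self.value
            self.value += inc
            return old


class ToggleBarrier:
    """Barrier that uses a toggle flag to separate consecutive uses."""

    def __init__(self, n):
        self.n = n
        self.b = FetchAndAdd()
        self.flag = True

    def wait(self):
        local_flag = not self.flag            # value the flag takes on release
        if self.b.fetch_and_add(1) == self.n - 1:
            self.b.value = 0                  # B := 0, done only by the last arriver
            self.flag = local_flag            # release everybody
        while self.flag != local_flag:
            time.sleep(0)                     # spin, yielding to other threads


def run_demo(n_threads=4, iterations=50):
    """Return the number of times a thread was seen behind or too far ahead."""
    barrier = ToggleBarrier(n_threads)
    progress = [-1] * n_threads
    violations = []

    def worker(tid):
        for it in range(iterations):
            progress[tid] = it
            barrier.wait()
            snapshot = list(progress)
            # After barrier number `it`, every thread is in iteration it or it+1.
            if any(p < it or p > it + 1 for p in snapshot):
                violations.append((tid, it, snapshot))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(violations)
```

Calling run_demo(4, 50) runs four threads through fifty uses of the same barrier and counts the cases where some thread was observed outside the current or next iteration.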

2.
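F&A(A,R1):
LOOP: LL A,R2        ; R2 := Mem[A] and set the link
ADD R3,R2,R1         ; R3 := old value + increment
SC A,R3              ; try to store R3; R3 := 0 if the SC failed
BEQZ R3,LOOP         ; retry until the SC succeeds
RET                  ; the old value is returned in R2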
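The retry loop can also be illustrated in Python. This is a sketch under assumptions: the LLSCCell class is a hypothetical software model of one memory word in which a store-conditional fails if any store has succeeded since the same thread's load-linked; real hardware tracks the link differently.

```python
import threading


class LLSCCell:
    """Toy memory word with load-linked / store-conditional (software model)."""

    def __init__(self, value=0):
        self._value = value
        self._version = 0                  # bumped on every successful store
        self._mutex = threading.Lock()
        self._link = threading.local()     # per-thread link register

    def load_linked(self):
        with self._mutex:
            self._link.version = self._version
            return self._value

    def store_conditional(self, new_value):
        with self._mutex:
            if getattr(self._link, "version", None) != self._version:
                return False               # somebody stored since our LL
            self._value = new_value
            self._version += 1
            self._link.version = None      # the link is consumed
            return True


def fetch_and_add(cell, inc):
    """Same structure as the LL/ADD/SC/BEQZ loop: retry until SC succeeds."""
    while True:
        old = cell.load_linked()
        if cell.store_conditional(old + inc):
            return old


def demo(n_threads=4, adds_per_thread=1000):
    """Return the final cell value and the sorted list of returned old values."""
    cell = LLSCCell(0)
    returned = []
    guard = threading.Lock()

    def worker():
        mine = [fetch_and_add(cell, 1) for _ in range(adds_per_thread)]
        with guard:
            returned.extend(mine)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load_linked(), sorted(returned)
```

If the emulated fetch&add is atomic, no increment is lost and every call returns a distinct old value, which is what demo() lets you check.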

Problem 3 [Memory Consistency] (20 pts.)

Exercise 7.1 from the textbook

Solution:
(a)
-NOT coherent.
All possible outcomes are coherent because all accesses to each memory variable are made
by different threads. So interleaving of accesses to the same memory variable is arbitrary and
all outcomes are coherent.
-NOT sequentially consistent
The two stores are in different threads, thus they are not ordered. However P2 and P3 cannot
observe the two stores in different orders. According to SC, if P2 reads A=1 and P3 reads
B=1 then P3 cannot read A=0. So (R1, R2, R3)=(1,1,0) is not sequentially consistent. In this
outcome the load-to-store order has been violated since B:=1 is allowed to perform while the
preceding load of A is not globally performed.
-NOT TSO
The difference between IBM370 and TSO is that in TSO values can be returned from the
store buffer. There is no such occurrence here since no thread reads one of its own stores.
Thus the execution behaves similarly on IBM370 and TSO and the only outcome that is not
TSO is (1,1,0).
-NOT weakly ordered
Since WO systems do not order regular loads and stores, all outcomes are possible.

(b)
This is the sequence in Dekker’s algorithm.
-NOT coherent
All possible outcomes are coherent because all accesses to each memory variable are made
by different processors. So interleaving is arbitrary.
-NOT sequentially consistent
This is well-known. All outcomes are SC except for (R1,R2)=(0,0). In this outcome the store-
to-load order is violated.
-NOT TSO
All outcomes are possible under TSO. Same reason as for IBM370.
-NOT weakly ordered
Since WO systems do not order regular loads and stores, all outcomes are possible.
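The SC claim for (b) can be checked by brute force. The sketch below assumes the usual Dekker fragment (T1: A:=1; R1:=B and T2: B:=1; R2:=A, with A and B initially 0), since the exercise text is not reproduced here; the helper name sc_outcomes is hypothetical. It executes every interleaving that respects program order and collects the register values.

```python
def sc_outcomes(threads, init, regs):
    """Enumerate register outcomes over all sequentially consistent interleavings.

    threads: list of programs; each op is ("st", var, value) or ("ld", var, reg).
    init:    dict with the initial memory contents.
    regs:    register names, in the order used for the outcome tuples.
    """
    results = set()

    def step(pcs, mem, regvals):
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            results.add(tuple(regvals[r] for r in regs))
            return
        for i, prog in enumerate(threads):
            if pcs[i] == len(prog):
                continue                      # this thread has finished
            op = prog[pcs[i]]
            new_mem, new_regs = dict(mem), dict(regvals)
            if op[0] == "st":
                new_mem[op[1]] = op[2]        # store is globally visible at once
            else:
                new_regs[op[2]] = mem[op[1]]  # load returns the latest store
            next_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            step(next_pcs, new_mem, new_regs)

    step((0,) * len(threads), dict(init), {})
    return results


# Assumed Dekker fragment: each thread stores its own flag, then loads the other's.
DEKKER = [
    [("st", "A", 1), ("ld", "B", "R1")],
    [("st", "B", 1), ("ld", "A", "R2")],
]

outcomes = sc_outcomes(DEKKER, {"A": 0, "B": 0}, ["R1", "R2"])
print(sorted(outcomes))  # (0, 0) does not appear
```

Under these assumptions the enumeration yields (0,1), (1,0) and (1,1) only, matching the statement that (0,0) requires a store-to-load reordering.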

(c)
To be at all correct, this sequence must return R1=1 and R3=1 (because of intra-process
dependencies). So all outcomes in which R1 or R3 are 0 are incorrect under any model. Thus
the only possible outcomes under ANY model are:
(R1,R2,R3,R4)=(1,0,1,0),(1,0,1,1),(1,1,1,0) or (1,1,1,1). In the following we restrict the
discussion to these four outcomes.
-NOT coherent
All four outcomes are coherent
-NOT sequentially consistent.
Both R2 and R4 cannot be 0. So the only non-SC outcome is (1,0,1,0)
Since all subsequent models are relaxed models, they must accept all SC outcomes. Therefore
we focus on this one non-SC outcome only in the following.
-NOT TSO
In TSO, a Load returning a value from same thread’s Store is not ordered with the Store,
although it returns its value. Thus the two loads in each thread could be performed before the
first Store and (1,0,1,0) is possible.
-NOT W.O.
In WO, no order is imposed on regular Loads and Stores and thus all outcomes are possible.

(d)
This problem is more complex because up to 3 values can be returned by Loads. R1 and R3
could potentially be 0, 1, or 2 and R2 and R4 could be 0 or 1. The total number of possibilities is
3x3x2x2=36. However, any outcome that returns R1=0 or R3=0 is incorrect: Because of
intra-thread dependencies, which are enforced in all cases, R1 and R3 must be equal to 1 or 2.
So (0,x,x,x) and (x,x,0,x) are incorrect in all cases. The only possible values for R1 and R3 in
all cases are either 1 or 2. This leaves us with 2x2x2x2=16 possibilities.
-NOT coherent.
The only memory location that could cause coherence problems is C, because accesses to A
and B are in different threads, so are not ordered. If we look at all the possible orderings of
the 4 accesses to C, the following must be enforced: if R1=2 then R3 must be 2 and if R3=1
then R1 must be 1. So (2,x,1,x) is not coherent. We are left with 3x4 = 12 possibilities.
- NOT sequentially consistent.
SC must be at least coherent so (2,x,1,x) is not SC. If we look at the other accesses (on A and
B), they are the same as in Dekker’s algorithm. Thus we cannot have (R2,R4) =(0,0). We are
left with 3x3 = 9 possibilities.
-NOT TSO
TSO is coherent (forwarding store buffer), so (2,x,1,x) cannot be TSO. However, in TSO, the
store of C followed by the load of C are not ordered. The processor load may return its value
from the store buffer before the store has been globally performed. Thus, (R2,R4) = (0,0) is a
possible outcome under TSO.
-NOT WO
Since WO systems do not order regular loads and stores, all outcomes are possible.

Problem 4 [SIMD] (20 points)

SIMD utilization of an application/program running on a GPU is the fraction of SIMD lanes that
are kept busy with active threads during the execution of an application.
Consider the following code segment running on a GPU. Each thread executes a SINGLE
ITERATION of the loop shown. The data values in arrays A, B, C and D are already in vector
registers i.e. there is no load or store required. There are 64 SIMD lanes in the GPU and a warp
consists of 64 threads.
(Note: there are 7 instructions in each thread and assume that each instruction takes the same time
to execute)

for (i = 0; i < 16*1024*1024; i++) {

    B[i] = A[i] - C[i];     // Instruction 1
    D[i] = A[i] + C[i];     // Instruction 2
    if (A[i] > 0) {         // Instruction 3
        A[i] = A[i] * C[i]; // Instruction 4
        B[i] = A[i] + B[i]; // Instruction 5
        C[i] = B[i] + 1;    // Instruction 6
        D[i] = B[i] - 1;    // Instruction 7
    }
}

a) How many warps does it take to execute the program?

2^18 warps.
Warps = (number of threads)/(number of threads per warp)
Number of threads = 2^24 (one thread per iteration)
Number of threads per warp = 64 = 2^6
Number of warps = 2^24 / 2^6 = 2^18

b) Is it possible for the SIMD utilization to be 50%? If yes what condition needs to be true
for arrays A, B, C and D?
Yes
ARRAY A:
8 out of every 64 consecutive elements of A (the 64 elements processed by one warp) are positive.
A thread whose element of A is positive runs all 7 instructions. Otherwise it runs only
instructions 1-3, so its lane is busy for 3 of the 7 instruction slots, i.e. 3/7 utilization.
So if 1/8 of the threads run at full utilization and 7/8 run at 3/7 utilization, then the total
SIMD utilization becomes 1/8 + (7/8)*(3/7) = 1/8 + 3/8 = 50%.

ARRAY B:
No requirement

ARRAY C:
No requirement

ARRAY D:
No requirement

If it is not possible for the SIMD utilization to be 50%, please explain why.
N/A

c) Is it possible for the SIMD utilization to be 100%? If yes what condition needs to be true
for arrays A, B, C and D? [6 points]
Yes
ARRAY A:
Within every 64 consecutive elements of A (one warp), the elements are either all positive
or all non-positive, so no warp diverges at the branch.

ARRAY B:
No requirement

ARRAY C:
No requirement

ARRAY D:
No requirement

If it is not possible for the SIMD utilization to be 100%, please explain why.
N/A
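The arithmetic in parts a) to c) can be reproduced with a short Python sketch. It assumes that a warp in which no lane takes the branch skips instructions 4-7 entirely, and that a warp in which at least one lane takes it issues all 7 instructions with the inactive lanes idle. The function names are hypothetical.

```python
from fractions import Fraction

LANES = 64                      # SIMD lanes, equal to threads per warp
THREADS = 16 * 1024 * 1024      # one thread per loop iteration
UNCONDITIONAL = 3               # instructions 1-3, executed by every lane
CONDITIONAL = 4                 # instructions 4-7, executed only if A[i] > 0


def warp_count(threads=THREADS, lanes=LANES):
    """Number of warps needed to cover all threads."""
    return threads // lanes


def warp_utilization(positive_lanes, lanes=LANES):
    """Fraction of lane slots kept busy in one warp.

    positive_lanes: number of lanes in the warp whose A[i] is positive.
    """
    busy = UNCONDITIONAL * lanes
    slots = UNCONDITIONAL * lanes
    if positive_lanes > 0:
        # The warp issues instructions 4-7; only the positive lanes are active.
        busy += CONDITIONAL * positive_lanes
        slots += CONDITIONAL * lanes
    return Fraction(busy, slots)


print(warp_count())             # 262144, i.e. 2^18
print(warp_utilization(8))      # 1/2
print(warp_utilization(0))      # 1
print(warp_utilization(64))     # 1
```

With 8 positive lanes per warp the busy slots are 3*64 + 4*8 = 224 out of 7*64 = 448, which is the 50% case. A warp with 0 or 64 positive lanes never idles a lane, which is the 100% case.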

Problem 5 [Synchronization and Consistency] (15 pts.)

In the following code, R1 is a register in each thread. A is a memory address. The
memory system is coherent (store atomic). I01 and I11 are loads from A; I03 and I13
are stores to A.
INIT: A=2
T0 T1
I01: R1:=A I11: R1:=A
I02: R1:=R1+2 I12: R1:=R1+1
I03: A:=R1 I13: A:=R1
1. What are all the possible values of memory location A after this program is
executed?
Answer: (A) = 5,4,3
If instructions of T0 execute before instructions of T1 (and vice
versa) the result is 5 in both cases
The two loads could execute at the same time so that both threads
return 2. Depending on which thread writes last the final value of A is 4
or 3
2. We add a barrier synchronization at the beginning and at the end of the code:
INIT: A=2,bar=0,bar1=0
T0 T1
Barrier(bar,2) Barrier (bar,2)
I01: R1:=A I11: R1:=A
I02: R1:=R1+2 I12: R1:=R1+1
I03: A:=R1 I13: A:=R1
Barrier(bar1,2) Barrier(bar1,2)
What are all the possible values of memory location A after this program is
executed?
Answer: (A) = 5,4,3
The barriers do not change the situation.
Possible results are as in 1.
3. We include the code in a critical section:
INIT: A=2,l=0
T0 T1
Lock(l) Lock(l)
I01: R1:=A I11: R1:=A
I02: R1:=R1+2 I12: R1:=R1+1
I03: A:=R1 I13: A:=R1
Unlock(l) Unlock(l)
What are all the possible values of memory location A after this program is
executed?
Answer: (A) = 5
Here the critical section ensures that the code of T0 and the code of T1 do not overlap
in time: one thread's read-modify-write completes before the other's begins. So the only
possible result is 5.
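The answers to parts 1 and 3 can be cross-checked by enumerating every interleaving of the two three-instruction threads. This is a minimal sketch that assumes each instruction executes atomically and in program order; the function names are hypothetical.

```python
from itertools import combinations


def interleavings(n0, n1):
    """Yield every schedule of n0 steps of T0 and n1 steps of T1 in program order."""
    total = n0 + n1
    for t0_slots in combinations(range(total), n0):
        yield [0 if i in t0_slots else 1 for i in range(total)]


def run(schedule, increments, a_init=2):
    """Execute load / add / store per thread following the schedule; return final A."""
    a = a_init
    regs = [0, 0]               # private R1 of T0 and of T1
    pcs = [0, 0]
    for t in schedule:
        if pcs[t] == 0:
            regs[t] = a                 # I01 / I11: R1 := A
        elif pcs[t] == 1:
            regs[t] += increments[t]    # I02 / I12: R1 := R1 + k
        else:
            a = regs[t]                 # I03 / I13: A := R1
        pcs[t] += 1
    return a


def possible_finals(critical_section):
    """Set of final values of A, with or without the lock around each thread."""
    if critical_section:
        # The lock lets each thread run its three instructions as a unit.
        schedules = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
    else:
        schedules = interleavings(3, 3)
    return {run(s, (2, 1)) for s in schedules}


print(sorted(possible_finals(False)))   # [3, 4, 5]
print(sorted(possible_finals(True)))    # [5]
```

The barrier case of part 2 corresponds to possible_finals(False), since the barriers only delimit the region and do not order the accesses inside it.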
Problem 6 [Coherence and Consistency] (20 pts.)
For this problem we will be using the following sequences of instructions. These are small
programs, each executed on a different processor, each with its own cache and register set. In the
following R is a register and X is a memory location. Each instruction has been named (e.g., B3)
to make it easy to write answers.

Initial value in location X is 0.

Processor A Processor B Processor C

A1: ST X, 1 B1: R2 := LD X C1: ST X, 6
A2: R1 := LD X B2: R2 := ADD R2, 1 C2: R3 := LD X

A3: R1 := ADD R1, R1 B3: ST X, R2 C3: R3 := ADD R3, R3

A4: ST X, R1 B4: R2:= LD X C4: ST X, R3

B5: R2 := ADD R2, R2

B6: ST X, R2

(a) Can X hold the value of 4 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.

(b) Can X hold the value of 5 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.

(c) Can X hold the value of 6 after all three threads have finished execution? If yes please
provide a sequence which results in the value. If not, please explain why.

(d) For this particular problem, can a processor that reorders instructions but follows local
dependencies produce an answer that cannot be produced by the SC model?

Solution:

(a) Yes. C1, B1-B6, A1-A4, C2-C4

(b) No. All results must be even

(c) Yes. All of C, All of A, All of B

(d) No. All stores/loads must be done in order because they’re to the same address, so no new
results are possible.
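Answers (a) to (c) can be checked with an exhaustive search over all sequentially consistent interleavings of the three programs. The sketch below assumes one private register per processor and atomic, in-order instructions; the op encoding and the name final_values are hypothetical.

```python
# Ops: ("st_imm", v) stores a constant, ("ld",) loads X into the register,
# ("add_imm", k) adds a constant, ("dbl",) doubles the register,
# ("st_reg",) stores the register to X.
PROGRAMS = (
    (("st_imm", 1), ("ld",), ("dbl",), ("st_reg",)),                          # A1-A4
    (("ld",), ("add_imm", 1), ("st_reg",), ("ld",), ("dbl",), ("st_reg",)),   # B1-B6
    (("st_imm", 6), ("ld",), ("dbl",), ("st_reg",)),                          # C1-C4
)


def final_values(programs, x0=0):
    """Return every value X can hold once all programs have finished."""
    start = ((0,) * len(programs), x0, (0,) * len(programs))
    seen = set()
    finals = set()
    stack = [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue                      # this state was already explored
        seen.add(state)
        pcs, x, regs = state
        if all(pc == len(p) for pc, p in zip(pcs, programs)):
            finals.add(x)
            continue
        for i, prog in enumerate(programs):
            if pcs[i] == len(prog):
                continue
            op = prog[pcs[i]]
            r, nx = regs[i], x
            if op[0] == "st_imm":
                nx = op[1]
            elif op[0] == "ld":
                r = x
            elif op[0] == "add_imm":
                r = r + op[1]
            elif op[0] == "dbl":
                r = r + r
            else:                         # "st_reg"
                nx = r
            stack.append((
                pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:],
                nx,
                regs[:i] + (r,) + regs[i + 1:],
            ))
    return finals


finals = final_values(PROGRAMS)
print(4 in finals, 5 in finals, 6 in finals)   # True False True
```

Every final value is even because the last store to X is always A4, B6 or C4, each of which writes a doubled register.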

Problem 7 [Consistency and Synchronization] (10 pts.)
In the following code snippet, what are the possible outcomes?
INIT: A=0;B=0;lock=1 /initial values in memory
T0 T1
ADDI R2,R0,#1 ADDI R3,R0,#1
ADDI R1,R0,#2; ADDI R1,R0,#1
SW R1,A SW R1,B
SW R0,lock while(R3!=0)T&S R3,lock;
LW R2,B LW R2,A
R1, R2 and R3 are registers in threads T0 and T1. R0 always contains value 0. A, B
and lock are shared-memory addresses. T&S is an instruction that atomically returns
the value in lock and stores 1 in lock.
To answer this question, give all the possible values of R1, R2 and R3 in T0 and T1
at the end of the execution.

ANSWER:
In Thread T0:
R1 = 2 /Local. As set by the ADDI instruction
R2 = 0 or 1 /The code of T0 could execute before or after the code in T1
In Thread T1:
R1 = 1 /Local. As set by ADDI instruction.
R2 = 2 /Synchronization. The LW into R2 of A must return the value set by T0
R3 = 0 /Because of the while loop
