Par - 1 In-Term Exam - Course 2018/19-Q2
Problem 1 (2.0 points) Given the following parallel code in OpenMP, in which vector S is transformed into
vector D in such a way that the positions in S of all those elements with the same value of S[i]%256 are stored
in consecutive positions of D. At the end of the main program vector D will contain the positions of all those
elements with value %256 equal to 0, followed by the positions of all those elements with value %256 equal
to 1, ... up to those with value %256 equal to 255.
#define N 1024*1024*1024
unsigned int S[N], D[N], C[256];

void find_groups(unsigned int *S, unsigned int *C) {
    unsigned int TMP[256];
    ...
    C[0] = 0;
    for (int i=1; i<256; i++) C[i] = C[i-1] + TMP[i-1];
}

void transform_vector(unsigned int *S, unsigned int *D, unsigned int *C) {
    unsigned int i, value;
    ...
}

void main() {
    find_groups(S, C);
    transform_vector(S, D, C);
}
To do the transformation, function find_groups builds a vector TMP so that element TMP[value] indicates
the number of elements in S for which S[i]%256 = value; based on this vector TMP, function
find_groups builds and returns vector C so that element C[value] indicates the initial position in D
where to store the information for all those elements whose S[i]%256 = value.
We ask you:
1. (0.5 points) Rewrite the code changing the synchronisation construct that is used in function
find_groups in order to reduce the overhead that is incurred in the parallel update of vector TMP.
Solution: To protect the update of vector TMP in function find_groups we just need an atomic
operation, which allows more concurrency and reduces the synchronisation overhead on the shared data.
...
#pragma omp taskloop private(value) num_tasks(1024)
for (i=0; i<N; i++) {
value = S[i]%256;
#pragma omp atomic
TMP[value]++;
}
...
2. (1.0 point) Rewrite the code changing the synchronisation construct that is used in function
transform_vector in order to maximise the possible parallelism in the update of the elements
of D and C.
Solution: To protect the update of the other two vectors in function transform_vector we use a
vector of OpenMP locks, one per group, which allows concurrent updates of different regions of vector D
and their associated counters in C. The locks must be created before their first use and destroyed at the end.
void transform_vector(unsigned int *S, unsigned int *D, unsigned int *C) {
unsigned int i, value;
omp_lock_t lock_vector[256];
3. (0.5 points) Assuming the following new version for function transform_vector:
void transform_vector(unsigned int *S, unsigned int *D, unsigned int *C) {
unsigned int i, value;
#pragma omp parallel
#pragma omp single
#pragma omp taskloop private(i) grainsize(4)
for (value=0; value<256; value++)
for (i=0; i<N; i++)
if (S[i]%256 == value) {
D[C[value]] = i;
C[value]++;
}
}
Insert the necessary synchronisation constructs that guarantee the correct update of the elements of
vectors D and C.
Solution: Since each task is assigned a different range of values for value, different tasks update
disjoint elements of C and disjoint regions of D, so there is no need to synchronise the accesses to D and C.
Problem 2 (2.0 points) Given the following task dependence graphs for three different parallelization strate-
gies of a sequential code:
[Figure: task dependence graphs. Strategy A: a single chain 0 → 1 → ... → 8. Strategy B: three
independent chains 0 → 1 → 2, 3 → 4 → 5 and 6 → 7 → 8. Strategy C: task 0, followed by tasks 1-5,
followed by tasks 6-8.]
1. (1.0 point) Compute the Parallelism and Pmin metrics for each one of the three dependence graphs
assuming that the cost of executing each task is tc time units.
Solution: For all strategies T1 = 9 × tc . Then for each strategy we have:
(a) Strategy A: T∞ = 9 × tc so Parallelism= T1 ÷ T∞ = 1; the minimum number of processors to
achieve this parallelism is Pmin = 1.
(b) Strategy B: T∞ = 3 × tc , Parallelism= 3 and Pmin = 3.
(c) Strategy C: T∞ = 3 × tc , Parallelism= 3 and Pmin = 4.
2. (1.0 point) Assuming a multiprocessor with P = 3 processors and the following mapping of tasks to
processors for each strategy:
• Strategy A and B: P 0 ← {0, 3, 6}; P 1 ← {1, 4, 7}; P 2 ← {2, 5, 8}.
• Strategy C: P 0 ← {0, 1, 2}, P 1 ← {3, 4, 6}; P 2 ← {5, 7, 8}.
Obtain the general expression for the speed-up SP=3 for each strategy and associated mapping, assuming
that there is an overhead related to task synchronisation of tsynch time units, i.e. the overhead
that a task has to pay to signal to ALL its successor tasks that it has finished.
Solution: For each parallel strategy we have:
(a) Strategy A: every task (except the last) has to signal its only successor, which runs on a different
processor, so T3 = 9×tc + 8×tsynch and S3 = (9×tc) ÷ (9×tc + 8×tsynch), always below 1.
(b) Strategy B: the three initial tasks {0, 3, 6} are all mapped to P0, so the chains serialise at their
start; following the timeline, the last task (8) cannot start until 4×tc + 4×tsynch, so
T3 = 5×tc + 4×tsynch and S3 = (9×tc) ÷ (5×tc + 4×tsynch).
(c) Strategy C: T3 = 4×tc + 3×tsynch, and therefore S3 = (9×tc) ÷ (4×tc + 3×tsynch).
Problem 3 (3.0 points) Given the following sequential code:
#define N 1024*1024*1024
int S[N][N], D[N][N];
void main() {
...
A2B_process(S,D);
...
A2A_process(D);
...
}
We ask you:
1. (1.0 point) Assume that PROCESS is an operation whose cost depends on the input data it receives,
that is, its execution can significantly vary depending on the value of the input arguments. Write an
OpenMP parallelisation for A2B_process following an Iterative Task Decomposition strategy that
tries to maximise the load balance. Justify the directives used in your parallel code.
Solution:
There are no dependences. Two possible solutions are proposed:
(a) First solution: use omp parallel for with implicit tasks, adding collapse(2) and
schedule(dynamic), which by default is (dynamic,1), so that only one PROCESS computation is
scheduled at a time. With this fine-grain scheduling we try to maximise the load balance of the
work among threads.
(b) Second solution: use explicit tasks so that each task is one invocation of PROCESS. This way
any idle thread will execute one task at a time, again maximising the load balance among threads.
#define N 1024*1024*1024
unsigned int S[N][N], D[N][N];
// Operation using a, b, c, and d. It doesn’t operate with any other shared memory
void PROCESS(unsigned int a, unsigned int b, unsigned int c, unsigned int d);
2. (1.0 point) Assume that PROCESS is a time consuming operation (coarse grain) whose cost is always the
same. Write an OpenMP parallelisation for A2A_process using Implicit Tasks following an Iterative
Task Decomposition strategy.
Solution:
There are dependences. We use doacross loops and specify the dependences between iterations. As load
balance is not a problem in this exercise, we keep the default static schedule for the iterations of loop i.
#define N 1024*1024*1024
unsigned int S[N][N], D[N][N];
3. (1.0 point) Assume that PROCESS is a time consuming operation (coarse grain) whose cost is always the
same. Write an OpenMP parallelisation for A2A_process using Explicit Tasks following an Iterative
Task Decomposition strategy.
Solution:
There are dependences. We use explicit tasks with in and out data dependences. Although only the left
(A[i][j-1]) and up (A[i-1][j]) true data dependences need to be defined, we define all of them. Load
balance is not a problem here since tasks are dynamically scheduled. Task creation overhead is not a
problem either, since PROCESS (the computation of one task) is said to be very time consuming.
#define N 1024*1024*1024
unsigned int S[N][N], D[N][N];
Problem 4 (3.0 points) Given the following sequential code:
#define N 1000000
#define MINSIZE 4
#define MAXGRAINSIZE MAXROW
typedef struct {
int size; // size is always smaller than or equal to MAXROW
float *data;
} tRow;
void main () {
tRow rows[N];
// initialization of rows
// each row can have different size
...
process_rows_rec (rows, N);
}
We ask you:
1. (1.0 point) Write a parallel version in OpenMP implementing a Recursive Task Decomposition following
a Tree strategy.
Solution:
#define N 1000000
#define MINSIZE 4
#define MAXGRAINSIZE MAXROW
typedef struct {
int size; // size is always smaller than or equal to MAXROW
float *data;
} tRow;
void main () {
tRow rows[N];
// initialization of rows
// each row can have different size
...
#pragma omp parallel
#pragma omp single
process_rows_rec (rows, N);
}
2. (2.0 points) Modify the previous parallel code to control task generation, not allowing task creation
when granularity (i.e. total number of elements to be processed) is smaller than MAXGRAINSIZE. You
should not use the OpenMP mergeable clause.
Solution:
In order to have final tasks with granularity less than or equal to MAXGRAINSIZE, the cutoff control
has to be done on the number of elements of the partition to be processed.
#define N 1000000
#define MINSIZE 4
#define MAXGRAINSIZE MAXROW
typedef struct {
int size; // size value is always less than or equal to MAXROW
float *data;
} tRow;