
PAR – 1st In-Term Exam – Course 2018/19-Q2

April 3rd, 2019

Problem 1 (2.0 points) Given the following parallel OpenMP code, in which vector S is transformed into
vector D in such a way that the positions in S of all elements with the same value of S[i]%256 are stored
in consecutive positions of D. At the end of the main program, vector D will contain the positions of all
elements whose value %256 equals 0, followed by the positions of all elements whose value %256 equals
1, ... up to those whose value %256 equals 255.

#define N 1024*1024*1024
unsigned int S[N], D[N], C[256];

void find_groups(unsigned int *S, unsigned int *C) {
unsigned int i, value, TMP[256];

#pragma omp parallel
#pragma omp single
{
#pragma omp taskloop grainsize(4)
for (i=0; i<256; i++) TMP[i]=0;

#pragma omp taskloop private(value) num_tasks(1024)
for (i=0; i<N; i++) {
value = S[i]%256;
#pragma omp critical
TMP[value]++;
}
}

C[0] = 0;
for (int i=1; i<256; i++) C[i] = C[i-1] + TMP[i-1];
}

void transform_vector(unsigned int *S, unsigned int *D, unsigned int *C) {
unsigned int i, value;

#pragma omp parallel
#pragma omp single
#pragma omp taskloop private(value) num_tasks(1024)
for (i=0; i<N; i++) {
value = S[i]%256;
#pragma omp critical
{
D[C[value]] = i;
C[value]++;
}
}
}

void main() {
find_groups(S, C);
transform_vector(S, D, C);
}
To do the transformation, function find_groups builds a vector TMP such that element TMP[value]
indicates the number of elements in S for which S[i]%256 == value; based on this vector TMP, function
find_groups builds and returns vector C such that element C[value] indicates the initial position in D
where to store the information for all elements whose S[i]%256 == value.
We ask you:

1. (0.5 points) Rewrite the code changing the synchronisation construct that is used in function
find_groups in order to reduce the overhead that is incurred in the parallel update of vector TMP.
Solution: To protect the update of vector TMP in function find_groups an atomic operation suffices,
which allows more concurrency and reduces the overhead of the data sharing.

...
#pragma omp taskloop private(value) num_tasks(1024)
for (i=0; i<N; i++) {
value = S[i]%256;
#pragma omp atomic
TMP[value]++;
}
...

2. (1.0 point) Rewrite the code changing the synchronisation construct that is used in function
transform_vector in order to maximise the possible parallelism in the update of the elements
of D and C.
Solution: To protect the updates of the two vectors in function transform_vector we use a vector of
OpenMP locks, one per group, so that the update of each region of vector D and its associated counter
in C can proceed independently of the others. The locks (declared in <omp.h>) have to be initialised
and destroyed at some point.

void transform_vector(unsigned int *S, unsigned int *D, unsigned int *C) {
unsigned int i, value;
omp_lock_t lock_vector[256];

#pragma omp parallel
#pragma omp single
{
#pragma omp taskloop grainsize(4)
for(i=0; i<256; i++) omp_init_lock(&lock_vector[i]);

#pragma omp taskloop private(value) num_tasks(1024)
for (i=0; i<N; i++) {
value = S[i]%256;
omp_set_lock(&lock_vector[value]);
D[C[value]] = i;
C[value]++;
omp_unset_lock(&lock_vector[value]);
}

#pragma omp taskloop grainsize(4)
for(i=0; i<256; i++) omp_destroy_lock(&lock_vector[i]);
}
}

3. (0.5 points) Assuming the following new version for function transform_vector:

void transform_vector(unsigned int *S, unsigned int *D, unsigned int *C) {
unsigned int i, value;
#pragma omp parallel
#pragma omp single
#pragma omp taskloop private(i) grainsize(4)
for (value=0; value<256; value++)
for (i=0; i<N; i++)
if (S[i]%256 == value) {
D[C[value]] = i;
C[value]++;
}
}

Insert the necessary synchronisation constructs that guarantee the correct update of the elements of
vectors D and C.
Solution: Since each task is assigned different values of value, no two tasks update the same elements
of D and C, so there is no need to synchronise the accesses.

Problem 2 (2.0 points) Given the following task dependence graphs for three different parallelization strate-
gies of a sequential code:
[Figure: task dependence graphs. Strategy A: a single chain 0→1→2→…→8. Strategy B: three independent
chains 0→1→2, 3→4→5 and 6→7→8. Strategy C: task 0 precedes tasks 1–5, which in turn precede tasks 6–8.]

Answer the following questions:

1. (1.0 point) Compute the Parallelism and Pmin metrics for each one of the three dependence graphs
assuming that the cost of executing each task is tc time units.
Solution: For all strategies T1 = 9 × tc . Then for each strategy we have:
(a) Strategy A: T∞ = 9 × tc so Parallelism= T1 ÷ T∞ = 1; the minimum number of processors to
achieve this parallelism is Pmin = 1.
(b) Strategy B: T∞ = 3 × tc , Parallelism= 3 and Pmin = 3.
(c) Strategy C: T∞ = 3 × tc , Parallelism= 3 and Pmin = 4.
2. (1.0 point) Assuming a multiprocessor with P = 3 processors and the following mapping of tasks to
processors for each strategy:
• Strategy A and B: P 0 ← {0, 3, 6}; P 1 ← {1, 4, 7}; P 2 ← {2, 5, 8}.
• Strategy C: P 0 ← {0, 1, 2}, P 1 ← {3, 4, 6}; P 2 ← {5, 7, 8}.
Obtain the general expression for the speed-up SP=3 for each strategy and associated mapping, assuming
that there is an overhead related to task synchronisation of tsynch time units, i.e. the overhead
that a task has to pay to signal ALL its successor tasks that it has finished.
Solution: For each parallel strategy we have:

(a) Strategy A: T3 = 9 × tc + 8 × tsynch; therefore S3 = (9 × tc) ÷ (9 × tc + 8 × tsynch).
(b) Strategy B: T3 = 3 × tc + 2 × tsynch; therefore S3 = (9 × tc) ÷ (3 × tc + 2 × tsynch).
(c) This figure shows the timeline with the parallel execution of the tasks and the necessary
synchronisation overheads between them:
[Figure: Strategy C timeline. P0 runs tasks 0, 1, 2; P1 runs tasks 3, 4, 6; P2 runs tasks 5, 7, 8.]
Therefore, for Strategy C: T3 = 4 × tc + 3 × tsynch; and therefore S3 = (9 × tc) ÷ (4 × tc + 3 × tsynch).

Problem 3 (3.0 points) Given the following sequential code:

#define N 1024*1024*1024
int S[N][N], D[N][N];

// Operation using a, b, c, and d. It doesn't access/modify any other shared memory
int PROCESS(int a, int b, int c, int d);

void A2B_process(int A[N][N], int B[N][N]) {
for(unsigned int i=1; i<N-1; i++)
for(unsigned int j=1; j<N-1; j++)
A[i][j] += PROCESS(B[i-1][j],B[i][j-1],B[i-1][j-1],B[i+1][j+1]);
}

void A2A_process(int A[N][N]) {
for(unsigned int i=1; i<N-1; i++)
for(unsigned int j=1; j<N-1; j++)
A[i][j] += PROCESS(A[i-1][j],A[i][j-1],A[i-1][j-1],A[i+1][j+1]);
}

void main() {
...
A2B_process(S,D);
...
A2A_process(D);
...
}

We ask you:

1. (1.0 point) Assume that PROCESS is an operation whose cost depends on the input data it receives,
that is, its execution can significantly vary depending on the value of the input arguments. Write an
OpenMP parallelisation for A2B_process following an Iterative Task Decomposition strategy that
tries to maximise the load balance we can achieve. Explain the reasons for the directives in your parallel code.
Solution:
There are no dependences. Two possible solutions are proposed:

(a) First solution: use omp parallel for with implicit tasks, adding collapse(2) and
schedule(dynamic) (which defaults to (dynamic,1)) so that each thread dynamically grabs a
single PROCESS computation at a time. With this fine-grain scheduling we try to maximise
the load balance of the work among threads.
(b) Second solution: use explicit tasks so that each task is one invocation of the PROCESS
computation. This way any idle thread executes one task at a time, again trying to maximise
the load balance among threads.
#define N 1024*1024*1024
unsigned int S[N][N], D[N][N];

// Operation using a, b, c, and d. It doesn't operate with any other shared memory
unsigned int PROCESS(unsigned int a, unsigned int b, unsigned int c, unsigned int d);

// First Solution

void A2B_process(unsigned int A[N][N], unsigned int B[N][N])
{
unsigned int i,j;

#pragma omp parallel for collapse(2) schedule(dynamic)
for(i=1; i<N-1; i++)
for(j=1; j<N-1; j++)
A[i][j]+= PROCESS(B[i-1][j],B[i][j-1],B[i-1][j-1],B[i+1][j+1]);
}

// Second Alternative Solution:

void A2B_process(unsigned int A[N][N], unsigned int B[N][N])
{
unsigned int i,j;

#pragma omp parallel
#pragma omp single
for(i=1; i<N-1; i++)
for(j=1; j<N-1; j++)
#pragma omp task firstprivate(i,j)
A[i][j]+= PROCESS(B[i-1][j],B[i][j-1],B[i-1][j-1],B[i+1][j+1]);
}

2. (1.0 point) Assume that PROCESS is a time consuming operation (coarse grain) whose cost is always the
same. Write an OpenMP parallelisation for A2A_process using Implicit Tasks following an Iterative
Task Decomposition strategy.
Solution:
There are dependences. We use a doacross loop and specify the dependences between iterations. As load
balance is not a problem in this exercise, we keep the default static schedule for the iterations of loop i.

#define N 1024*1024*1024
unsigned int S[N][N], D[N][N];

void A2A_process(unsigned int A[N][N])
{
unsigned int i,j;
#pragma omp parallel for ordered(2) private(j)
for(i=1; i<N-1; i++)
for(j=1; j<N-1; j++)
{
#pragma omp ordered depend(sink:i-1,j) depend(sink: i,j-1)
A[i][j]+= PROCESS(A[i-1][j],A[i][j-1],A[i-1][j-1],A[i+1][j+1]);
#pragma omp ordered depend(source)
}

}
3. (1.0 point) Assume that PROCESS is a time consuming operation (coarse grain) whose cost is always the
same. Write an OpenMP parallelisation for A2A_process using Explicit Tasks following an Iterative
Task Decomposition strategy.
Solution:
There are dependences. We use explicit tasks with in and out data dependences. Although only the left
(A[i][j-1]) and up (A[i-1][j]) true data dependences would need to be declared, we declare all of them.
Load balance is not a problem here since tasks are dynamically scheduled, and task creation overhead is
not a problem either since PROCESS (one task's computation) is a very time consuming operation.

#define N 1024*1024*1024
unsigned int S[N][N], D[N][N];

void A2A_process(unsigned int A[N][N])
{
#pragma omp parallel
#pragma omp single
{
unsigned int i,j;
for(i=1; i<N-1; i++)
for(j=1; j<N-1; j++)
#pragma omp task depend(in:A[i-1][j],A[i][j-1],A[i-1][j-1],A[i+1][j+1]) \
depend(out:A[i][j])
A[i][j]+= PROCESS(A[i-1][j],A[i][j-1],A[i-1][j-1],A[i+1][j+1]);

}
}

Problem 4 (3.0 points) Given the following C code:

#define N 1000000
#define MINSIZE 4
#define MAXGRAINSIZE MAXROW

typedef struct {
int size; // size is always smaller than or equal to MAXROW
float *data;
} tRow;

// Function partition is already implemented:
// 1) it finds the index such that total number of
// elements is well-balanced between both partitions
// (index belongs to right partition), and
// 2) it returns the total number of elements in each partition
void partition (tRow *rows, int nrows, int *index, int *nelem_left, int *nelem_right);

void process_rows (tRow *rows, int nrows) {
for (int r=0; r<nrows; r++)
for (int i=0; i<rows[r].size; i++)
foo (&rows[r].data[i]); // only modifies the parameter
}

void process_rows_rec (tRow *rows, int nrows) {
int index, nelem_left, nelem_right;

if (nrows < MINSIZE)
process_rows (rows, nrows);
else {
partition (rows, nrows, &index, &nelem_left, &nelem_right);
process_rows_rec (rows, index);
process_rows_rec (&rows[index], nrows-index);
}
return;
}

void main () {
tRow rows[N];
// initialization of rows
// each row can have different size
...
process_rows_rec (rows, N);
}

We ask you:

1. (1.0 point) Write a parallel version in OpenMP implementing a Recursive Task Decomposition following
a Tree strategy.
Solution:

#define N 1000000
#define MINSIZE 4
#define MAXGRAINSIZE MAXROW

typedef struct {
int size; // size is always smaller than or equal to MAXROW
float *data;
} tRow;

// finds the index such that total number of elements
// is well-balanced between both partitions and
// returns also the number of elements on each partition
// (index belongs to right partition)
void partition (tRow *rows, int nrows, int *index, int *nelem_left, int *nelem_right);

void process_rows (tRow *rows, int nrows) {
for (int r=0; (r<nrows); r++)
for (int i=0; (i<rows[r].size); i++)
foo (&rows[r].data[i]); // only modifies the parameter
}

void process_rows_rec (tRow *rows, int nrows) {
int index, nelem_left, nelem_right;

if (nrows < MINSIZE)
process_rows (rows, nrows);
else {
partition (rows, nrows, &index, &nelem_left, &nelem_right);
#pragma omp task
process_rows_rec (rows, index);
#pragma omp task
process_rows_rec (&rows[index], nrows-index);
}
}

void main () {
tRow rows[N];
// initialization of rows
// each row can have different size
...
#pragma omp parallel
#pragma omp single
process_rows_rec (rows, N);
}

2. (2.0 points) Modify the previous parallel code to control task generation, not allowing task creation
when granularity (i.e. total number of elements to be processed) is smaller than MAXGRAINSIZE. You
should not use the OpenMP mergeable clause.
Solution:
In order to end up with final tasks whose granularity is less than or equal to MAXGRAINSIZE, the cutoff
has to be controlled with the total number of elements of the partition to be processed, not with its
number of rows.

#define N 1000000
#define MINSIZE 4
#define MAXGRAINSIZE MAXROW

typedef struct {
int size; // size value is always less than or equal to MAXROW
float *data;
} tRow;

// finds the index such that total number of elements
// is well-balanced between both partitions and
// returns also the number of elements on each sub-partition
// (index belongs to right partition)
void partition (tRow *rows, int nrows, int *index, int *nelem_left, int *nelem_right);

void process_rows (tRow *rows, int nrows) {
for (int r=0; (r<nrows); r++)
for (int i=0; (i<rows[r].size); i++)
foo (&rows[r].data[i]); // only modifies the parameter
}

void process_rows_rec (tRow *rows, int nrows) {
int index, nelem_left, nelem_right;

if (nrows < MINSIZE)
process_rows (rows, nrows);
else {
partition (rows, nrows, &index, &nelem_left, &nelem_right);
if (!omp_in_final()) {
#pragma omp task final (nelem_left <= MAXGRAINSIZE)
process_rows_rec (rows, index);
#pragma omp task final (nelem_right <= MAXGRAINSIZE)
process_rows_rec (&rows[index], nrows-index);
}
else {
process_rows_rec (rows, index);
process_rows_rec (&rows[index], nrows-index);
}
}
return;
}
void main () {
tRow rows[N];
// initialization of rows
// each row can have different size
...
#pragma omp parallel
#pragma omp single
process_rows_rec (rows, N);
}
