HPC_SUMMARY
May 2025
Contents
1 Introduction
2 OpenMP
  2.1 Overview
  2.2 Compiling and Running
  2.3 Setting the Number of Threads
  2.4 Private and Shared Variables
  2.5 Parallelizing For Loops
  2.6 Scheduling
  2.7 Reduction Clauses
  2.8 Synchronization
  2.9 Sections
3 MPI
  3.1 Overview
  3.2 Compiling and Running
  3.3 General Structure
  3.4 Number of Processes and Rank
  3.5 Message Passing
  3.6 Broadcast
  3.7 Scatter and Gather
  3.8 Reduce
  3.9 Synchronization
  3.10 Cluster Setup
1 Introduction
This document provides a detailed course on OpenMP and MPI, two key technologies
for parallel computing. OpenMP is designed for shared-memory systems using threads,
while MPI is suited for distributed-memory systems using processes. The course includes
concepts, syntax, examples, and comparisons to aid in preparing for your High
Performance Computing exam.
Note: This document uses blue for OpenMP-related content and green for MPI-related
content to enhance clarity.
2 OpenMP
OpenMP (Open Multi-Processing) is an API for shared-memory parallel programming in
C, C++, and Fortran. It uses threads to execute code concurrently on a single machine
with shared memory.
2.1 Overview
• Purpose: Parallelize code on multi-core processors using threads.
• Key Features:
– Shared memory model: All threads access the same memory space.
– Directives to parallelize loops, sections, or tasks.
– Synchronization mechanisms to manage thread interactions.
2.2 Compiling and Running
To compile, use the -fopenmp flag with gcc, for example gcc -fopenmp program.c -o program, and then run the resulting executable as usual.
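A minimal "hello world" example (a sketch; the file name program.c is just an assumption):

    #include <stdio.h>
    #include <omp.h>

    int main() {
        // every thread in the team executes the body of the parallel region
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }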
Note: The order of output lines may vary due to thread scheduling.
2.3 Setting the Number of Threads
The number of threads can be set in three ways:
• Programmatically: omp_set_num_threads(NUM_THREADS);
• With a clause on the directive: #pragma omp parallel num_threads(NUM_THREADS)
• With the environment variable: export OMP_NUM_THREADS=4
Comparison:
    #include <stdio.h>
    #include <omp.h>
    int main() {
        omp_set_num_threads(4); // Programmatic
        #pragma omp parallel
        {
            printf("Thread %d (programmatic)\n", omp_get_thread_num());
        }
        #pragma omp parallel num_threads(4) // Directive
        {
            printf("Thread %d (directive)\n", omp_get_thread_num());
        }
        #pragma omp parallel // Environment (export OMP_NUM_THREADS=4)
        {
            printf("Thread %d (environment)\n", omp_get_thread_num());
        }
        return 0;
    }
Thread 0 (programmatic)
Thread 1 (programmatic)
Thread 2 (programmatic)
Thread 3 (programmatic)
Thread 0 (directive)
Thread 1 (directive)
Thread 2 (directive)
Thread 3 (directive)
Thread 0 (environment)
Thread 1 (environment)
Thread 2 (environment)
Thread 3 (environment)
2.4 Private and Shared Variables
Variables are shared by default. Use private to give each thread its own copy.
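A minimal sketch, assuming a shared accumulator named pub (as in the note below) and a per-thread private copy named priv:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int pub = 0;   // shared: one copy visible to all threads
        int priv = 0;  // listed as private below: each thread gets its own copy
        #pragma omp parallel num_threads(4) shared(pub) private(priv)
        {
            priv = omp_get_thread_num();  // private copy, undefined on entry
            #pragma omp atomic
            pub += priv;                  // accumulate thread IDs in the shared variable
            printf("Thread %d: priv = %d\n", omp_get_thread_num(), priv);
        }
        printf("pub = %d\n", pub);        // sum of thread IDs: 0 + 1 + 2 + 3 = 6
        return 0;
    }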
Expected Output: each thread prints its own private value, and pub ends up holding the sum of the thread IDs (0 + 1 + 2 + 3 = 6 with four threads).
Note: The shared variable pub accumulates thread IDs, and the output order may vary.
2.5 Parallelizing For Loops
#pragma omp for distributes loop iterations across threads. Example: 8 iterations
across 4 threads.
    #include <stdio.h>
    #include <omp.h>
    #define NUM_THREADS 4
    #define NUM_ITERS 8
    int main() {
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            #pragma omp for
            for (int i = 0; i < NUM_ITERS; i++) {
                int id = omp_get_thread_num();
                printf("Thread %d: Loop iteration %d\n", id, i);
            }
        }
        return 0;
    }
Note: The assignment of iterations to threads depends on the scheduling type (default is
static). Output order may vary.
2.6 Scheduling
Scheduling controls how loop iterations are assigned to threads; the clauses are demonstrated in the sketch after this list:
• Static: Divides the loop iterations into equal-sized chunks and assigns each chunk to a
thread in a fixed pattern determined before the loop runs. The optional chunk parameter
specifies the size of each chunk.
(Fixed, equal chunks.)
• Dynamic: Divides the iterations of the loop into chunks of size chunk, and assigns
these chunks dynamically to threads as threads become available to do work.
(Assigns chunks dynamically as threads become available.)
• Guided: Similar to dynamic scheduling, but the size of the chunks decreases over time.
Initially, larger chunks are assigned, which helps reduce scheduling overhead; as the
loop progresses, the chunk size decreases to balance the workload among threads.
(Larger chunks first, decreasing over time.)
• Runtime: Allows the scheduling policy to be determined at runtime using the
OMP_SCHEDULE environment variable or the omp_set_schedule function. This provides
flexibility in choosing the scheduling policy without recompiling the code.
(Set via OMP_SCHEDULE.)
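A minimal sketch of the scheduling clauses, assuming 16 iterations, 4 threads, and a dynamic chunk size of 2 to match the sample output below (the static clause is written the same way, e.g. schedule(static, 4)):

    #include <stdio.h>
    #include <omp.h>
    #define NUM_THREADS 4
    #define NUM_ITERS 16

    int main() {
        printf("Dynamic Scheduling:\n");
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            #pragma omp for schedule(dynamic, 2)  // chunks of 2 handed out on demand
            for (int i = 0; i < NUM_ITERS; i++)
                printf("Thread %d: Iteration %d\n", omp_get_thread_num(), i);
        }

        printf("Guided Scheduling:\n");
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            #pragma omp for schedule(guided)      // chunk size shrinks as the loop progresses
            for (int i = 0; i < NUM_ITERS; i++)
                printf("Thread %d: Iteration %d\n", omp_get_thread_num(), i);
        }

        printf("Runtime Scheduling:\n");
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            #pragma omp for schedule(runtime)     // policy taken from OMP_SCHEDULE
            for (int i = 0; i < NUM_ITERS; i++)
                printf("Thread %d: Iteration %d\n", omp_get_thread_num(), i);
        }
        return 0;
    }

Expected Output: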
Dynamic Scheduling:
Thread 0: Iteration 0
Thread 0: Iteration 1
Thread 1: Iteration 2
Thread 1: Iteration 3
Thread 2: Iteration 4
Thread 2: Iteration 5
Thread 3: Iteration 6
Thread 3: Iteration 7
Thread 0: Iteration 8
Thread 0: Iteration 9
Thread 1: Iteration 10
Thread 1: Iteration 11
Thread 2: Iteration 12
Thread 2: Iteration 13
Thread 3: Iteration 14
Thread 3: Iteration 15
Guided Scheduling:
Thread 0: Iteration 0
Thread 0: Iteration 1
Thread 0: Iteration 2
Thread 0: Iteration 3
Thread 1: Iteration 4
Thread 1: Iteration 5
Thread 1: Iteration 6
Thread 2: Iteration 7
Thread 2: Iteration 8
Thread 2: Iteration 9
Thread 3: Iteration 10
Thread 3: Iteration 11
Thread 0: Iteration 12
Thread 1: Iteration 13
Thread 2: Iteration 14
Thread 3: Iteration 15
Runtime Scheduling:
Thread 0: Iteration 0
Thread 0: Iteration 1
Thread 0: Iteration 2
Thread 0: Iteration 3
Thread 1: Iteration 4
Thread 1: Iteration 5
Thread 1: Iteration 6
Thread 1: Iteration 7
Thread 2: Iteration 8
Thread 2: Iteration 9
Thread 2: Iteration 10
Thread 2: Iteration 11
Thread 3: Iteration 12
Thread 3: Iteration 13
Thread 3: Iteration 14
Thread 3: Iteration 15
Note: Static scheduling assigns fixed chunks (e.g., 4 iterations per thread). Dynamic and
guided scheduling assign chunks as threads become available, with guided reducing
chunk sizes over time. Runtime scheduling depends on OMP_SCHEDULE, so the output
may vary.
2.7 Reduction Clauses
The reduction clause gives each thread its own private copy of a variable and combines the
private copies with the specified operator (for example +) at the end of the parallel loop.
Example: computing a total with reduction(+:total).
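A minimal sketch, assuming the loop sums the integers 1 through 8 (which gives the total shown below):

    #include <stdio.h>
    #include <omp.h>
    #define NUM_THREADS 4
    #define NUM_ITERS 8

    int main() {
        int total = 0;
        // reduction(+:total): every thread works on a private partial sum,
        // and the partial sums are added together at the end of the loop
        #pragma omp parallel for num_threads(NUM_THREADS) reduction(+:total)
        for (int i = 1; i <= NUM_ITERS; i++) {
            total += i;   // 1 + 2 + ... + 8 = 36
        }
        printf("Total = %d\n", total);
        return 0;
    }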
Expected Output:
Total = 36
2.8 Synchronization
• Critical: Ensures that a specific block of code is executed by only one thread at a time;
all threads eventually execute it, just not at the same time. (One thread at a time.)
• Atomic: Protects a single update of a shared variable (e.g. counter++) so it is performed
as one indivisible operation. (One protected update.)
• Single: Specifies that a block of code should be executed by only one thread, but does
not specify which one. (One thread, others wait.)
• Master: Like single, but the block is guaranteed to be executed by the master thread
(the thread with ID 0). (Master thread (ID 0) only.)
• Barrier: Synchronizes all threads in a parallel region. Each thread waits at the barrier
until all threads have reached it, then they all proceed. (All threads wait.)
These constructs are combined in the sketch below.
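A minimal sketch combining these constructs, assuming 4 threads and a shared counter that each thread increments once inside critical and once with atomic (so the final value is 8, as in the note below):

    #include <stdio.h>
    #include <omp.h>
    #define NUM_THREADS 4

    int main() {
        int counter = 0;
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            int id = omp_get_thread_num();

            #pragma omp critical      // one thread at a time
            {
                counter++;
                printf("Thread %d incremented counter to %d\n", id, counter);
            }

            #pragma omp atomic        // protected single update, no printing
            counter++;

            #pragma omp single        // exactly one (unspecified) thread
            printf("Thread %d in single section\n", id);

            #pragma omp master        // only the thread with ID 0
            printf("Master thread %d\n", id);

            #pragma omp barrier       // wait for all threads before continuing
            printf("Thread %d passed barrier\n", id);
        }
        printf("Final counter: %d\n", counter);
        return 0;
    }

Expected Output: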
Thread 0 incremented counter to 1
Thread 1 incremented counter to 2
Thread 2 incremented counter to 3
Thread 3 incremented counter to 4
Thread 2 in single section
Master thread 0
Thread 0 passed barrier
Thread 1 passed barrier
Thread 2 passed barrier
Thread 3 passed barrier
Final counter: 8
Note: The order of critical section outputs may vary. The single section is executed by
one thread (ID varies). The final counter is 8 (4 from critical + 4 from atomic).
2.9 Sections
#pragma omp sections assigns independent tasks to threads. Example: Parallelizing
two functions.
    #include <stdio.h>
    #include <omp.h>
    int f1(int b, int c) { return b + c; }
    int f2(int d, int e) { return d * e; }
    int main() {
        int a = 0, b = 1, c = 2, d = 3, e = 4;
        int a1, a2;
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                a1 = f1(b, c);
                printf("Section 1: a1 = %d\n", a1);
            }
            #pragma omp section
            {
                a2 = f2(d, e);
                printf("Section 2: a2 = %d\n", a2);
            }
        }
        a = a1 + a2;
        printf("a = %d\n", a);
        return 0;
    }
Expected Output:
Section 1: a1 = 3
Section 2: a2 = 12
a = 15
Note: The order of section outputs may vary due to thread assignment.
3 MPI
MPI (Message Passing Interface) is a standard for distributed-memory parallel comput-
ing using processes.
3.1 Overview
• Purpose: Parallelize code across multiple nodes/processes.
• Key Features:
– Distributed memory model: each process has its own address space.
– Explicit communication: processes exchange data via messages (MPI_Send, MPI_Recv).
– Collective operations: broadcast, scatter/gather, and reduce across all processes.
3.2 Compiling and Running
To compile, use the mpicc wrapper, e.g. mpicc program.c -o program. To run with, for
example, four processes: mpirun -np 4 ./program (or mpiexec -n 4 ./program).
3.3 General Structure
• MPI_Init(&argc, &argv);
• MPI_Finalize();
• MPI_Comm_rank(MPI_COMM_WORLD, &rank);
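A minimal skeleton combining these calls (a sketch; error checking is omitted):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);                 // start the MPI runtime

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // ID of this process

        printf("Hello from process %d\n", rank);  // parallel work goes here

        MPI_Finalize();                         // shut down the MPI runtime
        return 0;
    }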
3.4 Number of Processes and Rank
MPI_Comm_size(MPI_COMM_WORLD, &size); returns the total number of processes, and
MPI_Comm_rank(MPI_COMM_WORLD, &rank); returns the ID (rank) of the calling process,
from 0 to size - 1.
3.5 Message Passing
MPI_Send and MPI_Recv exchange data between two specific processes (point-to-point
communication); a receive blocks until a matching message arrives.
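A minimal point-to-point sketch, assuming rank 0 sends a single integer (here 42) to rank 1:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value;
        if (rank == 0) {
            value = 42;
            // send one int to rank 1 with message tag 0
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // blocking receive of one int from rank 0 with tag 0
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }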
3.6 Broadcast
MPI_Bcast sends data from a root process to all others. Example: Broadcasting a num-
ber.
    #include <stdio.h>
    #include <mpi.h>
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int number = 0;
        if (rank == 0) {
            number = 100;
        }
        MPI_Bcast(&number, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Process %d has number %d\n", rank, number);
        MPI_Finalize();
        return 0;
    }
3.7 Scatter and Gather
MPI_Scatter distributes equal-sized pieces of an array from the root process to all
processes; MPI_Gather collects one result from each process back at the root. Example:
scattering four values and gathering the doubled results.
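A minimal sketch consistent with the output below, assuming 4 processes, where each process doubles the value it receives:

    #include <stdio.h>
    #include <mpi.h>
    #define N 4   // assumes the program is started with 4 processes

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int data[N] = {10, 20, 30, 40};   // only used at the root
        int recv, result, gathered[N];

        // the root sends one element of data to each process
        MPI_Scatter(data, 1, MPI_INT, &recv, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Process %d received %d\n", rank, recv);

        result = 2 * recv;                // each process doubles its value

        // the root collects one result from each process
        MPI_Gather(&result, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("Gathered data:");
            for (int i = 0; i < N; i++) printf(" %d", gathered[i]);
            printf("\n");
        }
        MPI_Finalize();
        return 0;
    }

Expected Output (with 4 processes):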
Process 0 received 10
Process 1 received 20
Process 2 received 30
Process 3 received 40
Gathered data: 20 40 60 80
Note: The order of scatter outputs may vary, but the gathered data is deterministic.
3.8 Reduce
MPI_Reduce combines data using an operation. Example: Summing values.
    #include <stdio.h>
    #include <mpi.h>
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int value = rank + 1;
        int result;
        MPI_Reduce(&value, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("Sum of values: %d\n", result);
        }
        MPI_Finalize();
        return 0;
    }
Expected Output (with 4 processes):
Sum of values: 10
3.9 Synchronization
MPI_Barrier(MPI_COMM_WORLD); makes every process wait until all processes have reached the barrier. A barrier alone does not order output, so the "Hello World" example below instead chains MPI_Send/MPI_Recv so that each process prints only after its predecessor has printed.
    #include <stdio.h>
    #include <mpi.h>
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int size, rank;
        char dummy;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank != 0) {
            int src = (rank + size - 1) % size;
            MPI_Recv(&dummy, 1, MPI_CHAR, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        printf("Process %d is saying Hello World.\n", rank);
        if (rank != size - 1) {
            MPI_Send(&dummy, 1, MPI_CHAR, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }
3.10 Cluster Setup
On each slave (worker) node:
1. Update the system.
Table 1: Comparison of OpenMP and MPI
Aspect        | OpenMP                         | MPI
Memory model  | Shared memory (single machine) | Distributed memory (multiple nodes)
Parallel unit | Threads                        | Processes
Communication | Shared variables               | Explicit messages (MPI_Send/MPI_Recv)
Compilation   | gcc -fopenmp                   | mpicc
• Memorize: Key directives (#pragma omp parallel) and MPI functions (MPI_Send).