ST7 SHP 2.1 Multithreading On Multicores 1spp 2
Multithreading on multicores
Stéphane Vialle
[email protected]
http://www.metz.supelec.fr/~vialle
Multithreading on multicores
1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Threads vs Processes
Multithreaded processes
A sequential process:
• in the RAM of one node
• running on one core

A multithreaded process:
• in the RAM of one node
• running on … one or several cores!

[Figure: the memory space of the process is shared by all of its threads; each thread (x, y, z) has its own stack; the stack and code of the main thread belong to the process itself.]
The process threads will distribute themselves over the resources (RAM
and cores) accessible to the process: the whole node, or part of the node.
Threads vs Processes
Examples of deployment
One multithreaded process per node: its threads spread over the cores and the RAM of the whole node.
1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
OpenMP principles
Objectives
Sequential code development:
• design
• implementation
• debug

    Initialisation();
    for (int i = 0; i < N; i++)
        Calcul(i);
    Autre_calcul();

Structure of the parallel code:

    main() {
        ……                      // seq. code
        #pragma omp parallel
        {
            ……                  // replicated code, with variable duration
            #pragma omp barrier // -- synchro --
            ……                  // replicated code
        }
        ……                      // seq. code
    }
OpenMP principles
Parallelism with directives
    main() {
        ……                       // seq. code
        #pragma omp parallel     // parallel region
        {
            ……                   // replicated code
            #pragma omp sections // distributed calculations of various kinds
            {
                #pragma omp section
                { …… }
                #pragma omp section
                { …… }
            }
            ……                   // replicated code
        }
        ……                       // seq. code
    }
OpenMP principles
Parallelism with directives
Parallelization of a sequential function call:

Sequential code:

    main() {
        ……
        f_lib(0, N, SharedTable);
        ……
    }

Parallel code:

    main() {
        ……
        #pragma omp parallel
        {
            // Lower boundary of the thread
            int inf = N/omp_get_num_threads() * omp_get_thread_num();
            // Upper boundary of the thread
            int sup = N/omp_get_num_threads() * (omp_get_thread_num()+1);
            // Call to the sequential library function
            f_lib(inf, sup, SharedTable);
        }
        ……
    }

omp_get_num_threads(): number of threads in the current region
omp_get_thread_num(): rank of the thread

Replicated code, BUT with specific parameters for each thread.
Remark: the function code must be reentrant (avoid global variables), and when N is not a multiple of the number of threads the last thread must extend its upper boundary to N.
OpenMP principles
Parallelism with directives
Hypothesis: 3 OpenMP threads created on a machine with 3 CPU cores, one of which is dedicated to driving a GPU.

    main() {
        ……
        #pragma omp parallel
        {
            switch (omp_get_thread_num()) {
            case 0 :
                ………        // computation on the GPU
                break;
            default :
                ……         // computation on the CPU cores
                break;
            }
        }
        ……
    }
Limitations of OpenMP
OpenMP encounters the classic multithreading limitations:
• synchronization problems (threads blocking each other in shared memory)
• contention problems (threads stopping at the same shared resource)
• false sharing problems ("cache war": distinct variables sharing one cache line keep invalidating each other's caches)
1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Memory access bottleneck
Hardware:
• k RAM access channels per processor
• L1 cache memory per core
• L2 cache memory per subset of
cores, or per processor
• NUMA computing nodes
(Non Uniform Memory Access)
Do you prefer:
• 4 cores at 4.0 - 4.5 GHz, with 4 channels (easier to program)
• 8/12/16 cores at 2.2 GHz, with 4 channels (higher theoretical peak performance)
… ??
Speedup limitation on multicores
Experiments:
• Performance does not increase linearly on multicores!
(optimized OpenBLAS matrix product)
• Our 2x8-cores node was more expensive than our 2x4-cores node,
• but is only a little bit faster!
[Figure: matrix product (OpenBLAS, double precision), Gflops vs number of threads (0 to 35), on a 2x8-cores node at 2.1 GHz (reaching 265 Gflops) and on a 2x4-cores node at 3.5 GHz; both curves flatten well below the theoretical Gflops max.]
The memory access bottleneck is the problem!
Multithreading on multicores
Questions ?