ST7 SHP 2.1 Multithreading On Multicores 1spp 2
Multithreading on multicores
Stéphane Vialle
[email protected]
http://www.metz.supelec.fr/~vialle
Multithreading on multicores
1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Threads vs Processes
Multithreaded processes
A sequential process:
• in the RAM of one node
• running on one core

A multithreaded process:
• in the RAM of one node
• running on … one or several cores!

[Figure: the memory space of the process is shared by all of its threads; each thread (x, y, z) has its own stack; the stack and code of the main thread belong to the process itself.]
The process threads will distribute themselves over the resources (RAM
and cores) accessible to the process: the whole node, or part of the node.
Threads vs Processes
Examples of deployment
One multithreaded process per node: its threads spread over the cores and the RAM of the whole node.
1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
OpenMP principles
Objectives
Sequential code development:
• design
• implementation
• debug

    Initialisation();
    for (int i = 0; i < N; i++)
        Calcul(i);
    Autre_calcul();

Structure of the parallel code:

    main() {
        ……                      // seq. code
        #pragma omp parallel
        {
            ……                  // replicated code, with variable duration
            #pragma omp barrier // -- synchro --
            ……                  // replicated code
        }
        ……                      // seq. code
    }
OpenMP principles
Parallelism with directives
    main() {
        ……                       // seq. code
        #pragma omp parallel     // parallel region
        {
            ……                   // replicated code
            #pragma omp sections // distributed calculations of various kinds
            {
                #pragma omp section
                { …… }
                #pragma omp section
                { …… }
            }
            ……                   // replicated code
        }
        ……                       // seq. code
    }
OpenMP principles
Parallelism with directives
Parallelization of a sequential function call:

Sequential code:

    main() {
        ……
        f_lib(0, N, SharedTable);
        ……
    }

Parallel code:

    main() {
        ……
        #pragma omp parallel
        {
            // Lower boundary of the thread
            int inf = N/omp_get_num_threads() * omp_get_thread_num();
            // Upper boundary of the thread
            int sup = N/omp_get_num_threads() * (omp_get_thread_num()+1);
            // Call to the sequential library function
            f_lib(inf, sup, SharedTable);
        }
        ……
    }

omp_get_num_threads(): number of threads in the current region
omp_get_thread_num(): rank of the thread

Replicated code, BUT with specific parameters for each thread.
Remark: the function code must be reentrant (avoid global variables), and when N is not a multiple of the number of threads the last thread must extend its upper boundary to N.
OpenMP principles
Parallelism with directives
Hypothesis: 3 OpenMP threads created on a machine with 3 CPU cores, one of which is dedicated to driving a GPU.

    main() {
        ……
        #pragma omp parallel
        {
            switch (omp_get_thread_num()) {
            case 0 :
                ………        // computation on the GPU
                break;
            default :
                ……         // computation on the CPU cores
                break;
            }
        }
        ……
    }
Limitations of OpenMP
OpenMP encounters the classic multithreading limitations:
• synchronization problems (threads blocking each other in shared memory)
• contention problems (threads stopping at the same shared resource)
• false sharing problems ("cache war": distinct variables sharing one cache line keep invalidating each other's caches)
1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Memory access bottleneck
Hardware:
• k RAM access channels per processor
• L1 cache memory per core
• L2 cache memory per subset of
cores, or per processor
• NUMA computing nodes
(Non Uniform Memory Access)
Do you prefer:
• 4 cores at 4.0 - 4.5 GHz, with 4 channels (easier to program)
• 8/12/16 cores at 2.2 GHz, with 4 channels (higher theoretical peak performance)
… ??
Speedup limitation on multicores
Experiments:
• Performance does not increase linearly on multicores!
(optimized OpenBLAS matrix product)
• Our 2x8-cores node was more expensive than our 2x4-cores node,
• but is only a little bit faster!
[Figure: matrix product (OpenBLAS, double precision), Gflops vs number of threads (0 to 35), on a 2x8-cores node at 2.1 GHz (reaching 265 Gflops) and on a 2x4-cores node at 3.5 GHz; both curves flatten well below the theoretical Gflops max.]
The memory access bottleneck is the problem!
Multithreading on multicores
Questions ?