OpenPOWER Application Optimization
SCOPE OF THE PRESENTATION
• Outline tuning strategies to improve the performance of programs on POWER9 processors
• Performance bottlenecks can arise in the processor front end and back end
• Let's discuss some of these bottlenecks and how we can work around them using compiler flags and source-code pragmas/attributes
• This talk refers to compiler options supported by open-source compilers such as GCC. The latest publicly available version is 9.2.0, which is what we will use for the hands-on. Most of it carries over to LLVM as is; a slight variation works with IBM proprietary compilers such as XL
POWER9 PROCESSOR
• Optimized for Stronger Thread Performance and Efficiency
• Increased Execution Bandwidth efficiency for a range of workloads including commercial, cognitive and analytics
• Sophisticated instruction scheduling and branch prediction for unoptimized applications and interpretive languages
• Shorter Pipelines with reduced disruption
• Improved Application Performance for Modern Codes
• Higher Performance and Pipeline Utilization
• Removed instruction grouping
• Enhanced instruction fusion
• Pipeline can complete up to 128 (64-SMT4) instructions/cycle
• Reduced Latency and Improved Scalability
• Improved pipe control of load/store instructions
• Improved hazard avoidance
FORMAT OF TODAY'S DISCUSSION
Brief presentation on optimization strategies
Followed by hands-on exercises
Initial steps:
>ssh -l student<n> orthus.nic.uoregon.edu
>ssh gorgon
Once you have a home directory, make a directory with your name within /home/student<n>
>mkdir /home/student<n>/<yourname>
Copy the following files into it:
> cp -rf /home/users/gansys/archana/Handson .
You will see the following directories within Handson/
Task1/
Task2/
Task3/
Task4/
During the course of the presentation we will discuss the exercises inline, and you can try them on the machine
PERFORMANCE TUNING IN THE FRONT-END
• The front end fetches and decodes successive instructions and passes them to the back end for processing
• POWER9 is a superscalar, pipelined processor, so it relies on an advanced branch predictor to predict the instruction stream and fetch instructions in advance
• Branches include call branches and loop branches
• Typically we use the following strategies to work around bottlenecks seen around branches (a sketch follows this list):
• Unrolling and inlining via pragmas/attributes, or manually in source if the compiler does not do it automatically
• Converting control dependence to data dependence using ?: and compiling with -misel for difficult-to-predict branches
• Dropping hints with __builtin_expect(var, value) to simplify the compiler's scheduling
• Indirect call promotion to enable more inlining
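As an illustration, here is a minimal C sketch of these source-level hints; the function and data are hypothetical, not taken from the hands-on sources:

/* Hypothetical example of source-level branch hints for the front end. */
#include <stddef.h>

/* Hints to GCC about which way a branch usually goes. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

double sum_positive(const double *a, size_t n)
{
    double sum = 0.0;
    if (unlikely(a == NULL))          /* rare error path, kept off the hot path */
        return 0.0;
    /* Ask GCC (8+) to unroll this loop; XL would use #pragma unroll(4). */
    #pragma GCC unroll 4
    for (size_t i = 0; i < n; i++) {
        /* ?: turns a hard-to-predict branch into a data dependence; with
           -misel the compiler can emit an isel instruction instead of a branch. */
        sum += (a[i] > 0.0) ? a[i] : 0.0;
    }
    return sum;
}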
PERFORMANCE TUNING IN THE BACK-END
• The back end executes the instructions that were fetched and dispatched to the appropriate units
• The compiler's scheduling pass automatically tries to keep dependent instructions far away from each other
• Tuning back-end performance involves optimal usage of processor resources. We can tune the performance using the following (a sketch follows this list):
• Registers: using instructions that reduce register usage, vectorization to reduce pressure on GPRs and ensure more throughput, and making loops as free of pointers and branches as possible to enable more vectorization
• Caches: data-layout optimizations that reduce the footprint, using -fshort-enums, and prefetching (hardware and software)
• System tuning: parallelization, binding, large pages, optimized libraries
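For the register/vectorization point, a minimal sketch (not from the Jacobi source) of a loop written to be easy to vectorize; restrict tells the compiler the pointers do not alias, which is often what unlocks auto-vectorization on the VSX unit:

/* Hypothetical vectorization-friendly loop for the back end. */
#include <stddef.h>

/* restrict promises that dst, a and b never overlap, removing the aliasing
   hazard that would otherwise block vectorization of this loop. */
void axpy(double *restrict dst, const double *restrict a,
          const double *restrict b, double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = alpha * a[i] + b[i];   /* simple, branch-free body */
}

/* A compile such as gcc -O3 -mcpu=power9 -fopt-info-vec will report whether
   the loop was vectorized (these flags are suggestions, not taken from the deck). */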
STRUCTURE OF THE HANDS-ON EXERCISES
• All the hands-on exercises work on the Jacobi application
• The application has two versions: poisson2d_reference (referred to as poisson2d_serial in Task4) and poisson2d
• In order to showcase the impact of an optimization, poisson2d is optimized while poisson2d_reference is kept at a minimally optimized baseline, and the performance of the two routines is compared
• The application internally measures the time and prints the speedup
• The higher the speedup, the greater the impact of the optimization in focus
• For the hands-on we work with the GCC (9.2.0) and PGI (19.10) compilers
• Solutions are provided in the Solutions/ folder within each of the Task directories
TASK1: BASIC COMPILER FLAGS
• Here poisson2d_reference.c is optimized at the O3 level
• The user needs to optimize poisson2d.c at the Ofast level
• Build and run the application poisson2d
• What is the speedup you observe, and why?
• You can generate a perf profile using perf record -e cycles ./poisson2d
• Running perf report will show you the top routines, and you can compare the performance of poisson2d_reference and poisson2d to get an idea
TASK2: SW PREFETCHING
• Now that we have seen that Ofast improves performance beyond O3, let's optimize poisson2d_reference at Ofast as well and see if we can improve poisson2d further
• The user needs to optimize poisson2d with the software-prefetching flag
• Build and run the application
• What is the speedup you observe?
• Verify whether software prefetch instructions have been added
• Grep for dcbt in the objdump output
TASK3: OPENMP PARALLELIZATION
• The Jacobi application is highly parallel
• We can parallelize it using OpenMP pragmas and measure the speedup
• The source file has the OpenMP pragmas in comments
• Uncomment them, build with the OpenMP option -fopenmp, and link with -lgomp
• Run with multiple threads and note the speedup (a sketch of the pragma follows this list):
• OMP_NUM_THREADS=4 ./poisson2d
• OMP_NUM_THREADS=16 ./poisson2d
• OMP_NUM_THREADS=32 ./poisson2d
• OMP_NUM_THREADS=64 ./poisson2d
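For reference, a minimal sketch of what such an OpenMP pragma looks like on a Jacobi-style update loop; the function below is illustrative, not copied from poisson2d.c:

/* Hypothetical Jacobi sweep parallelized with OpenMP; build with -fopenmp. */
#include <math.h>

void jacobi_sweep(int nx, int ny, const double *restrict A,
                  double *restrict Anew, double *restrict err)
{
    double error = 0.0;
    /* Distribute rows across threads; each thread keeps a private max error. */
    #pragma omp parallel for reduction(max:error)
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            Anew[j*nx + i] = 0.25 * (A[j*nx + i+1] + A[j*nx + i-1]
                                   + A[(j+1)*nx + i] + A[(j-1)*nx + i]);
            error = fmax(error, fabs(Anew[j*nx + i] - A[j*nx + i]));
        }
    }
    *err = error;
}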
TASK3.1: OPENMP PARALLELIZATION
• Running lscpu you will see Thread(s) per core: 4
• The system is set to SMT=4; you can verify this by running ppc64_cpu --smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
• Obtain the thread sibling list of CPU0, CPU1, etc. by reading /sys/devices/system/cpu/cpu0/topology/thread_siblings_list (e.g., 0-3)
• Referring to the sibling list, pick four hardware threads in the same core and run, for example:
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Then pick four threads in different cores and run, for example:
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}" OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Compare the speedups; which one is higher? (A small verification sketch follows.)
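To double-check the binding from inside a program, here is a small hypothetical test (not part of the provided sources) using the standard OpenMP places API:

/* Hypothetical binding check; build with gcc -fopenmp check_places.c */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Reports which OMP_PLACES entry each thread was bound to
           (returns -1 if no place list is in effect). */
        printf("thread %d is bound to place %d\n",
               omp_get_thread_num(), omp_get_place_num());
    }
    return 0;
}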
TASK4: ACCELERATE USING GPUS
• You can attempt this after the lecture on GPUs
• The Jacobi application contains a large set of parallelizable loops
• poisson2d.c contains commented-out OpenACC pragmas, which should be uncommented, built with the appropriate flags, and run on an accelerated platform
• #pragma acc parallel loop
• In case you want to refer to the solution, see poisson2d.solution.c
• You can compare the speedup by running poisson2d without the pragmas and then running poisson2d.solution
• For more information you can refer to the Makefile (a sketch of the pragma usage follows)
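As a rough illustration of what uncommenting those pragmas amounts to on a Jacobi-style loop nest (the code below is illustrative, not copied from poisson2d.c):

/* Hypothetical Jacobi sweep offloaded with OpenACC (PGI: -acc -ta=tesla). */
void jacobi_sweep_acc(int nx, int ny, const double *restrict A,
                      double *restrict Anew)
{
    /* Each iteration of the nest is independent, so the whole nest can be
       mapped onto the GPU; collapse(2) exposes both loops to the compiler. */
    #pragma acc parallel loop collapse(2)
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            Anew[j*nx + i] = 0.25 * (A[j*nx + i+1] + A[j*nx + i-1]
                                   + A[(j+1)*nx + i] + A[(j-1)*nx + i]);
        }
    }
}

With the -ta=tesla:cc70,managed flags shown in the Makefile output below, data movement is handled by managed memory, so no explicit data clauses are needed in this sketch.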
TASK1: BASIC COMPILER FLAGS - SOLUTION
– This hands-on exercise illustrates the impact of the Ofast flag
– Ofast enables the -ffast-math option, which allows math functions to be implemented without the guarantees of the IEEE/ISO rules and specifications, avoiding the overhead of calling a function from the math library
– If you look at the perf profile, you will observe that poisson2d_reference makes a call to fmax
– Whereas poisson2d.c::main() of poisson2d, being optimized at Ofast, generates native instructions such as xvmax (see the sketch below)
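The pattern in question is the error reduction in the Jacobi sweep; a reduced, hypothetical example (not the actual poisson2d.c source) showing the fmax call that the compiler treats differently at O3 versus Ofast:

/* Hypothetical reduction: at -O3 this stays a call to fmax from libm;
   at -Ofast (-ffast-math) GCC can turn it into VSX max instructions
   such as xvmaxdp, removing the call overhead from the hot loop. */
#include <math.h>

double max_abs_diff(const double *a, const double *b, int n)
{
    double error = 0.0;
    for (int i = 0; i < n; i++)
        error = fmax(error, fabs(a[i] - b[i]));
    return error;
}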
TASK2: SW PREFETCHING- SOLUTION
– Compiling with a prefetch flag enables the compiler to analyze the code and insert __dcbt and __dcbtst instructions where it finds them beneficial
– __dcbt and __dcbtst prefetch memory values into L3; __dcbt is for loads and __dcbtst is for stores
– POWER9 has prefetching enabled at both the HW and SW levels
– At the HW level, prefetching is "ON" by default
– At the SW level, you can request the compiler to insert prefetch instructions; however, the compiler can choose to ignore the request if it determines that doing so is not beneficial
– You will find that the compiler generates prefetch instructions when the application is compiled at the Ofast level, but not when it is compiled at the O3 level
– That is because in the O3 binary the time is dominated by the fmax call, which leads the compiler to conclude that whatever benefit we obtain by adding SW prefetch would be overshadowed by the penalty of fmax
– GCC may add further loop optimizations such as unrolling when -fprefetch-loop-arrays is invoked (a manual alternative is sketched below)
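If the compiler declines to insert prefetches, GCC's __builtin_prefetch lets you request them by hand; a hypothetical sketch (the loop and the prefetch distance are illustrative, not tuned for this machine):

/* Hypothetical manual software prefetch with GCC's __builtin_prefetch. */
#include <stddef.h>

#define PF_DIST 64   /* assumed prefetch distance in elements, not tuned */

void scale(double *restrict dst, const double *restrict src,
           double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            __builtin_prefetch(&src[i + PF_DIST], 0, 3);  /* read, high locality */
            __builtin_prefetch(&dst[i + PF_DIST], 1, 3);  /* write, high locality */
        }
        dst[i] = alpha * src[i];
    }
}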
TASK3.1: OPENMP PARALLELIZATION
• Running the OpenMP parallel version, you will see the speedup grow as OMP_NUM_THREADS increases
• [student02@gorgon Task3]$ OMP_NUM_THREADS=1 ./poisson2d
• 1000x1000: Ref: 2.3467 s, This: 2.5508 s, speedup: 0.92
• [student02@gorgon Task3]$ OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 3.65
• [student02@gorgon Task3]$ OMP_NUM_THREADS=16 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 4.18
• Likewise, if you bind the threads across different cores you will see a greater speedup than when they share a core
• [student02@gorgon Task3]$ OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3490 s, This: 1.9622 s, speedup: 1.20
• [student02@gorgon Task3]$ OMP_PLACES="{0},{5},{10},{15}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3694 s, This: 0.6735 s, speedup: 3.52
TASK4: ACCELERATE USING GPUS
• Building and running poisson2d as it is, you will see no speedup
• [student02@gorgon Task4]$ make poisson2d
• /opt/pgi/linuxpower/19.10/bin/pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o poisson2d_serial.o
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.c poisson2d_serial.o -o poisson2d
• [student02@gorgon Task4]$ ./poisson2d
• …
• 2048x2048: 1 CPU: 5.0743 s, 1 GPU: 4.9631 s, speedup: 1.02
• If you build poisson2d.solution, which is the same as poisson2d.c but with the OpenACC pragmas enabled, and run it, the parallel portions are pushed to the GPU and you will see a massive speedup
• [student02@gorgon Task4]$ make poisson2d.solution
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution
• [student02@gorgon Task4]$ ./poisson2d.solution
• 2048x2048: 1 CPU: 5.0941 s, 1 GPU: 0.1811 s, speedup: 28.13
SUMMARY
• Today we talked about tuning strategies pertaining to the various units in the POWER9 HW: the front end and the back end
• Some of these strategies were compiler flags and source-code pragmas that one can apply to see improved performance of their programs
• We also saw additional ways of improving performance, such as parallelization and binding
• Hopefully the associated hands-on exercises gave you a more practical experience in applying these concepts to optimizing an application
IBM Systems / version 1.0 / November, 2019 / © 2018 IBM Corporation
Disclaimer: This presentation is intended to represent the views of the author rather than IBM, and the recommended solutions are not guaranteed under suboptimal conditions
BACKUP
(The backup slides contained figures only; one appears to illustrate a 128-bit quantity as 4 32-bit words, 8 half-words, or 16 bytes.)
Flag reference (XL vs GCC/LLVM options, source-level equivalents, benefits, and drawbacks):

Unrolling
  XL: -qunroll | GCC/LLVM: -funroll-loops | In source: #pragma unroll(N)
  Benefit: unrolls loops; increases scheduling opportunities for the compiler
  Drawback: increases register pressure

Inlining
  XL: -qinline=auto:level=N | GCC/LLVM: -finline-functions | In source: always_inline attribute or manual inlining
  Benefit: increases scheduling opportunities; reduces branches and loads/stores
  Drawback: increases register pressure; increases code size

Enum small
  XL: -qenum=small | GCC/LLVM: -fshort-enums | In source: manual typedef
  Benefit: reduces memory footprint
  Drawback: can cause alignment issues

isel instructions
  GCC/LLVM: -misel | In source: using the ?: operator
  Benefit: generates isel instructions instead of branches; reduces pressure on the branch predictor unit
  Drawback: latency of isel is a bit higher; use it for branches that are not easily predictable

General tuning
  XL: -qarch=pwr9, -qtune=pwr9 | GCC/LLVM: -mcpu=power8, -mtune=power9
  Benefit: turns on platform-specific tuning

64-bit compilation
  XL: -q64 | GCC/LLVM: -m64

Prefetching
  XL: -qprefetch[=aggressive] | GCC/LLVM: -fprefetch-loop-arrays | In source: __dcbt/__dcbtst, __builtin_prefetch
  Benefit: reduces cache misses
  Drawback: can increase memory traffic, particularly if prefetched values are not used

Link time optimization
  XL: -qipo | GCC/LLVM: -flto, -flto=thin
  Benefit: enables interprocedural optimizations
  Drawback: can increase overall compilation time

Profile directed
  GCC/LLVM: -fprofile-generate and -fprofile-use (LLVM has an intermediate step)
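As a small, hypothetical illustration of two of the "simulated in source" entries above (names and data are invented for the example): forcing inlining with GCC's always_inline attribute, and shrinking enum storage by hand instead of relying on -fshort-enums:

/* Hypothetical examples of source-level equivalents from the table above. */
#include <stdint.h>

/* Equivalent of asking for inlining: force it for a small hot helper. */
static inline __attribute__((always_inline))
double sq(double x) { return x * x; }

/* Equivalent of -fshort-enums for one type only: store the enum values in a
   small integer typedef so arrays of it have a reduced memory footprint. */
enum state { STATE_IDLE, STATE_RUN, STATE_DONE };
typedef uint8_t state_t;            /* holds enum state values in one byte */

double norm2(const double *v, int n, const state_t *flags)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (flags[i] == STATE_RUN)  /* one-byte loads reduce cache footprint */
            s += sq(v[i]);
    }
    return s;
}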