How To Parallelise An Application
How to Parallelise a Code
Designing and developing parallel programs has historically been a highly
manual process
◦ the programmer is typically responsible for both identifying and actually implementing
parallelism
◦ a time-consuming, complex, error-prone and iterative process
Tools have been available to assist the programmer with converting serial
programs into parallel programs
◦ e.g. vectorising compilers were successful in the 1980s
◦ so why not do the same for parallelism?
The most common type of tool used to automatically parallelize a serial program
is a parallelizing compiler or pre-processor
How to Parallelise a Code: Methodologies
Parallelising compiler – two flavours:
◦ Fully Automatic
◦ the compiler analyzes the source code and identifies opportunities for parallelism
◦ analysis includes identifying inhibitors to parallelism and possibly a cost weighting of
whether parallelisation would actually benefit performance
◦ the loop-parallel paradigm is the strategy usually adopted
◦ Programmer Directed
◦ using directives or possibly compiler flags
◦ the programmer explicitly tells the compiler how to parallelize the code
◦ can be used in conjunction with some degree of automatic parallelization
How to Parallelise a Code: Dangers
If you start with an existing serial code and have time or budget constraints,
then automatic parallelization may be the answer... but there are dangers:
◦ wrong results may be produced (e.g. optimisation level, or the ordering of a parallel sum; see the sketch after this list)
◦ performance may actually degrade (overhead may outweigh performance gain)
◦ significantly less flexible than manual parallelization
◦ limited to a subset (mostly loops) of code
◦ may actually not parallelize code if the analysis suggests there are inhibitors or the code is too
complex
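As a small illustration of the first danger in the list above, a parallel reduction adds floating-point values in a different order from the serial loop, so the last digits of the result can differ. A minimal OpenMP sketch (the array contents are arbitrary test data):

#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static float x[N];
    float serial = 0.0f, par = 0.0f;

    for (int i = 0; i < N; i++)
        x[i] = 1.0f / (i + 1);           /* arbitrary test data */

    for (int i = 0; i < N; i++)
        serial += x[i];                  /* one fixed summation order */

    #pragma omp parallel for reduction(+:par)
    for (int i = 0; i < N; i++)
        par += x[i];                     /* per-thread partial sums, combined at the end */

    /* mathematically equal, but floating-point addition is not associative,
       so the two printed values may differ in the last digits */
    printf("serial %.7f  parallel %.7f\n", serial, par);
    return 0;
}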
How to Parallelise a code: Schematic
Understand the serial version
Determine whether or not the problem is one that can actually be parallelised!
Example of Parallelisable Problem:
Change the contrast of a bitmap image by calculating the new colour shade of each pixel, and
also determine the minimum and maximum shade
This problem can be solved in parallel. Each pixel's shade is independently
determinable. The calculation of the minimum and maximum shades can also be obtained in
parallel.
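A minimal OpenMP sketch of this idea; the image size NPIX, the pixel array and the adjust() contrast function are hypothetical placeholders:

#include <stdio.h>

#define NPIX 1000000                        /* hypothetical image size          */
static unsigned char pixel[NPIX];           /* hypothetical bitmap shade values */

static unsigned char adjust(unsigned char s)    /* hypothetical contrast change */
{ return (unsigned char)((s * 9) / 10); }

int main(void) {
    int min = 255, max = 0;

    /* each pixel is independent, so the iterations can run in parallel;
       the min and max shades are combined with reduction clauses */
    #pragma omp parallel for reduction(min:min) reduction(max:max)
    for (long i = 0; i < NPIX; i++) {
        pixel[i] = adjust(pixel[i]);
        if (pixel[i] < min) min = pixel[i];
        if (pixel[i] > max) max = pixel[i];
    }
    printf("min shade %d, max shade %d\n", min, max);
    return 0;
}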
Understand The Serial Version:
Non-parallelisable example
Example of a non-Parallelisable Problem:
Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula
F(n) = F(n-1) + F(n-2)
The calculation of F(n) requires F(n-1) and F(n-2) to have been computed first. Each term
depends on its predecessors, so the terms cannot be calculated independently.
Understand The Serial Version:
Where is the work being done?
Identify the program's hotspots:
Know where most of the real work is being done. The majority of scientific and
technical programs usually accomplish most of their work in a few places
Simple timing of code sections, a profiler such as gprof, or a performance
analysis tool can help (a typical gprof session is sketched below)
Focus on parallelizing the hotspots and ignore those sections of the program
that account for little CPU usage
◦ caveat - can be significant depending on performance goal
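For example, a typical gprof session (assuming a GCC toolchain and a program source prog.c) looks roughly like:

gcc -pg -O2 prog.c -o prog            # build with profiling instrumentation
./prog                                # run normally; writes gmon.out
gprof ./prog gmon.out > profile.txt   # flat profile shows where the time goes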
Understand The Serial Version:
Parallel Bottlenecks
Identify bottlenecks in the program
Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred?
◦ for example, I/O is usually something that slows a program down
◦ is it diagnostic?
◦ can it be deferred to a later point of execution?
It may be possible to restructure the program or use a different algorithm to reduce or eliminate
unnecessary slow areas
Identify inhibitors to parallelism
◦ one common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above
◦ more advanced topic - lecture on its own
Investigate other algorithms if possible
◦ this may be the single most important consideration when designing a parallel application.
Decide How To Partition The Problem/Data
Break the problem into discrete pieces or chunks of work that can be distributed
to multiple tasks. This is known as decomposition or partitioning
There are two basic ways to partition computational work among parallel tasks:
◦ domain decomposition
◦ functional decomposition
Domain decomposition
◦ The data associated with a problem is decomposed
◦ Each parallel task then works on a portion of the data
[Figure: the array 5 12 8 32 19 4 65 17 21 is split into three chunks; each task adds 1 to its own chunk, producing 6 13 9 33 20 5 66 18 22]
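A minimal OpenMP sketch of the decomposition in the figure: the loop iterations are divided into contiguous chunks and each thread applies the same +1 operation to its own chunk of the data.

int a[9] = {5, 12, 8, 32, 19, 4, 65, 17, 21};

#pragma omp parallel for schedule(static)   /* contiguous chunk per thread    */
for (int i = 0; i < 9; i++)
    a[i] = a[i] + 1;                        /* same operation, different data */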
Functional Decomposition
Problem is decomposed according to the work (function) that must be done
◦ each task then performs a portion of the overall work
[Figure: the array 5 12 8 32 19 4 65 17 21 is handed to four tasks that compute the Sum (183), Min (4), Max (65) and Mean (20.33) respectively]
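A minimal OpenMP sketch of the decomposition in the figure, with each section performing one of the four functions on the same data:

int a[9] = {5, 12, 8, 32, 19, 4, 65, 17, 21};
int sum = 0, min = a[0], max = a[0];
double mean = 0.0;

#pragma omp parallel sections
{
    #pragma omp section                      /* task 1: sum  */
    { for (int i = 0; i < 9; i++) sum += a[i]; }
    #pragma omp section                      /* task 2: min  */
    { for (int i = 1; i < 9; i++) if (a[i] < min) min = a[i]; }
    #pragma omp section                      /* task 3: max  */
    { for (int i = 1; i < 9; i++) if (a[i] > max) max = a[i]; }
    #pragma omp section                      /* task 4: mean */
    { int s = 0; for (int i = 0; i < 9; i++) s += a[i]; mean = s / 9.0; }
}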
Functional Decomposition Example (2)
An audio signal data set is passed through four distinct computational filters (processes)
The first segment of data must pass through the first filter before progressing to the second.
When it does, the second segment of data passes through the first filter. By the time the fourth
segment of data is in the first filter, all four tasks are busy
◦ pipeline processing
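A rough MPI sketch of such a pipeline, assuming one rank per filter stage; NSEG, SEGLEN and filter() are hypothetical placeholders:

#include <mpi.h>

#define NSEG   8                    /* hypothetical number of audio segments */
#define SEGLEN 1024                 /* hypothetical samples per segment      */

static void filter(int stage, float *seg, int n)    /* hypothetical filter   */
{ for (int i = 0; i < n; i++) seg[i] *= 0.5f; }

int main(int argc, char **argv) {
    int rank, size;
    float seg[SEGLEN] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);          /* one rank per filter stage */

    for (int s = 0; s < NSEG; s++) {
        if (rank > 0)                              /* take the segment from the previous stage */
            MPI_Recv(seg, SEGLEN, MPI_FLOAT, rank - 1, s, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        filter(rank, seg, SEGLEN);                 /* apply this stage's filter */
        if (rank < size - 1)                       /* pass it on to the next stage */
            MPI_Send(seg, SEGLEN, MPI_FLOAT, rank + 1, s, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

Once a few segments are in flight every stage is busy with a different segment, which is the pipeline behaviour described above.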
Modify The Program Statements
•Largely dictated by decisions on the partitioning of data
•Statements could read like ”...if I own the data being assigned, then I execute these
statements...”
if (myprocnum == 0) {      /* only task 0 executes these statements */
    printf(...);
}
if (master) {              /* the master task sends work out...     */
    send(...);
}
else if (worker) {         /* ...while worker tasks receive it      */
    recv(...);
}
•Statements of this kind are termed execution control masks or masks for short
Modify The Program Statements
•The most time consuming and error-prone of the stages in
developing a parallel code
•Need to ensure each data item is assigned only by the tasks allowed to own it
• may require judgement calls to be made
• e.g. whether to replicate work or to communicate information (get access to
data) held by another process?
•Decisions made in the masking stage will have a knock-on effect on
the placement of the communication of data between tasks needed to
ensure consistency across all tasks
• does this still follow the original serial computations?
Communication Statements:
When you need them
The need for data communication is problem dependent
Communication Statements
Example of needing communication
You DO need communications
Most real-world parallel applications are not so simple
They do require tasks to share data with each other
e.g. a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks
that hold neighbouring data; changes to neighbouring data have a direct effect on that task's
data (a sketch of such a neighbour exchange is given after this list)
◦ Communication can have a significant effect on performance – this will be discussed in a later
lecture
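A rough sketch of the neighbour exchange this implies, assuming a simplified 1-D MPI decomposition in which each task holds NLOC rows of NX points plus two halo rows, and up/down are the neighbouring ranks (all of these names are placeholders):

/* u[0] and u[NLOC+1] are halo rows holding copies of the neighbours' boundary rows */
MPI_Sendrecv(u[1],      NX, MPI_DOUBLE, up,   0,    /* send my first real row upwards    */
             u[NLOC+1], NX, MPI_DOUBLE, down, 0,    /* receive the row below from 'down' */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(u[NLOC],   NX, MPI_DOUBLE, down, 1,    /* send my last real row downwards   */
             u[0],      NX, MPI_DOUBLE, up,   1,    /* receive the row above from 'up'   */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Each task then updates its own rows using the freshly received halo values.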
Communication Statements:
Cost of Communication
Cost of communications
• Inter-task communication implies overhead
• it’s NOT free!
• Machine cycles and resources that could be used for computation are instead
used to package and transmit data
• Communications very often need some synchronization between tasks
• can result in tasks spending time "waiting" instead of doing work
• Competing communication traffic can saturate the available network
bandwidth
• further aggravating performance problems
Communication Statements
Latency vs. Bandwidth
◦ Latency is the time it takes to send a minimal (0 byte) message from point A to point
B
◦ units in microseconds (μs) or nanoseconds (ns)
◦ Bandwidth is the amount of data that can be communicated per unit of time
◦ units in Mb/sec or Gb/sec.
◦ Sending many small messages can cause latency to dominate communication
overheads
◦ Often it is more efficient to package small messages into a larger message;
this increases the effective communications bandwidth and also reduces the
overall latency cost
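For illustration, with an assumed latency of 1 μs and an assumed bandwidth of 10 Gb/s (1.25 GB/s): sending 1000 separate 100-byte messages costs roughly 1000 × (1 μs + 0.08 μs) ≈ 1080 μs, while sending the same 100,000 bytes as a single message costs roughly 1 μs + 80 μs ≈ 81 μs, more than ten times less.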
Communication Statements:
Visibility of communications
◦ when using message passing the communications are explicit and visible
◦ under the control of the programmer
◦ applies in MPI case
◦ When using threads the communications occur transparently to the
programmer
◦ the programmer is unlikely to know exactly how the inter-task communications are being
accomplished
◦ applies in OpenMP case
Communication Statements
Synchronous
◦ Synchronous communications are often referred to as blocking
communications since other work must wait until the communications have
completed
Communication Statements
Asynchronous
◦ Asynchronous communications are often referred to as non-blocking
communications
◦ other work can be done while the communications are taking place
◦ Asynchronous communications allow tasks to transfer data independently
from one another
◦ For example, task 1 can prepare and send a message to task 2, and then immediately begin
doing other work. When task 2 actually receives the data doesn't matter to task 1
◦ Interleaving computation with communication is the single greatest benefit
for using asynchronous communications
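A minimal MPI sketch of this pattern; buf, COUNT, the destination rank and do_other_work() are placeholders:

MPI_Request req;

/* task 1: start the send, then carry on computing while the transfer progresses */
MPI_Isend(buf, COUNT, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &req);
do_other_work();                        /* computation overlapped with communication        */
MPI_Wait(&req, MPI_STATUS_IGNORE);      /* ensure the send has completed before reusing buf */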
Communication Statements:
Point to Point
◦ Knowing which tasks must communicate with each other is critical during the design stage of
a parallel code
◦ Point-to-point communication involves two tasks with one task acting as the
sender/producer of data, and the other acting as the receiver/consumer
[Figure: Process A on Processor 1 and Process B on Processor 2 exchange DATA across the NETWORK, each posting matching application SEND and RECV calls]
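A minimal MPI sketch of such an exchange, with rank 0 as the sender/producer and rank 1 as the receiver/consumer (data and COUNT are placeholders):

if (rank == 0)        /* sender / producer   */
    MPI_Send(data, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)   /* receiver / consumer */
    MPI_Recv(data, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);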
Communication Statements:
Collective
◦ Collective communication involves data sharing between more than two tasks
◦ often specified as being members of a common group or collective
[Figure: Process A on Processor 1 calls BCAST and its DATA is delivered to the processes on Processors 2, 3 and 4, which each make a matching BCAST call]
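A minimal MPI sketch of the broadcast in the figure, with rank 0 playing the role of Process A (data and COUNT are placeholders):

/* every task in the communicator makes the same call;
   afterwards all tasks hold rank 0's copy of data     */
MPI_Bcast(data, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);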
Communication Statements:
Efficiency
Many factors can affect communications performance
◦ which implementation for a given model should be used?
◦ for example, using MPI message passing one implementation may be faster on a given hardware
platform than another
◦ what type of communication operations should be used?
◦ as mentioned previously, asynchronous communication operations may improve overall program
performance
◦ but are they more difficult to implement?
◦ some platforms may offer more than one network for communications
◦ which one is best?
Compile and Run the Parallel Version
Compiling
• This will usually be done as a command line instruction
• be prepared for syntax and other errors
• When compilation completes with no errors and any appropriate libraries
have been linked, an executable should be produced
Running
• Usually a command line instruction
• will probably require some environment variables to also be set
• If it does not work as you expect (highly likely) then you will need to debug
• the fun part !
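For example, a typical build-and-run sequence (assuming the usual mpicc/mpirun wrappers for MPI, or GCC's OpenMP support) looks roughly like:

mpicc -O2 prog.c -o prog            # compile and link an MPI code
mpirun -np 4 ./prog                 # run it on 4 processes
gcc -fopenmp -O2 prog.c -o prog     # compile an OpenMP code
OMP_NUM_THREADS=4 ./prog            # run it with 4 threads (environment variable)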