
CSH3J3

SISTEM PARALEL DAN TERDISTRIBUSI

Chapter 11:
Parallel Programming
Design

1 04/23/2023
Today's Outline

Designing a parallel program

2 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Design Parallel Program
• Understand the problem and program
• Partitioning
• Communication
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• I/O
• Example
3 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
UNDERSTAND THE PROBLEM AND
PROGRAM

4 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Understand the Problem
• Undoubtedly, the first step in developing parallel
software is to understand the problem that you wish
to solve in parallel. If you are starting with a
serial program, this also necessitates understanding
the existing code.
• Before spending time in an attempt to develop a
parallel solution for a problem, determine whether
or not the problem is one that can actually be
parallelized.
5 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Example: An easy-to-parallelize problem
• Calculate the potential energy for each of several
thousand independent conformations of a
molecule. When done, find the minimum energy
conformation.
• This problem is able to be solved in parallel. Each of
the molecular conformations is independently
determinable. The calculation of the minimum
energy conformation is also a parallelizable
problem.
6 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Problem with little-to-no parallelism
• Calculation of the Fibonacci series
(0, 1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the
formula: F(n) = F(n-1) + F(n-2)
• The calculation of the F(n) value uses those of
both F(n-1) and F(n-2), which must be
computed first.

7 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Understand The Program
• Know where most of the real work is being done. The
majority of scientific and technical programs usually
accomplish most of their work in a few places.
• Identify the program's hotspots
• Profilers and performance analysis tools can help here
• Focus on parallelizing the hotspots and ignore those
sections of the program that account for little CPU
usage.

8 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Identify bottlenecks in the program:
• Are there areas that are disproportionately
slow, or cause parallelizable work to halt or be
deferred? For example, I/O is usually
something that slows a program down.
• May be possible to restructure the program or
use a different algorithm to reduce or
eliminate unnecessary slow areas

9 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Cont.
• Identify inhibitors to parallelism. One common class of
inhibitor is data dependence, as demonstrated by the
Fibonacci sequence above.
• Investigate other algorithms if possible. This may be the
single most important consideration when designing a
parallel application.
• Take advantage of optimized third party parallel software
and highly optimized math libraries available from leading
vendors (IBM's ESSL, Intel's MKL, AMD's ACML, etc.)

10 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


PARTITIONING

11 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Partitioning
• One of the first steps in designing a parallel
program is to break the problem into discrete
"chunks" of work that can be distributed to
multiple tasks. This is known as decomposition or
partitioning.
• There are two basic ways to partition
computational work among parallel tasks: domain
decomposition and functional decomposition.

12 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


 Domain Decomposition
• In this type of partitioning, the data associated
with a problem is decomposed. Each parallel
task then works on a portion of the data.

13 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


There are different ways to partition data

14 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
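• As an illustration (a sketch, not part of the original slides; the problem size N and the bound names lo/hi are assumed), a simple block decomposition of a 1-D data set can be derived from each MPI task's rank:

    /* Sketch: block decomposition of N array elements across MPI tasks.
       N, lo, and hi are illustrative names/values. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        const int N = 1000;               /* global problem size (assumed) */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Give each task a contiguous block; spread any remainder over the
           first (N % size) tasks so block sizes differ by at most 1. */
        int base = N / size, rem = N % size;
        int lo = rank * base + (rank < rem ? rank : rem);
        int hi = lo + base + (rank < rem ? 1 : 0);   /* exclusive upper bound */

        printf("task %d works on elements [%d, %d)\n", rank, lo, hi);

        MPI_Finalize();
        return 0;
    }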


Functional Decomposition
• In this approach, the focus is on the
computation that is to be performed rather
than on the data manipulated by the
computation. The problem is decomposed
according to the work that must be done. Each
task then performs a portion of the overall
work

15 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


16 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Example: Ecosystem Modeling
• Each program calculates the population of a
given group, where each group's growth
depends on that of its neighbors. As time
progresses, each process calculates its current
state, then exchanges information with the
neighbor populations. All tasks then progress
to calculate the state at the next time step.

17 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


18 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Example: Signal Processing
• An audio signal data set is passed through four
distinct computational filters. Each filter is a
separate process. The first segment of data
must pass through the first filter before
progressing to the second. When it does, the
second segment of data passes through the
first filter. By the time the fourth segment of
data is in the first filter, all four tasks are busy.
19 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
20 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
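• A minimal sketch of such a pipeline using MPI (an assumption, since the slides show no code; filter(), the segment length, and the tags are illustrative). Each task is one filter stage and passes each data segment on to the next stage:

    /* Sketch: pipeline with one MPI task per filter stage
       (e.g., run with 4 tasks for the 4 filters). */
    #include <mpi.h>

    #define SEG_LEN  1024
    #define NUM_SEGS 16

    static void filter(float *seg, int n, int stage) {
        for (int i = 0; i < n; i++)
            seg[i] *= 0.5f + 0.1f * stage;   /* placeholder filtering work */
    }

    int main(int argc, char **argv) {
        int rank, size;
        float seg[SEG_LEN] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int s = 0; s < NUM_SEGS; s++) {
            if (rank > 0)          /* receive the segment from the previous stage */
                MPI_Recv(seg, SEG_LEN, MPI_FLOAT, rank - 1, s,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            filter(seg, SEG_LEN, rank);
            if (rank < size - 1)   /* pass the segment on to the next stage */
                MPI_Send(seg, SEG_LEN, MPI_FLOAT, rank + 1, s, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }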
Example: Climate Modeling
• Each model component can be thought of as a
separate task. Arrows represent exchanges of
data between components during
computation: the atmosphere model
generates wind velocity data that are used by
the ocean model, the ocean model generates
sea surface temperature data that are used by
the atmosphere model, and so on.
21 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
22 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
COMMUNICATION

23 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Who Needs Communications?
• The need for communications between tasks depends upon
your problem:
• You DON'T need communications
– Some types of problems can be decomposed and executed in parallel
with virtually no need for tasks to share data. For example, imagine an
image processing operation where every pixel in a black and white
image needs to have its color reversed. The image data can easily be
distributed to multiple tasks that then act independently of each other
to do their portion of the work.
– These types of problems are often called embarrassingly
parallel because they are so straightforward. Very little inter-task
communication is required.

24 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
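• A sketch of this embarrassingly parallel case using OpenMP (the image size and pixel values are assumed for illustration; compile with -fopenmp):

    /* Sketch: invert every pixel of a grayscale image.  No iteration needs
       data from any other iteration, so the loop parallelizes with no
       inter-task communication. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        long n = 1920L * 1080L;                        /* assumed image size */
        unsigned char *pixels = malloc(n);
        for (long i = 0; i < n; i++)
            pixels[i] = (unsigned char)(i % 256);      /* assumed pixel data */

        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            pixels[i] = 255 - pixels[i];               /* independent per pixel */

        printf("first pixel after inversion: %d\n", pixels[0]);  /* 255 */
        free(pixels);
        return 0;
    }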


Cont.
• You DO need communications
• Most parallel applications are not quite so
simple, and do require tasks to share data
with each other. For example, a 3-D heat
diffusion problem requires a task to know the
temperatures calculated by the tasks that have
neighboring data. Changes to neighboring
data has a direct effect on that task's data.
25 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
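• A sketch of the neighbor exchange this implies, reduced to 1-D for brevity (the ghost cells, chunk size, and update formula are illustrative assumptions): each task swaps boundary temperatures with its left and right neighbors before updating its own cells.

    /* Sketch: 1-D halo (ghost cell) exchange for one diffusion step. */
    #include <mpi.h>

    #define LOCAL_N 100

    int main(int argc, char **argv) {
        int rank, size;
        double t[LOCAL_N + 2] = {0};       /* t[0], t[LOCAL_N+1] are ghost cells */
        double tnew[LOCAL_N + 2] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Exchange boundary temperatures with both neighbors. */
        MPI_Sendrecv(&t[1], 1, MPI_DOUBLE, left, 0,
                     &t[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&t[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &t[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Update interior cells using the freshly received neighbor values. */
        for (int i = 1; i <= LOCAL_N; i++)
            tnew[i] = 0.5 * t[i] + 0.25 * (t[i - 1] + t[i + 1]);

        MPI_Finalize();
        return 0;
    }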
Factors to consider when designing your program's inter-task communications:

• Cost of communications
• Latency vs. Bandwidth
• Visibility of communications
• Synchronous vs. asynchronous communications
• Scope of communications
• Efficiency of communications
• Overhead and Complexity

26 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Cost of communications
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for
computation are instead used to package and transmit data.
• Communications frequently require some type of
synchronization between tasks, which can result in tasks
spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available
network bandwidth, further aggravating performance
problems

27 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Latency vs. Bandwidth
• Latency is the time it takes to send a minimal (0 byte)
message from point A to point B. Commonly expressed in
microseconds.
• Bandwidth is the amount of data that can be
communicated per unit of time. Commonly expressed in
megabytes/sec or gigabytes/sec.
• Sending many small messages can cause latency to
dominate communication overheads. Often it is more
efficient to package small messages into a larger message,
thus increasing the effective communications bandwidth.

28 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
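• A rough cost model illustrates the point (the latency and bandwidth figures are assumed, not from the slides): with a per-message cost of latency + bytes/bandwidth, one aggregated message is far cheaper than many small ones.

    /* Sketch: simple communication cost model, time = latency + bytes/bandwidth. */
    #include <stdio.h>

    int main(void) {
        double latency   = 2e-6;     /* 2 microseconds per message (assumed) */
        double bandwidth = 10e9;     /* 10 GB/s (assumed)                    */
        long   total     = 8000;     /* 8000 bytes of data to move           */

        double many_small = 1000 * (latency + 8.0 / bandwidth);   /* 1000 x 8 B */
        double one_large  = latency + (double)total / bandwidth;  /* 1 x 8000 B */

        printf("1000 small messages:  %.2f us\n", many_small * 1e6);  /* ~2000.8 us */
        printf("1 aggregated message: %.2f us\n", one_large * 1e6);   /* ~2.8 us    */
        return 0;
    }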


Visibility of communications
• With the Message Passing Model, communications
are explicit and generally quite visible and under
the control of the programmer.
• With the Data Parallel Model, communications
often occur transparently to the programmer,
particularly on distributed memory architectures.
The programmer may not even be able to know
exactly how inter-task communications are being
accomplished.
29 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Synchronous vs. asynchronous communications
• Synchronous communications require some type of "handshaking" between tasks
that are sharing data. This can be explicitly structured in code by the programmer, or
it may happen at a lower level unknown to the programmer.
• Synchronous communications are often referred to as blocking communications
since other work must wait until the communications have completed.
• Asynchronous communications allow tasks to transfer data independently from one
another. For example, task 1 can prepare and send a message to task 2, and then
immediately begin doing other work. When task 2 actually receives the data doesn't
matter.
• Asynchronous communications are often referred to as non-
blocking communications since other work can be done while the communications
are taking place.
• Interleaving computation with communication is the single greatest benefit for using
asynchronous communications.

30 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
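• A minimal MPI sketch of the non-blocking case (buffer size and the "other work" are assumed): task 0 starts an MPI_Isend, keeps computing, and only waits when it needs to reuse the buffer; task 2's counterpart here is a blocking receive.

    /* Sketch: non-blocking (asynchronous) communication lets a task compute
       while the transfer is in flight.  Run with 2 MPI tasks. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 100000

    int main(int argc, char **argv) {
        int rank;
        static double buf[N];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < N; i++) buf[i] = i;
            /* Start the send and return immediately (non-blocking). */
            MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... do other work here that does not touch buf ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete before reusing buf */
        } else if (rank == 1) {
            /* Blocking receive: returns only when the data has arrived. */
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %d values\n", N);
        }

        MPI_Finalize();
        return 0;
    }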


Scope of communications
• Knowing which tasks must communicate with each other is
critical during the design stage of a parallel code. Both of the
two scopings described below can be implemented
synchronously or asynchronously.
• Point-to-point - involves two tasks with one task acting as the
sender/producer of data, and the other acting as the
receiver/consumer.
• Collective - involves data sharing between more than two tasks,
which are often specified as being members in a common
group, or collective. Some common variations (there are more):

31 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
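• A short sketch contrasting the two scopes with MPI (the values exchanged are illustrative): a point-to-point send/receive pair between two tasks, and a collective reduction involving every task in the group.

    /* Sketch: point-to-point vs collective communication.  Run with >= 2 tasks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Point-to-point: exactly two tasks involved. */
        int token = 0;
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* producer */
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* consumer */
        }

        /* Collective: every task in the group contributes to the result. */
        double local = rank + 1.0, total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum over %d tasks = %g\n", size, total);

        MPI_Finalize();
        return 0;
    }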


32 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Efficiency of communications
• Very often, the programmer will have a choice with regard to factors
that can affect communications performance. Only a few are
mentioned here.
• Which implementation for a given model should be used? Using the
Message Passing Model as an example, one MPI implementation may
be faster on a given hardware platform than another.
• What type of communication operations should be used? As
mentioned previously, asynchronous communication operations can
improve overall program performance.
• Network media - some platforms may offer more than one network
for communications. Which one is best?

33 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Overhead and Complexity

34 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


SYNCHRONIZATION

35 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Synchronization
• Managing the sequence of work and the tasks
performing it is a critical design consideration
for most parallel programs.
• Can be a significant factor in program
performance (or lack of it)
• Often requires "serialization" of segments of
the program.

36 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Types of Synchronization
• Barrier
• Lock / semaphore
• Synchronous communication operations

37 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the
barrier. It then stops, or "blocks".
• When the last task reaches the barrier, all tasks are
synchronized.
• What happens from here varies. Often, a serial
section of work must be done. In other cases, the
tasks are automatically released to continue their
work.
38 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
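• A minimal sketch with OpenMP threads standing in for tasks (the per-thread work is an assumed placeholder): no thread passes the barrier until the last one arrives, after which a single thread can do the serial section. Compile with -fopenmp.

    /* Sketch: barrier synchronization among OpenMP threads. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            /* Each thread does an uneven amount of work before the barrier. */
            double local = 0.0;
            for (int i = 0; i < 100000 * (id + 1); i++)
                local += i;
            printf("thread %d reached the barrier (local=%g)\n", id, local);

            #pragma omp barrier      /* block until the last thread arrives */

            #pragma omp single       /* serial section once all are synchronized */
            printf("all threads synchronized\n");
        }
        return 0;
    }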
Lock / semaphore
• Can involve any number of tasks
• Typically used to serialize (protect) access to global data
or a section of code. Only one task at a time may use
(own) the lock / semaphore / flag.
• The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must
wait until the task that owns the lock releases it.
• Can be blocking or non-blocking

39 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
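• A sketch of a lock protecting global data with OpenMP (the shared counter is an illustrative assumption): only the thread that currently owns the lock may update the data; the others wait until it is released.

    /* Sketch: a lock serializing access to shared (global) data. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long long shared_total = 0;
        omp_lock_t lock;
        omp_init_lock(&lock);

        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            omp_set_lock(&lock);     /* first task to acquire it "sets" the lock */
            shared_total += i;       /* only one task at a time updates the data */
            omp_unset_lock(&lock);   /* release so waiting tasks can proceed */
        }

        omp_destroy_lock(&lock);
        printf("total = %lld\n", shared_total);   /* 100000*99999/2 = 4999950000 */
        return 0;
    }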


Synchronous communication operations
• Involves only those tasks executing a
communication operation
• When a task performs a communication operation,
some form of coordination is required with the
other task(s) participating in the communication.
For example, before a task can perform a send
operation, it must first receive an acknowledgment
from the receiving task that it is OK to send.

40 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


DATA DEPENDENCY

41 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Data Dependencies
• A dependence exists between program
statements when the order of statement
execution affects the results of the program.
• A data dependence results from multiple use of
the same location(s) in storage by different tasks.
• Dependencies are important to parallel
programming because they are one of the
primary inhibitors to parallelism.

42 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Loop carried data dependence
      DO 500 J = MYSTART, MYEND
         A(J) = A(J-1) * 2.0
  500 CONTINUE
• The value of A(J-1) must be computed before the value of A(J), therefore
A(J) exhibits a data dependency on A(J-1). Parallelism is inhibited.
• If Task 2 has A(J) and task 1 has A(J-1), computing the correct value of
A(J) necessitates:
• Distributed memory architecture - task 2 must obtain the value of A(J-1)
from task 1 after task 1 finishes its computation
• Shared memory architecture - task 2 must read A(J-1) after task 1
updates it

43 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
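• A C sketch of the distributed memory case (not the original Fortran; the chunk size and tag are assumed): each task must receive A(J-1) from the previous task before it can compute anything, so the tasks effectively run one after another.

    /* Sketch: the loop carried dependence serializes block-distributed tasks. */
    #include <mpi.h>

    #define CHUNK 100

    int main(int argc, char **argv) {
        int rank, size;
        double a[CHUNK + 1];          /* a[0] holds the neighbor's boundary value */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            a[0] = 1.0;               /* initial value of the recurrence */
        else                          /* wait for A(J-1) from the previous task */
            MPI_Recv(&a[0], 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int j = 1; j <= CHUNK; j++)   /* the loop itself is still serial */
            a[j] = a[j - 1] * 2.0;

        if (rank < size - 1)          /* pass the boundary value on */
            MPI_Send(&a[CHUNK], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }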


Loop independent data dependence
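• For illustration (a sketch with assumed values, using OpenMP sections as the two tasks), consider two tasks that each store a shared variable X and then compute Y from it:

    /* Sketch: two tasks share X with no ordering between them, so the final
       value of Y depends on which task stores X last -- a loop independent
       data dependence (a race).  Compile with -fopenmp. */
    #include <stdio.h>

    int main(void) {
        double X = 0.0, Y = 0.0;
        #pragma omp parallel sections shared(X, Y)
        {
            #pragma omp section
            { X = 2.0; Y = X * X; }       /* task 1 */
            #pragma omp section
            { X = 4.0; Y = X * X * X; }   /* task 2 */
        }
        printf("Y = %f\n", Y);   /* depends on how the two sections interleave */
        return 0;
    }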

• As with the previous example, parallelism is inhibited. The value of Y is dependent on:
– Distributed memory architecture - if or when the value of X is communicated between the tasks.
– Shared memory architecture - which task last stores the value of X.
• Although all data dependencies are important to identify when designing parallel programs, loop carried dependencies are particularly important since loops are possibly the most common target of parallelization efforts.

44 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
 How to Handle Data Dependencies
• Distributed memory architectures -
communicate required data at synchronization
points.
• Shared memory architectures - synchronize
read/write operations between tasks.

45 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


LOAD BALANCING

46 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Load Balancing
• Load balancing refers to the practice of distributing
approximately equal amounts of work among tasks
so that all tasks are kept busy all of the time. It can
be considered a minimization of task idle time.
• Load balancing is important to parallel programs for
performance reasons. For example, if all tasks are
subject to a barrier synchronization point, the
slowest task will determine the overall
performance.
47 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
How to Achieve Load Balance
• Equally partition the work each task receives
• Use dynamic work assignment

48 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Equally partition the work each task receives
• For array/matrix operations where each task performs
similar work, evenly distribute the data set among the
tasks.
• For loop iterations where the work done in each iteration
is similar, evenly distribute the iterations across the tasks.
• If a heterogeneous mix of machines with varying
performance characteristics is being used, be sure to use
some type of performance analysis tool to detect any load
imbalances. Adjust work accordingly.

49 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Use dynamic work assignment
• Certain classes of problems result in load imbalances even if data is
evenly distributed among tasks:
• Sparse arrays - some tasks will have actual data to work on while others have mostly "zeros".
• Adaptive grid methods - some tasks may need to refine their mesh while others don't.
• N-body simulations - where some particles may migrate to/from their original task domain
to another task's; where the particles owned by some tasks require more work than those
owned by other tasks.
• When the amount of work each task will perform is intentionally variable,
or is unable to be predicted, it may be helpful to use a scheduler - task
pool approach. As each task finishes its work, it queues to get a new piece
of work.
• It may become necessary to design an algorithm which detects and
handles load imbalances as they occur dynamically within the code.

50 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
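• A sketch of the scheduler / task pool idea using OpenMP dynamic scheduling as a stand-in (work() and the chunk size are illustrative assumptions): idle threads pull the next chunk of iterations from a shared pool as they finish.

    /* Sketch: dynamic work assignment for iterations of very uneven cost.
       A static split would load-imbalance; schedule(dynamic) hands out
       small chunks to whichever thread becomes free. */
    #include <stdio.h>

    double work(int i) {                 /* placeholder: cost grows with i */
        double x = 0.0;
        for (int k = 0; k < i * 1000; k++) x += k * 1e-9;
        return x;
    }

    int main(void) {
        double total = 0.0;
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < 1000; i++)
            total += work(i);            /* idle threads grab the next chunk of 4 */
        printf("total = %f\n", total);
        return 0;
    }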


GRANULARITY

51 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Granularity
• Computation / Communication Ratio:
• In parallel computing, granularity is a
qualitative measure of the ratio of
computation to communication.
• Periods of computation are typically separated
from periods of communication by
synchronization events.

52 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Fine-grain Parallelism
• Relatively small amounts of computational work are
done between communication events
• Low computation to communication ratio
• Facilitates load balancing
• Implies high communication overhead and less
opportunity for performance enhancement
• If granularity is too fine it is possible that the overhead
required for communications and synchronization
between tasks takes longer than the computation
53 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Coarse-grain Parallelism
• Relatively large amounts of computational
work are done between
communication/synchronization events
• High computation to communication ratio
• Implies more opportunity for performance
increase
• Harder to load balance efficiently

54 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Coarse vs Fine Grain

55 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Which is Best?
• The most efficient granularity is dependent on the
algorithm and the hardware environment in which it
runs.
• In most cases the overhead associated with
communications and synchronization is high relative
to execution speed so it is advantageous to have
coarse granularity.
• Fine-grain parallelism can help reduce overheads
due to load imbalance.
56 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
I/O

57 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


I/O The Bad News
• I/O operations are generally regarded as inhibitors to parallelism.
• I/O operations require orders of magnitude more time than
memory operations.
• Parallel I/O systems may be immature or not available for all
platforms.
• In an environment where all tasks see the same file space, write
operations can result in file overwriting.
• Read operations can be affected by the file server's ability to
handle multiple read requests at the same time.
• I/O that must be conducted over the network (NFS, non-local) can
cause severe bottlenecks and even crash file servers.

58 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


I/O The Good News
• Parallel file systems are available. For example:
• GPFS: General Parallel File System (IBM)
• Lustre: for Linux clusters (Intel)
• OrangeFS: open source parallel file system, follow-on to the
Parallel Virtual File System (PVFS)
• PanFS: Panasas ActiveScale File System for Linux
clusters (Panasas, Inc.)
• And more - see
http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_parallel_file_systems
59 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Cont.
• Rule #1: Reduce overall I/O as much as possible
• If you have access to a parallel file system, use it.
• Writing large chunks of data rather than small chunks is usually
significantly more efficient.
• Fewer, larger files perform better than many small files.
• Confine I/O to specific serial portions of the job, and then use parallel
communications to distribute data to parallel tasks. For example, Task 1
could read an input file and then communicate required data to other
tasks. Likewise, Task 1 could perform the write operation after receiving
required data from all other tasks.
• Aggregate I/O operations across tasks - rather than having many tasks
perform I/O, have a subset of tasks perform it.

60 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
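• A sketch of the "Task 1 reads, then distributes" pattern above (the file name, data size, and layout are assumptions): one task does the I/O and a single collective call hands each task its portion of the data.

    /* Sketch: confine I/O to one task, then distribute with a collective. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000                 /* total number of doubles (assumed) */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;         /* assume size divides N for simplicity */
        double *all  = NULL;
        double *mine = malloc(chunk * sizeof *mine);

        if (rank == 0) {              /* only task 0 touches the file system */
            all = malloc(N * sizeof *all);
            FILE *f = fopen("input.dat", "rb");   /* assumed input file */
            if (f) { fread(all, sizeof *all, N, f); fclose(f); }
        }

        /* One collective call hands each task its portion of the data. */
        MPI_Scatter(all, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        /* ... each task now works on mine[0..chunk-1] ... */

        free(mine); free(all);
        MPI_Finalize();
        return 0;
    }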


EXAMPLE

61 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Parallel Examples: PI Calculation
• The value of PI can be calculated in a number of ways.
Consider the following method of approximating PI:
inscribe a circle in a square, then:
1. Randomly generate points in the square
2. Determine the number of points in the square that are also in the
circle
3. Let r be the number of points in the circle divided by the number
of points in the square
4. PI ~ 4 r
5. Note that the more points generated, the better the
approximation

62 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


63 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
Pseudo code

64 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Design
• Is this problem able to be parallelized?
• How would the problem be partitioned?
• Are communications needed?
• Are there any data dependencies?
• Are there synchronization needs?
• Will load balancing be a concern?

65 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Parallel Solution
• Another problem that's easy to parallelize: all
point calculations are independent; no data
dependencies
• Work can be evenly divided; no load balance
concerns
• No need for communication or
synchronization between tasks

66 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


Strategy
• Divide the loop into equal portions that can be
executed by the pool of tasks
• Each task independently performs its work
• An SPMD model is used
• One task acts as the master to collect results
and compute the value of PI

67 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
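• A minimal SPMD sketch of this strategy in C with MPI (the number of points and the use of rand() are assumptions): each task counts hits in its share of the points, and the master task combines the counts and computes PI.

    /* Sketch: Monte Carlo PI following the strategy above. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NPOINTS 10000000L         /* total points to generate (assumed) */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long my_points = NPOINTS / size;          /* equal portion of the loop */
        long my_hits = 0, total_hits = 0;
        srand(rank + 1);                          /* different stream per task */

        for (long i = 0; i < my_points; i++) {
            double x = (double)rand() / RAND_MAX; /* random point in unit square */
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)             /* inside the quarter circle */
                my_hits++;
        }

        /* Master collects the counts; no other communication is needed. */
        MPI_Reduce(&my_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("PI ~ %f\n", 4.0 * total_hits / (double)(my_points * size));

        MPI_Finalize();
        return 0;
    }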


68 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
69 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi
References
Blaise Barney, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/parallel_comp/

70 04/23/2023 CSH3J3 - Sistem Paralel dan Terdistribusi


THANK YOU
