
CSC580

Parallel Processing
LECTURE 5:
Parallel Computing Design (PART 2)

PREPARED BY: SALIZA RAMLY


Topic Overview
This topic introduces students to:
Algorithms and Concurrency
o Decomposition Techniques
  o Recursive Decomposition
  o Data Decomposition
  o Exploratory Decomposition
  o Speculative Decomposition
o Characteristics of Tasks and Interactions
  o Task Generation, Granularity, and Context
  o Characteristics of Task Interactions



Topic Overview
Concurrency and Mapping
o Mapping Techniques for Load Balancing
  o Static and Dynamic Mapping
o Methods for Minimizing Interaction Overheads
  o Maximizing Data Locality
  o Minimizing Contention and Hot-Spots
  o Overlapping Communication and Computations
  o Replication vs. Communication
  o Group Communications vs. Point-to-Point Communication
o Parallel Algorithm Design Models
  o Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models
Decomposition Techniques (cont.)



Exploratory Decomposition
o In many cases, the decomposition of the problem goes hand-in-hand with its execution.
o These problems typically involve the exploration (search) of a state space of solutions.
o We partition the search space into smaller parts and search each of these parts concurrently until the desired solution is found.
o Problems in this class include a variety of discrete optimization problems (e.g., 0/1 integer programming), theorem proving, game playing, etc.



Exploratory Decomposition: Example
A simple application of exploratory decomposition is the solution of a 15-puzzle (a tile puzzle). We show a sequence of three moves that transforms a given initial state (a) into the desired final state (d).

Of course, the problem of computing the solution is, in general, much more difficult than in this simple example.



Exploratory Decomposition: Example
The state space can be explored by generating various successor states of the current state and viewing them as independent tasks.
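As a rough illustration (not from the lecture; the board encoding, the move generator, and the depth limit below are assumptions made for the sketch), each successor of the initial 15-puzzle state can seed an independent task that explores its own part of the state space:

```python
from concurrent.futures import ProcessPoolExecutor

def successors(state):
    """Generate successor states by sliding a tile into the blank position.

    `state` is a tuple of 16 ints with 0 marking the blank (hypothetical encoding).
    """
    blank = state.index(0)
    row, col = divmod(blank, 4)
    moves = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 4 and 0 <= c < 4:
            swap = r * 4 + c
            nxt = list(state)
            nxt[blank], nxt[swap] = nxt[swap], nxt[blank]
            moves.append(tuple(nxt))
    return moves

GOAL = tuple(range(1, 16)) + (0,)

def search(start, depth_limit=8):
    """Depth-limited DFS over one part of the state space; True if GOAL is reachable."""
    stack = [(start, 0)]
    seen = {start}
    while stack:
        state, depth = stack.pop()
        if state == GOAL:
            return True
        if depth < depth_limit:
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, depth + 1))
    return False

if __name__ == "__main__":
    initial = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 13, 14, 15, 12)
    # Exploratory decomposition: each successor of the initial state is an independent task.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(search, successors(initial)))
    print("solution found:", any(results))
```

The search terminates as soon as any worker's part of the space contains the goal, which is exactly the behaviour that produces the anomalous speedups discussed on the next slide.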



Exploratory Decomposition: Anomalous Computations
o In many instances of exploratory decomposition, the decomposition technique may change the amount of work done by the parallel formulation. This change results in super- or sub-linear speedups.
Example: consider a search space that has been partitioned into four concurrent tasks.
o If the solution lies right at the beginning of the search space, as in (a):
  o the parallel algorithm will find it almost immediately;
  o the serial algorithm would find the solution only after performing work equivalent to searching the entire space corresponding to task 1 and task 2.
o If the solution lies towards the end of task 1, as in (b):
  o the parallel algorithm will perform almost four times the work of the serial algorithm and will yield no speedup.
Speculative Decomposition
o In some applications, dependencies between tasks are not known a priori.
o For such applications, it is impossible to identify independent tasks.
o There are generally two approaches to dealing with such applications:
Conservative approaches
• identify independent tasks only when they are guaranteed not to have dependencies
• may yield little concurrency

Optimistic approaches
• schedule tasks even when they may potentially be erroneous
• may require a roll-back mechanism in the case of an error



Speculative Decomposition: Example
o A classic example of speculative decomposition is in discrete event simulation.
o The central data structure in a discrete event simulation is a time-ordered event list.
o Events are extracted precisely in time order, processed, and, if required, resulting events are inserted back into the event list.
o Consider your day today as a discrete event system: you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep.
o Each of these events may be processed independently; however, in driving to work, you might meet with an unfortunate accident and not get to work at all.
o Therefore, an optimistic scheduling of other events will have to be rolled back.
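A minimal sketch of this optimistic idea (not from the lecture; the event encoding and the dependency test are invented for illustration): the next event is started speculatively on another thread while the current event is still being processed, and its result is discarded (rolled back) if the current event turns out to invalidate it.

```python
from concurrent.futures import ThreadPoolExecutor

def process_event(state, event):
    """Apply one event to (a copy of) the state and return the new state."""
    new_state = dict(state)
    new_state.update(event["effects"])
    return new_state

# Hypothetical events: each declares what it reads and what it changes.
events = [
    {"name": "drive to work", "reads": {"weather"},  "effects": {"location": "work"}},
    {"name": "work",          "reads": {"location"}, "effects": {"tasks_done": 5}},
]
state = {"location": "home", "weather": "clear", "tasks_done": 0}

with ThreadPoolExecutor(max_workers=2) as pool:
    # Speculatively process event 1 against the *current* state while event 0 runs.
    speculative = pool.submit(process_event, state, events[1])
    state = process_event(state, events[0])

    # Validate: the speculation only holds if event 0 wrote nothing that event 1 read.
    if set(events[0]["effects"]) & events[1]["reads"]:
        # Roll back: discard the speculative result and redo event 1 in order.
        state = process_event(state, events[1])
    else:
        # Keep the speculation: copy over only the keys event 1 wrote.
        spec = speculative.result()
        state.update({k: spec[k] for k in events[1]["effects"]})

print(state)
```

In this toy run the "drive to work" event changes the location that "work" reads, so the speculative result is thrown away and the second event is reprocessed, mirroring the roll-back described above.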



Speculative Decomposition: Example
Another example is the simulation of a network of nodes (for instance, an
assembly line or a computer network through which packets pass). The task is to
simulate the behavior of this network for various inputs and node delay
parameters (note that networks may become unstable for certain values of
service rates, queue sizes, etc.).



Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for decomposing a problem. Consider the following examples:
o In quicksort, recursive decomposition alone limits concurrency (Why?). A mix of data and recursive decomposition is more desirable.
o In discrete event simulation, there might be concurrency in task processing. A mix of speculative decomposition and data decomposition may work well.
o Even for simple problems like finding the minimum of a list of numbers, a mix of data and recursive decomposition works well, as sketched below.
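A rough sketch of that last example, assuming nothing beyond the Python standard library: the list is first split into chunks (data decomposition) whose local minima are computed in parallel, and the partial results are then combined pairwise in a tree (recursive decomposition).

```python
from concurrent.futures import ProcessPoolExecutor

def local_min(chunk):
    """Serial minimum of one chunk (the data-decomposed subtask)."""
    return min(chunk)

def tree_reduce(values):
    """Combine partial minima pairwise, mirroring a recursive decomposition."""
    while len(values) > 1:
        values = [min(values[i:i + 2]) for i in range(0, len(values), 2)]
    return values[0]

if __name__ == "__main__":
    data = list(range(100_000, 0, -1))
    n_chunks = 8
    size = (len(data) + n_chunks - 1) // n_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    with ProcessPoolExecutor() as pool:
        partial = list(pool.map(local_min, chunks))

    print(tree_reduce(partial))  # -> 1
```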



Characteristics
of Task and
Interactions



Characteristics of Tasks
Once a problem has been decomposed into independent tasks, the characteristics of these tasks critically impact the choice and performance of parallel algorithms.
Relevant task characteristics include:
o Task generation.
o Task sizes.
o Size of data associated with tasks.



Task Generation
Static task generation
• All the tasks are known before the algorithm starts execution.
• Usually arises in data decomposition (e.g., matrix multiplication) and recursive decomposition (e.g., finding the minimum of a list of numbers).

Dynamic task generation
• Tasks are generated as we perform the computation.
• These applications are typically decomposed using exploratory decomposition (e.g., game playing: each 15-puzzle board is generated from the previous one) or speculative decomposition.


Task Sizes
o Task size: the relative amount of time required to complete the task.
o The complexity of mapping schemes often depends on whether task sizes are:
Uniform
• tasks require roughly the same amount of time.
Non-uniform
• the amount of time required by tasks varies significantly.


Size of Data Associated with Tasks
o Important considerations for mapping:
o Data associated with a task must be available to the process performing that task.
o The size and the location of these data may determine the process that can perform the task without incurring excessive data-movement overhead.
o Different types of data associated with a task may have different sizes:
o The input data may be small but the output large (e.g., the 15-puzzle problem).
o The input data may be large but the output small (e.g., quicksort).



Characteristics of Task Interactions
Tasks may communicate with each other in various ways. The associated dichotomy is:
Static interactions
• The tasks and their interactions are known a priori. These are relatively simpler to code into programs.
• Easier to program.

Dynamic interactions
• The timing or the set of interacting tasks cannot be determined a priori. These interactions are harder to code, especially, as we shall see, using message-passing APIs.
• Harder to program.



Characteristics of Task Interactions
Another way of classifying the interactions is based upon their spatial structure. The interaction pattern may be:
Regular interactions
• There is a definite pattern (in the graph sense) to the interactions. These patterns can be exploited for efficient implementation.
Irregular interactions
• Interactions lack well-defined topologies (no such regular pattern exists).



Characteristics of Task Interactions: Example
A simple example of a regular static interaction pattern is image dithering. The underlying communication pattern is a structured one (a 2-D mesh).



Characteristics of Task Interactions:
Example
The multiplication of a sparse matrix with a vector is a good example of a static
irregular interaction pattern. Here is an example of a sparse matrix and its
associated interaction pattern.



Characteristics of Task Interactions
The type of data sharing may impact the choice of the mapping. Data-sharing interactions can be categorized as:
Read-only interactions
• Tasks just read data items associated with other tasks.
• e.g., matrix multiplication.
Read-write interactions
• Tasks read, as well as modify, data items associated with other tasks.
• e.g., the 15-puzzle problem.

In general, read-write interactions are harder to code.



Characteristics of Task Interactions
The data or work needed by a task or a subset of tasks is explicitly supplied by another task or subset of tasks. Interactions may be:
One-way interactions
• Can be initiated and accomplished by one of the two interacting tasks.
Two-way interactions
• Require participation from both tasks involved in an interaction.



To be continued…



Mapping
Techniques for
Load Balancing



Mapping Techniques
o Once a problem has been decomposed into concurrent tasks, these tasks must be mapped onto processes (that can be executed on a parallel platform).
o In order to achieve a small execution time, the overheads of executing the tasks in parallel must be minimized.
o Two key sources of overhead:
o The time spent in inter-process interaction.
o The time that some processes may spend being idle.
o A good mapping of tasks onto processes must strive to achieve two objectives:
o Reducing the amount of time processes spend interacting with each other.
o Reducing the total amount of time some processes are idle while the others are engaged in performing some tasks.



Mapping Techniques
Mapping must simultaneously minimize idling and balance load. Merely balancing load does not minimize idling.

Two mappings of a 12-task decomposition in which the last four tasks can be started only after the first eight are finished, due to dependencies among tasks.



Mapping Techniques
Factors that determine the choice of mapping technique include the size of data associated with a task, the characteristics of inter-task interactions, and even the parallel programming paradigm.

Static mapping
• Tasks are mapped to processes a priori. For this to work, we must have a good estimate of the size of each task.

Dynamic mapping
• Tasks are mapped to processes at runtime. This may be because the tasks are generated at runtime, or because their sizes are not known.



Schemes for Static Mapping
o Mappings based on data partitioning.
o Mappings based on task-graph partitioning.
o Hybrid mappings.
Mappings Based on Data Partitioning
We can combine data partitioning with the "owner-computes" rule to partition the computation into subtasks. The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes.

Example of 1-D partitioning of an array among eight processes
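As a minimal sketch (the function and parameter names are mine, not from the slides), the index range owned by each process under a 1-D block distribution of n elements over p processes can be computed as follows:

```python
def block_range(n, p, rank):
    """Return the half-open index range [lo, hi) owned by `rank` under a
    1-D block distribution of n elements over p processes.

    The first n % p processes receive one extra element so the load stays balanced.
    """
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

# Example: 1-D partitioning of a 16-element array among eight processes.
for rank in range(8):
    print(rank, block_range(16, 8, rank))   # each process owns 2 consecutive elements
```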



Block Array Distribution Schemes
Block distribution schemes can be generalized to higher dimensions as well.

Example of 2-D distributions of an array on a 4 x 4 process grid and on a 2 x 8 process grid.



Block Array Distribution Schemes: Examples
o For multiplying two dense matrices A and B, we can partition the output matrix C using a block decomposition.
o For load balance, we give each task the same number of elements of C. (Note that each element of C corresponds to a single dot product.)
o The choice of precise decomposition (1-D or 2-D) is determined by the associated communication overhead.
o In general, a higher-dimensional decomposition allows the use of a larger number of processes.
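A rough sketch of the 2-D case under the owner-computes rule, assuming NumPy and a square process grid whose dimension divides the matrix size (the grid shape and block indexing below are illustrative assumptions): the process at grid position (i, j) computes only its own block of C, and for that it needs block row i of A and block column j of B.

```python
import numpy as np

def my_block_of_C(A, B, grid_dim, i, j):
    """Owner-computes sketch: the process at position (i, j) of a
    grid_dim x grid_dim process grid computes its block of C = A @ B."""
    n = A.shape[0]
    b = n // grid_dim                      # block size (assumes grid_dim divides n)
    A_row = A[i * b:(i + 1) * b, :]        # block row of A required by this process
    B_col = B[:, j * b:(j + 1) * b]        # block column of B required by this process
    return A_row @ B_col                   # the (i, j) block of C

n, grid_dim = 8, 2
A = np.arange(n * n, dtype=float).reshape(n, n)
B = np.eye(n)

# Assemble C from the blocks each "process" would compute, and check the result.
C = np.block([[my_block_of_C(A, B, grid_dim, i, j) for j in range(grid_dim)]
              for i in range(grid_dim)])
assert np.allclose(C, A @ B)
```

The input regions touched by `my_block_of_C` are exactly the shaded portions of A and B described on the next slide.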



Data Sharing in Dense Matrix Multiplication
Data sharing needed for matrix multiplication with
(a) 1-D partitioning
(b) 2-D partitioning
of the output matrix.

Shaded portions of the input matrices A and B are required by the process that computes the shaded portion of the output matrix C.



Cyclic and Block Cyclic Distributions
o If the amount of computation associated with data items varies, a block
decomposition may lead to significant load imbalances.
o A simple example of this is in LU decomposition (or Gaussian Elimination) of
dense matrices.



LU Factorization of a Dense Matrix
A decomposition of LU factorization into 14 tasks - notice the significant load
imbalance.



Block Cyclic Distributions
o A variation of the block distribution scheme that can be used to alleviate the load-imbalance and idling problems.
o Partition an array into many more blocks than the number of available processes.
o Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks.
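A minimal sketch of that round-robin assignment (the function and parameter names are mine): block k of the array is simply given to process k mod p, so each process ends up with several non-adjacent blocks.

```python
def block_cyclic_owner(index, block_size, p):
    """Owner of element `index` under a 1-D block-cyclic distribution:
    elements are grouped into blocks of `block_size`, and block k goes to
    process k mod p (round-robin)."""
    return (index // block_size) % p

n, block_size, p = 16, 2, 4
owners = [block_cyclic_owner(i, block_size, p) for i in range(n)]
print(owners)   # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]

# block_size = 1 gives a purely cyclic distribution;
# block_size = n // p degenerates to an ordinary block distribution.
```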



Block-Cyclic Distribution for Gaussian
Elimination
The active part of the matrix in Gaussian Elimination changes. By assigning
blocks in a block-cyclic fashion, each processor receives blocks from different
parts of the matrix.



Block-Cyclic Distribution: Examples
A naive mapping of LU factorization tasks onto processes based on a two-dimensional block distribution.



Block-Cyclic Distribution
o A cyclic distribution is a special case in which the block size is one.
o A block distribution is a special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes.
Example of one- and two-dimensional block-cyclic distributions among four processes:
(a) The rows of the array are grouped into blocks, each consisting of two rows, resulting in eight blocks of rows.
(b) The matrix is blocked into 16 blocks, each of size 4 x 4, and it is mapped onto a 2 x 2 grid of processes in a wraparound fashion.
Graph Partitioning Based Data Decomposition
o In the case of sparse matrices, block decompositions are more complex.
o Consider the problem of multiplying a sparse matrix with a vector.
o The graph of the matrix is a useful indicator of the work (number of nodes) and communication (the degree of each node).
o In this case, we would like to partition the graph so as to assign an equal number of nodes to each process while minimizing the edge count of the graph partition.



Partitioning the Graph of Lake Superior
Example: the simulation of a physical phenomenon, such as the dispersion of a water contaminant in the lake.

Random partitioning: a random distribution of the mesh elements to eight processes.
Partitioning for minimum edge-cut: a distribution of the mesh elements to eight processes using a graph-partitioning algorithm.



Mappings Based on Task Partitioning
o Partitioning a given task-dependency graph across processes.
o Determining an optimal mapping for a general task-dependency graph is an
NP-complete problem.
o Excellent heuristics exist for structured graphs.



Task Partitioning: Mapping a Binary Tree
Dependency Graph
The mapping does not
introduce any further idling
and all tasks that are
permitted to be concurrently
active by the task-dependency
graph are mapped onto
different processes for parallel
execution.



Task Partitioning: Mapping a Sparse
Graph
Sparse graph for computing a sparse matrix-vector product and its mapping.



Hierarchical Mappings
o Sometimes a single mapping technique is inadequate.
o For example, the task mapping of the binary tree (quicksort) cannot use a large
number of processors.
o For this reason, task mapping can be used at the top level and data partitioning
within each level.



An Example
An example of task partitioning at the top level with data partitioning at the lower level.

Hierarchical mapping of a task-dependency graph. Each node represented by an array is a supertask. The partitioning of the arrays represents subtasks, which are mapped onto eight processes.



Schemes for Dynamic Mapping
Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping.

Dynamic mapping schemes can be centralized or distributed.



Centralized Dynamic Mapping
o Processes are designated as masters or slaves.
o When a process runs out of work, it requests more work from the master.
o When the number of processes increases, the master may become the bottleneck.
o To alleviate this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling.
o Selecting large chunk sizes may lead to significant load imbalances as well.
o A number of schemes have been used to gradually decrease chunk size as the computation progresses; a simple sketch follows.
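One such scheme is sketched below using threads in place of processes. The specific chunk-size rule (remaining work divided by twice the number of workers) is a common guided-scheduling heuristic chosen for illustration, not something prescribed by the lecture.

```python
import threading

def run_chunk_scheduled(tasks, n_workers):
    """Master hands out shrinking chunks of task indices; workers request more
    work whenever they run out (centralized dynamic mapping)."""
    lock = threading.Lock()
    next_task = 0
    results = [None] * len(tasks)

    def get_chunk():
        nonlocal next_task
        with lock:                                   # the master's shared counter
            remaining = len(tasks) - next_task
            if remaining == 0:
                return None
            # Guided scheduling: chunk size shrinks as the computation progresses.
            size = max(1, remaining // (2 * n_workers))
            start = next_task
            next_task += size
            return range(start, start + size)

    def worker():
        while True:
            chunk = get_chunk()
            if chunk is None:
                return
            for i in chunk:
                results[i] = tasks[i] ** 2           # stand-in for real work

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_chunk_scheduled(list(range(20)), n_workers=4)[:5])   # [0, 1, 4, 9, 16]
```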



Distributed Dynamic Mapping
o Each process can send work to, or receive work from, other processes.
o This alleviates the bottleneck in centralized schemes.
o There are four critical questions:
o How are sending and receiving processes paired together?
o Who initiates the work transfer?
o How much work is transferred?
o When is a transfer triggered?
o Answers to these questions are generally application specific. We will look at some of these techniques later in this class.



Methods for Minimizing Interaction Overheads
Maximize data locality
• Where possible, reuse intermediate data. Restructure the computation so that data can be reused in smaller time windows.
Minimize the volume of data exchange
• There is a cost associated with each word that is communicated. For this reason, we must minimize the volume of data communicated.
Minimize the frequency of interactions
• There is a startup cost associated with each interaction. Therefore, try to merge multiple interactions into one, where possible.
Minimize contention and hot-spots
• Use decentralized techniques; replicate data where necessary.



Methods for Minimizing Interaction Overheads (continued)
Overlapping computations with interactions
• Use non-blocking communications, multithreading, and prefetching to hide latencies (see the sketch after this list).

Replicating data or computations
• Multiple processes may require frequent read-only access to a shared data structure.

Using optimized collective interaction operations
• Use group communications instead of point-to-point primitives.
• e.g., broadcasting some data to all the processes.

Overlapping interactions with other interactions
• Can reduce the effective volume of communication.
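A minimal sketch of overlapping communication with computation, assuming the mpi4py package and an MPI runtime are available (run with, e.g., `mpiexec -n 4 python overlap.py`; the buffer sizes, ring neighbour pattern, and file name are illustrative): each process starts a non-blocking exchange, does independent work while the messages are in flight, and only then waits for the communication to complete.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Buffers exchanged with the ring neighbours (illustrative sizes).
send_buf = np.full(1000, rank, dtype='d')
recv_buf = np.empty(1000, dtype='d')
right = (rank + 1) % size
left = (rank - 1) % size

# Start the non-blocking exchange...
requests = [comm.Isend(send_buf, dest=right, tag=0),
            comm.Irecv(recv_buf, source=left, tag=0)]

# ...and hide its latency behind computation that does not need the incoming data.
local = np.sqrt(np.arange(1_000_000, dtype='d')).sum()

# Only now wait for the messages, then use the received data.
MPI.Request.Waitall(requests)
print(rank, "local =", local, "received from", int(recv_buf[0]))
```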



Parallel Algorithm Models
An algorithm model is a way of structuring a parallel algorithm by selecting a
decomposition and mapping technique and applying the appropriate strategy to
minimize interactions.

Data-Parallel Model
• Tasks are statically (or semi-statically) mapped to processes and each task performs similar operations on different data.

Task Graph Model
• Starting from a task-dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.
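A tiny sketch of the data-parallel model using only the standard library (the data and worker counts are arbitrary): the same operation is applied to different blocks of the data, with the blocks statically assigned to worker processes.

```python
from concurrent.futures import ProcessPoolExecutor

def normalize(block):
    """Identical operation applied to every block of data (data parallelism)."""
    s = sum(block)
    return [x / s for x in block]

if __name__ == "__main__":
    data = list(range(1, 17))
    blocks = [data[i:i + 4] for i in range(0, len(data), 4)]   # static 1-D block mapping

    with ProcessPoolExecutor(max_workers=4) as pool:
        normalized_blocks = list(pool.map(normalize, blocks))

    print(normalized_blocks[0])
```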



Parallel Algorithm Models (continued)
Pipeline / Producer-Consumer Model
• A stream of data is passed through a succession of processes, each of which performs some task on it.

Master-Slave Model
• One or more processes generate work and allocate it to worker processes. This allocation may be static or dynamic.

Hybrid Models
• A hybrid model may be composed either of multiple models applied hierarchically or of multiple models applied sequentially to different phases of a parallel algorithm.
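A minimal sketch of the master-slave (work-pool) idea with the standard library (the task contents are placeholders): the master pushes work items onto a shared queue and workers repeatedly pull and process them until a sentinel tells them to stop.

```python
import threading
from queue import Queue

NUM_WORKERS = 4
STOP = object()                      # sentinel that tells a worker to exit

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is STOP:
            break
        results.append(item * item)  # stand-in for real work

tasks: Queue = Queue()
results: list = []

workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

# The master generates work and allocates it dynamically via the queue.
for item in range(20):
    tasks.put(item)
for _ in workers:
    tasks.put(STOP)
for w in workers:
    w.join()

print(sorted(results)[:5])           # [0, 1, 4, 9, 16]
```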



NEXT!
LECTURE 6: PARALLEL ALGORITHM APPROACH (PART 1)

