Principles of Parallel Algorithm Design
Outline
Overview of some Serial Algorithms
Parallel Algorithm vs Parallel Formulation
Elements of a Parallel Algorithm/Formulation
Common Decomposition Methods
concurrency extractor!
Common Mapping Methods
parallel overhead reducer!
Some Serial Algorithms
Working Examples
Dense Matrix-Matrix & Matrix-Vector Multiplication
Sparse Matrix-Vector Multiplication
Gaussian Elimination
Floyd’s All-pairs Shortest Path
Quicksort
Minimum/Maximum Finding
Heuristic Search—15-puzzle problem
Dense Matrix-Vector Multiplication
Dense Matrix-Matrix Multiplication
Sparse Matrix-Vector Multiplication
Gaussian Elimination
Floyd’s All-Pairs Shortest Path
Quicksort
Minimum Finding
15-Puzzle Problem
Parallel Algorithm vs Parallel Formulation
Parallel Formulation
Refers to a parallelization of a serial algorithm.
Parallel Algorithm
May represent an entirely different algorithm than the one used serially.
Holy Grail:
Maximize concurrency and reduce overheads due to parallelization!
Maximize potential speedup!
Finding Concurrent Pieces of Work
Decomposition:
The process of dividing the computation into smaller pieces of work, i.e., tasks
Tasks are programmer-defined and are considered to be indivisible
Example: Dense Matrix-Vector Multiplication
Tasks can be of different sizes.
• granularity of a task (a row-per-task sketch follows below)
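A minimal sketch of the row-per-task decomposition, assuming shared memory and OpenMP (the slides do not fix a programming model); the names n, A, x, y are illustrative:

#include <omp.h>

/* Row-wise decomposition: each row of A is one task that computes a
 * single entry of y = A*x. OpenMP maps the n independent tasks to threads. */
void matvec_rowwise(int n, const double A[n][n], const double x[n], double y[n])
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}

Here every task has the same size (one dot product of length n); coarser tasks could instead own a block of rows.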
Example: Query Processing
Query:
Example: Query Processing
Finding concurrent tasks…
Task-Dependency Graph
In most cases, there are dependencies between the different tasks
certain task(s) can only start once some other task(s) have finished
e.g., producer-consumer relationships
These dependencies are represented using a DAG called the task-dependency graph
Task-Dependency Graph (cont)
Key Concepts Derived from the Task-Dependency Graph
Degree of Concurrency
The number of tasks that can be executed concurrently
we usually care about the average degree of concurrency
Critical Path
The longest vertex-weighted path in the graph
The weights represent task size
Task granularity affects both of the above characteristics (a worked sketch follows below)
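A small worked sketch, assuming a hypothetical four-task DAG in which three equal tasks feed a final combining task (tasks numbered in topological order for brevity); it computes the critical-path length and the average degree of concurrency, i.e., total work divided by critical-path length:

#include <stdio.h>

#define NTASKS 4

int main(void)
{
    /* Hypothetical DAG: tasks 0..2 are independent and all feed task 3. */
    double weight[NTASKS]       = {10, 10, 10, 5};   /* task sizes        */
    int    npred[NTASKS]        = {0, 0, 0, 3};      /* # of predecessors */
    int    pred[NTASKS][NTASKS] = {{0}, {0}, {0}, {0, 1, 2}};

    double finish[NTASKS], total = 0.0, critical = 0.0;
    for (int t = 0; t < NTASKS; t++) {
        double start = 0.0;                 /* earliest start = latest predecessor finish */
        for (int k = 0; k < npred[t]; k++)
            if (finish[pred[t][k]] > start)
                start = finish[pred[t][k]];
        finish[t] = start + weight[t];
        total    += weight[t];
        if (finish[t] > critical)
            critical = finish[t];
    }
    printf("critical path = %.1f, avg. degree of concurrency = %.2f\n",
           critical, total / critical);    /* prints 15.0 and 2.33 for this DAG */
    return 0;
}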
Task-Interaction Graph
Captures the pattern of interaction between tasks
This graph usually contains the task-dependency graph as a subgraph
i.e., there may be interactions between tasks even if there are no dependencies
these interactions usually occur due to accesses to shared data
Task Dependency/Interaction Graphs
These graphs are important for developing an effective mapping of the tasks onto the different processors
Maximize concurrency and minimize overheads
Each processor is assigned three tasks, but (a) is better than (b)!
Load Balancing Techniques
Static
The tasks are distributed among the processors prior to the execution
Applicable for tasks that are
generated statically
of known and/or uniform computational requirements
Dynamic
The tasks are distributed among the processors during the execution of the algorithm
i.e., tasks & data are migrated
Applicable for tasks that are
generated dynamically
of unknown computational requirements
Static Mapping—Array Distribution
Suitable for algorithms that
use data decomposition
have their underlying input/output/intermediate data in the form of arrays
Block Distribution
Cyclic Distribution
1D/2D/3D
Block-Cyclic Distribution
Randomized Block Distributions
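A minimal sketch of the owner computation for a 1-D array of length n under the first three distributions, assuming p processes and, for the block-cyclic case, block size b; all names are illustrative:

/* Owner of array element i under common 1-D static distributions. */
int owner_block(int i, int n, int p)        { return i / ((n + p - 1) / p); }  /* contiguous blocks */
int owner_cyclic(int i, int p)              { return i % p; }                  /* round-robin elements */
int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }            /* round-robin blocks of size b */

The same formulas apply per dimension for 2-D/3-D distributions.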
Examples: Block Distributions
Example: Block-Cyclic Distributions
Gaussian Elimination
The active portion of the array shrinks as the computations progress
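A serial sketch of forward elimination that makes the shrinking active portion explicit; the i % p expression marks the row owner that a 1-D cyclic distribution would use, which is why cyclic (or block-cyclic) mappings balance this computation better than plain block distributions. This is my own illustration, not code from the slides:

void gaussian_elimination(int n, double A[n][n], int p)
{
    for (int k = 0; k < n; k++) {            /* pivot row */
        for (int i = k + 1; i < n; i++) {    /* active rows shrink as k grows */
            int owner = i % p;               /* cyclic owner of row i */
            (void)owner;                     /* a parallel code would test owner == my_rank */
            double factor = A[i][k] / A[k][k];
            for (int j = k; j < n; j++)
                A[i][j] -= factor * A[k][j];
        }
    }
}

With a block distribution of rows, the processes owning the first rows fall idle early; with the cyclic mapping each process keeps roughly (n-k)/p active rows at every step k.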
Random Block Distributions
Sometimes the computations are performed only on certain portions of an array
e.g., sparse matrix-matrix multiplication
Random Block Distributions
Better load balance can be achieved via a random block distribution
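One way to realize a randomized 1-D block distribution is sketched below: shuffle the block indices and deal them round-robin to the p processes, so dense and sparse regions of the array end up spread across all processes. The names and the Fisher-Yates shuffle are my own illustration:

#include <stdlib.h>

void random_block_owners(int nblocks, int p, int owner[nblocks])
{
    int perm[nblocks];
    for (int b = 0; b < nblocks; b++) perm[b] = b;
    for (int b = nblocks - 1; b > 0; b--) {      /* Fisher-Yates shuffle */
        int r = rand() % (b + 1);
        int tmp = perm[b]; perm[b] = perm[r]; perm[r] = tmp;
    }
    for (int b = 0; b < nblocks; b++)
        owner[perm[b]] = b % p;                  /* deal shuffled blocks round-robin */
}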
Graph Partitioning
A mapping can be achieved by directly partitioning the task-interaction graph.
E.g., finite element mesh-based computations
Directly partitioning this graph
Example: Sparse Matrix-Vector Multiplication
Another instance of graph partitioning
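For orientation, a sketch of the quantity such a partitioning tries to minimize: the edge cut of the task-interaction graph, stored here in CSR form (xadj/adjncy), under a given task-to-process assignment part[]. Partitioners such as METIS compute part[] itself; this only evaluates one, and all names are illustrative:

/* Number of interactions that cross process boundaries for a given mapping. */
int edge_cut(int ntasks, const int xadj[], const int adjncy[], const int part[])
{
    int cut = 0;
    for (int v = 0; v < ntasks; v++)
        for (int e = xadj[v]; e < xadj[v + 1]; e++)
            if (part[v] != part[adjncy[e]])
                cut++;                           /* interaction crosses processes */
    return cut / 2;                              /* each undirected edge is counted twice */
}

A good partitioning keeps the parts balanced in task weight while keeping this cut small.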
Dynamic Load Balancing Schemes
There is a huge body of research
Centralized Schemes
A certain processor is responsible for giving out work
master-slave paradigm
Issue:
task granularity
Distributed Schemes
Work can be transferred between any pair of processors.
Issues:
How do the processors get paired?
Who initiates the work transfer? push vs pull
How much work is transferred?
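A minimal shared-memory sketch of a centralized scheme, assuming OpenMP: a shared counter stands in for the master, and the chunk size is the knob behind the task-granularity issue noted above, trading scheduling overhead against load balance. All names (ntasks, do_task, CHUNK) are illustrative:

#include <omp.h>

#define CHUNK 16

void run_tasks(int ntasks, void (*do_task)(int))
{
    int next = 0;                               /* shared work counter ("master") */
    #pragma omp parallel
    {
        for (;;) {
            int first;
            #pragma omp atomic capture
            { first = next; next += CHUNK; }    /* idle thread grabs the next chunk */
            if (first >= ntasks) break;
            int last = first + CHUNK < ntasks ? first + CHUNK : ntasks;
            for (int t = first; t < last; t++)
                do_task(t);
        }
    }
}

Distributed schemes replace the single counter with per-process work queues and pairwise transfers, which raises the pairing, push-vs-pull, and transfer-size questions listed above.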