

Introduction to Parallel Computing

George Karypis

Principles of Parallel Algorithm Design
Outline
• Overview of some Serial Algorithms
• Parallel Algorithm vs. Parallel Formulation
• Elements of a Parallel Algorithm/Formulation
• Common Decomposition Methods
  - the concurrency extractor!
• Common Mapping Methods
  - the parallel overhead reducer!
Some Serial Algorithms
Working Examples
• Dense Matrix-Matrix & Matrix-Vector Multiplication
• Sparse Matrix-Vector Multiplication
• Gaussian Elimination
• Floyd's All-Pairs Shortest Path
• Quicksort
• Minimum/Maximum Finding
• Heuristic Search: the 15-puzzle problem
Dense Matrix-Vector Multiplication
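The slide's figure does not survive in this text; as a reference point, a minimal serial sketch of the computation (all names illustrative, not from the slides):

    /* y = A*x for an n-by-n dense matrix A in row-major order. */
    void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++) {      /* one dot product per entry of y */
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }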
Dense Matrix-Matrix Multiplication
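Likewise, a minimal serial sketch of the matrix-matrix product, the classic O(n^3) triple loop (names illustrative):

    /* C = A*B for n-by-n row-major matrices. */
    void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }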
Sparse Matrix-Vector Multiplication
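A sketch of the sparse case, assuming the common CSR (compressed sparse row) storage; the slide does not specify a format:

    /* y = A*x with A in CSR form: rowptr has n+1 entries; colind/val
       hold the column index and value of each nonzero. */
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }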
Gaussian Elimination
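A sketch of the elimination phase (no pivoting, for brevity; production codes pivot for numerical stability):

    /* Reduce the n-by-n matrix A to upper triangular form in place. */
    void gaussian_eliminate(int n, double *A) {
        for (int k = 0; k < n; k++)                /* k-th elimination step */
            for (int i = k + 1; i < n; i++) {      /* update rows below k   */
                double m = A[i * n + k] / A[k * n + k];
                for (int j = k; j < n; j++)
                    A[i * n + j] -= m * A[k * n + j];
            }
    }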
Floyd’s All-Pairs Shortest Path
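A sketch of Floyd's algorithm on a matrix of path lengths:

    /* d[i*n+j] starts as the weight of edge (i,j) (a large value if absent)
       and ends as the shortest-path length from i to j. */
    void floyd(int n, double *d) {
        for (int k = 0; k < n; k++)      /* allow k as an intermediate vertex */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (d[i * n + k] + d[k * n + j] < d[i * n + j])
                        d[i * n + j] = d[i * n + k] + d[k * n + j];
    }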
Quicksort
Minimum Finding
15-Puzzle Problem
Parallel Algorithm vs. Parallel Formulation
• Parallel Formulation
  - refers to a parallelization of a serial algorithm
• Parallel Algorithm
  - may represent an entirely different algorithm than the one used serially

• We primarily focus on "Parallel Formulations"
  - Our goal today is primarily to discuss how to develop such parallel formulations.
  - Of course, there will always be examples of "parallel algorithms" that were not derived from serial algorithms.
Elements of a Parallel Algorithm/Formulation
• Pieces of work that can be done concurrently
  - tasks
• Mapping of the tasks onto multiple processors
  - processes vs. processors
• Distribution of input/output & intermediate data across the different processors
• Management of access to shared data
  - either input or intermediate
• Synchronization of the processors at various points of the parallel execution

Holy Grail:
Maximize concurrency and reduce the overheads due to parallelization!
Maximize potential speedup!
Finding Concurrent Pieces of Work
• Decomposition:
  - the process of dividing the computation into smaller pieces of work, i.e., tasks
• Tasks are programmer-defined and are considered to be indivisible.
Example: Dense Matrix-Vector Multiplication
• Tasks can be of different sizes.
  - the granularity of a task (see the sketch below)
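One natural decomposition, sketched under the assumption that each task computes a contiguous block of rows of y (the block bounds are illustrative parameters):

    /* A single task in a row-block decomposition of y = A*x. Tasks are
       independent: each writes a disjoint slice of y. */
    void matvec_task(int n, int row_lo, int row_hi,
                     const double *A, const double *x, double *y) {
        for (int i = row_lo; i < row_hi; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }

Making row_hi - row_lo large yields a few coarse-grained tasks (low overhead, little concurrency); making it small yields many fine-grained tasks, with the reverse trade-off.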
Example: Query Processing

Query:
Example: Query Processing
• Finding concurrent tasks…
Task-Dependency Graph
• In most cases, there are dependencies between the different tasks
  - certain task(s) can only start once some other task(s) have finished
    • e.g., producer-consumer relationships
• These dependencies are represented using a DAG called a task-dependency graph.
Task-Dependency Graph (cont)
• Key concepts derived from the task-dependency graph:
  - Degree of Concurrency
    • the number of tasks that can be executed concurrently
    • we usually care about the average degree of concurrency
  - Critical Path
    • the longest vertex-weighted path in the graph
    • the weights represent task size
• Task granularity affects both of the above characteristics.
Task-Interaction Graph
• Captures the pattern of interaction between tasks
  - this graph usually contains the task-dependency graph as a subgraph
    • i.e., there may be interactions between tasks even if there are no dependencies between them; these interactions usually occur due to accesses to shared data
Task Dependency/Interaction Graphs
• These graphs are important in developing effective mappings of the tasks onto the different processors
  - maximize concurrency and minimize overheads

• More on this later…


Common Decomposition Methods
• Data Decomposition
• Recursive Decomposition
• Exploratory Decomposition
• Speculative Decomposition
• Hybrid Decomposition
(all are task decomposition methods)
Recursive Decomposition
• Suitable for problems that can be solved using the divide-and-conquer paradigm
• Each of the subproblems generated by the divide step becomes a task
Example: Quicksort
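A serial sketch (simple last-element pivot, purely illustrative): each recursive call operates on a disjoint subarray, so the two calls generated by the divide step are the concurrent tasks.

    void quicksort(double *a, int lo, int hi) {
        if (lo >= hi) return;
        double pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++)        /* partition around the pivot */
            if (a[j] < pivot) {
                double t = a[i]; a[i] = a[j]; a[j] = t; i++;
            }
        double t = a[i]; a[i] = a[hi]; a[hi] = t;
        quicksort(a, lo, i - 1);             /* these two subproblems are */
        quicksort(a, i + 1, hi);             /* independent tasks         */
    }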
Example: Finding the Minimum
• Note that we can obtain divide-and-conquer algorithms for problems that are traditionally solved using non-divide-and-conquer approaches.
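A sketch of such a divide-and-conquer formulation of min-finding, in contrast to the usual left-to-right scan; the two halves are independent tasks:

    double find_min(const double *a, int lo, int hi) {
        if (lo == hi) return a[lo];
        int mid = (lo + hi) / 2;
        double left  = find_min(a, lo, mid);      /* independent subproblem */
        double right = find_min(a, mid + 1, hi);  /* independent subproblem */
        return left < right ? left : right;
    }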
Recursive Decomposition
• How good are the decompositions that it produces?
  - average concurrency?
  - critical path?
• How do the quicksort and min-finding decompositions measure up?
Data Decomposition
• Used to derive concurrency for problems that operate on large amounts of data
• The idea is to derive the tasks by focusing on the multiplicity of data
• Data decomposition is often performed in two steps:
  - Step 1: Partition the data
  - Step 2: Induce a computational partitioning from the data partitioning
• Which data should we partition?
  - Input/Output/Intermediate?
    • Well… all of the above, leading to different data decomposition methods
• How do we induce a computational partitioning?
  - the owner-computes rule
Example: Matrix-Matrix Multiplication
• Partitioning the output data
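A sketch of the owner-computes rule applied to an output partitioning: assuming C is divided into bs-by-bs blocks (parameters illustrative), the task owning block (bi, bj) performs all computation that writes it.

    /* One task per output block of C = A*B (n divisible by bs, for brevity). */
    void matmul_block_task(int n, int bs, int bi, int bj,
                           const double *A, const double *B, double *C) {
        for (int i = bi * bs; i < (bi + 1) * bs; i++)
            for (int j = bj * bs; j < (bj + 1) * bs; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)  /* full k range: no partial results shared */
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }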
Example: Matrix-Matrix Multiplication
• Partitioning the intermediate data
Data Decomposition
• It is the most widely used decomposition technique
  - after all, parallel processing is often applied to problems that have a lot of data
  - splitting the work based on this data is the natural way to extract a high degree of concurrency
• It is used by itself or in conjunction with other decomposition methods
  - hybrid decomposition
Exploratory Decomposition
• Used to decompose computations that correspond to a search of a space of solutions
Example: 15-puzzle Problem
Exploratory Decomposition
• It is not as general-purpose
• It can result in speedup anomalies
  - engineered slow-down or superlinear speedup
Speculative Decomposition
• Used to extract concurrency in problems in which the next step is one of many possible actions that can only be determined when the current task finishes
• This decomposition assumes a certain outcome of the currently executed task and executes some of the next steps
  - just like speculative execution at the microprocessor level
Example: Discrete Event Simulation
Speculative Execution
• If predictions are wrong…
  - work is wasted
  - work may need to be undone
    • state-restoring overhead
    • memory/computations

• However, it may be the only way to extract concurrency!
Mapping the Tasks
• Why do we care about task mapping?
  - Can I just randomly assign them to the available processors?
• Proper mapping is critical, as it needs to minimize the parallel processing overheads
  - If Tp is the parallel runtime on p processors and Ts is the serial runtime, then the total overhead To is p*Tp - Ts
    • the work done by the parallel system beyond that required by the serial system
  - Overhead sources:
    • load imbalance
    • inter-process communication
      - coordination/synchronization/data-sharing
    (the two can be at odds with each other; remember the holy grail…)
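A quick worked instance of the definition (numbers invented for illustration): with Ts = 100 and Tp = 30 on p = 4 processors, To = 4*30 - 100 = 20 units of extra work, and the speedup is Ts/Tp = 100/30 ≈ 3.3 rather than the ideal 4.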
Why Can Mapping Be Complicated?
• Proper mapping needs to take into account the task-dependency and task-interaction graphs
  - Task-dependency graph:
    • Are the tasks available a priori?
      - static vs. dynamic task generation
    • How about their computational requirements?
      - Are they uniform or non-uniform?
      - Do we know them a priori?
    • How much data is associated with each task?
  - Task-interaction graph:
    • How about the interaction patterns between the tasks?
      - Are they static or dynamic?
      - Do we know them a priori?
      - Are they data-instance dependent?
      - Are they regular or irregular?
      - Are they read-only or read-write?
• Depending on the above characteristics, mapping techniques of different complexity and cost are required
Example: Simple & Complex Task Interaction
Mapping Techniques for Load Balancing
• Be aware…
  - The assignment of tasks whose aggregate computational requirements are the same does not automatically ensure load balance.

Each processor is assigned three tasks, but (a) is better than (b)!
Load Balancing Techniques
• Static
  - The tasks are distributed among the processors prior to the execution
  - Applicable for tasks that are
    • generated statically
    • of known and/or uniform computational requirements
• Dynamic
  - The tasks are distributed among the processors during the execution of the algorithm
    • i.e., tasks & data are migrated
  - Applicable for tasks that are
    • generated dynamically
    • of unknown computational requirements
Static Mapping: Array Distribution
• Suitable for algorithms that
  - use data decomposition, and
  - whose underlying input/output/intermediate data are in the form of arrays
• Block Distribution
• Cyclic Distribution
• Block-Cyclic Distribution
  (each in 1D/2D/3D variants)
• Randomized Block Distributions
(ownership maps for the first three are sketched after this list)
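Minimal ownership maps for a 1D array of n elements on p processors, sketching the first three distributions (names illustrative):

    /* Block: contiguous chunks of ceil(n/p) elements per processor. */
    int block_owner(int i, int n, int p)  { return i / ((n + p - 1) / p); }

    /* Cyclic: element i goes to processor i mod p. */
    int cyclic_owner(int i, int p)        { return i % p; }

    /* Block-cyclic: deal out blocks of b elements round-robin. */
    int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }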
Examples: Block Distributions
Examples: Block Distributions
Example: Block-Cyclic Distributions
• Gaussian Elimination
  - The active portion of the array shrinks as the computations progress
Random Block Distributions
• Sometimes the computations are performed only at certain portions of an array
  - e.g., sparse matrix-matrix multiplication
Random Block Distributions
• Better load balance can be achieved via a random block distribution
Graph Partitioning
• A mapping can be achieved by directly partitioning the task-interaction graph.
  - e.g., finite element mesh-based computations
  - directly partitioning this graph
Example: Sparse Matrix-Vector
• Another instance of graph partitioning
Dynamic Load Balancing Schemes
• There is a huge body of research
• Centralized Schemes
  - A certain processor is responsible for giving out work
    • master-slave paradigm
  - Issue:
    • task granularity
• Distributed Schemes
  - Work can be transferred between any pair of processors.
  - Issues:
    • How do the processors get paired?
    • Who initiates the work transfer? (push vs. pull)
    • How much work is transferred?

Mapping to Minimize Interaction Overheads
• Maximize data locality
• Minimize volume of data exchange
• Minimize frequency of interactions
• Minimize contention and hot spots
• Overlap computation with interactions
• Selective data and computation replication

Achieving the above is usually an interplay of decomposition and mapping and is usually done iteratively.