Parallel Algorithms: Peter Harrison and William Knottenbelt

This document outlines the structure and content of a course on parallel algorithms. It comprises 18 lectures covering topics in parallel computing including architectures, performance metrics, dense and sparse matrix algorithms, and message passing. Assessment includes an exam and coursework. It recommends textbooks on parallel computing and provides an outline of lecture topics, including computer architectures, memory organization, interconnection networks, and embeddings of networks into hypercubes.


Parallel Algorithms

Peter Harrison and William Knottenbelt


Email: {pgh,wjk}@doc.ic.ac.uk

Department of Computing, Imperial College London


Course Structure
18 lectures
6 regular tutorials
2 lab-tutorials
1 revision lecture-tutorial (optional)

Course Assessment
Exam (answer 3 out of 4 questions)
One assessed coursework
One laboratory exercise

Recommended Books
Kumar, Grama, Gupta, Karypis. Introduction to Parallel Computing. Benjamin/Cummings. Second Edition, 2002. (The First Edition, 1994, is OK.)
Main course text

Freeman and Phillips. Parallel Numerical Algorithms. Prentice-Hall, 1992.
Main text for the material on differential equations

Other Books
Cosnard, Trystram. Parallel Algorithms and Architectures. International Thomson Computer Press, 1995.
Foster. Designing and Building Parallel Programs. Addison-Wesley, 1994.
Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
An old classic

Course Outline
Topic                                    No. of lectures
Architectures & communication networks   4
Parallel performance metrics             2
Dense matrix algorithms                  4
Message Passing Interface (MPI)          2
Sparse matrix algorithms                 2
Dynamic search algorithms                4
TOTAL                                    18

Computer Architectures
1. Sequential
John von Neumann model: CPU + Memory
Single Instruction stream, Single Data stream (SISD)
Predictable performance of (sequential) algorithms with respect to the von Neumann machine

Computer Architectures
2. Parallel
Multiple cooperating processors, classified by control mechanism, memory organisation, interconnection network (IN)
Performance of a parallel algorithm depends on the target architecture and how it is mapped

Control Mechanisms
Single Instruction stream, Multiple Data stream (SIMD): all processors execute the same instructions synchronously
good for data parallelism
Multiple Instruction stream, Multiple Data stream (MIMD): processors execute their own programs asynchronously
more general: process networks (static), divide-and-conquer algorithms (dynamic)

Control Mechanisms: hybrid

Single Program, Multiple Data stream (SPMD): all processors run the same program asynchronously
Hybrid SIMD / MIMD
also suitable for data-parallelism, but needs explicit synchronisation

Memory Organization
1. Message-passing architecture
Several processors, each with their own (local) memory, interact only by message passing over the IN
Distributed memory architecture
MIMD message-passing architecture: a multicomputer
2. Shared address space architecture
Single address space shared by all processors

Memory Organization (2)


2. Shared address space architecture (cont.)
Multiprocessor architecture
Uniform memory access (UMA): (average) access time is the same for all memory blocks, e.g. single memory bank (or hierarchy)
Otherwise non-uniform memory access (NUMA): e.g. the global address space is distributed across the processors' local memories (distributed shared memory multiprocessor)
Also cache hierarchies imply less uniformity

Interconnection Network
1. Static (or direct) networks
Point-to-point communication amongst processors
Typical in message-passing architectures
Examples are ring, mesh, hypercube
Topology critically affects parallel algorithm performance (see coming lectures)

Interconnection Network (2)


2. Dynamic (or indirect) networks
Connections between processors are constructed dynamically during execution using switches, e.g. crossbars, or networks of these such as multistage banyan (or delta, or omega, or butterfly) networks
Typically used to implement shared address space architectures
But also in some message-passing algorithms, e.g. the FFT on a butterfly (see textbooks)

Parallel Random Access Machine


The PRAM is an idealised model of computation on a shared-memory MIMD computer
Fixed number p of processors
Unbounded UMA global memory
All instructions last one cycle
Synchronous operation (common clock), but different instructions are allowed in different processors on the same cycle

PRAM memory access modes


Four modes of simultaneous memory access (2 types of access, 2 modes):
EREW: Exclusive read, exclusive write. Weakest PRAM model, minimum concurrency.
CREW: Concurrent read, exclusive write. Better.
CRCW: Concurrent read, concurrent write. Maximum concurrency. Can be simulated on an EREW PRAM (exercise).
ERCW: Exclusive read, concurrent write. Unusual?

Concurrent Write Semantics


Arbitration is needed to define a unique semantics of concurrent write in CRCW and ERCW PRAMs
Common: All values to be written are the same
Arbitrary: Pick one writer at random
Priority: All processors have a preassigned priority
Reduce: Write the (generalised) sum of all values attempting to be written. The sum can be any associative and commutative operator, cf. reduce or fold in functional languages.
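
As an illustration, the four arbitration policies can be simulated in a few lines. This sketch is not from the course notes; the function name and the assumption that a lower processor id means higher priority are purely illustrative.

```python
# Sketch: resolve a set of concurrent writes to one PRAM memory cell under the
# four arbitration policies named above.
import random

def resolve_concurrent_write(writes, policy, op=lambda a, b: a + b):
    """writes: list of (processor_id, value) pairs attempting to write one cell."""
    if policy == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common policy requires all written values to agree"
        return values.pop()
    if policy == "arbitrary":
        return random.choice(writes)[1]
    if policy == "priority":
        return min(writes)[1]              # assumption: lower processor id = higher priority
    if policy == "reduce":
        result = writes[0][1]
        for _, v in writes[1:]:
            result = op(result, v)         # op must be associative and commutative
        return result
    raise ValueError(policy)

writes = [(3, 5), (1, 2), (7, 9)]
print(resolve_concurrent_write(writes, "priority"))  # 2
print(resolve_concurrent_write(writes, "reduce"))    # 16
```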

PRAM role
Natural extension of the von Neumann model, with zero-cost communication (via shared memory)
We will use the PRAM to assess the complexity of some parallel algorithms
Gives an upper bound on performance, e.g. minimum achievable latency

Static Interconnection Networks


1. Completely connected
direct link between every pair of processors
ideal performance but complex and expensive
2. Star
all communication through a special central processor
central processor liable to become a bottleneck
logically equivalent to a bus
associated with shared memory machines (dynamic network)

Static Interconnection Networks (2)


3. Linear array and ring
connect processors in tandem; with wrap-around this gives a ring
communication via multiple hops over links through intermediate processors
basis for quantitative analysis of many other common networks

Static Interconnection Networks (3)


4. Mesh
generalisation of the linear array (or, with wrap-around, the ring) to more than one dimension
processors labelled by rectilinear coordinates
links between adjacent processors on each coordinate axis (i.e. in each dimension)
multiple paths between source and destination processors

Static Interconnection Networks (4)


5. Tree
unique path between any pair of processors
processors reside at the leaves of the tree
internal nodes may be processors (typical in static networks) or switches (typical in dynamic networks)
bottlenecks higher up the tree
can alleviate by increasing bandwidth at higher levels: fat tree (e.g. in the CM5)

Cube Networks
In a k-ary d-cube topology of dimension d and radix k, each processor is connected to d others (with wrap-around) and there are k processors along each dimension
Regular d-dimensional mesh with k^d processors
Processors labelled by a d-digit number in radix k
Ring of p processors is a p-ary 1-cube
Wrap-around square mesh of p processors is a √p-ary 2-cube
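
A quick way to see the labelling is to convert a node index into its d radix-k digits and enumerate its wrap-around neighbours one dimension at a time. The sketch below is illustrative only (the helper names are not from the notes); note that for k = 2 the +1 and -1 wrap-around neighbours in a dimension coincide.

```python
# Sketch: radix-k labels and wrap-around neighbours in a k-ary d-cube.

def digits(node, k, d):
    """d-digit radix-k label of a node, least significant digit first."""
    return [(node // k**i) % k for i in range(d)]

def neighbours(node, k, d):
    result = set()
    for i in range(d):                      # one dimension at a time
        digit = (node // k**i) % k
        for delta in (+1, -1):              # wrap-around along this dimension
            new_digit = (digit + delta) % k
            result.add(node + (new_digit - digit) * k**i)
    return sorted(result)

print(digits(23, 4, 3))        # 23 = 1*16 + 1*4 + 3 -> [3, 1, 1]
print(neighbours(0, 2, 3))     # binary 3-cube: [1, 2, 4]
```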

Hypercubes
A k-ary d-cube can be formed from k k-ary (d−1)-cubes by connecting corresponding nodes into rings
e.g. composition of rings to form a wrap-around mesh
Hypercube: a binary (2-ary) d-cube
nodes labelled by binary numbers of d digits
each node connected directly to d others
adjacent nodes differ in exactly one bit

Embeddings into Hypercubes


The hypercube is the most richly connected topology we have considered (apart from completely connected), so can we consider other topologies as embedded subnetworks?
1. Ring of 2^d nodes
Need to find a sequence of adjacent nodes, with wraparound, in a d-hypercube
Adjacent node labels differ in exactly one bit position

Mapping: ring → hypercube

Assign processor i in the ring to node G(i, d) in the hypercube, where G is the binary reflected Gray code (RGC) defined by:
G(0, 1) = 0, G(1, 1) = 1
G(i, n+1) = G(i, n) if i < 2^n
G(i, n+1) = 2^n + G(2^(n+1) − 1 − i, n) if i ≥ 2^n

This is easily seen recursively, by concatenating the mapping for a (d−1)-hypercube with its reverse and prepending (or appending) a 0 onto one copy and a 1 onto the other .....
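
A minimal sketch (not part of the slides) of the recursive RGC definition above, with a check that consecutive ring positions land on adjacent hypercube nodes:

```python
# Sketch: the reflected Gray code mapping G(i, d) and a check that ring
# neighbours (including the wrap-around pair) map to hypercube neighbours.

def gray(i, d):
    """Binary reflected Gray code of i on d bits, following the recursive definition."""
    if d == 1:
        return i                        # G(0,1) = 0, G(1,1) = 1
    half = 1 << (d - 1)                 # 2^(d-1)
    if i < half:
        return gray(i, d - 1)
    return half + gray(2 * half - 1 - i, d - 1)

def is_hypercube_edge(a, b):
    """Adjacent hypercube nodes differ in exactly one bit."""
    return bin(a ^ b).count("1") == 1

d = 4
p = 1 << d
ring = [gray(i, d) for i in range(p)]
assert all(is_hypercube_edge(ring[i], ring[(i + 1) % p]) for i in range(p))
print(ring)  # [0, 1, 3, 2, 6, 7, 5, 4, 12, 13, 15, 14, 10, 11, 9, 8]
```

The closed form of the same code is G(i, d) = i XOR (i >> 1), which the recursion above reproduces.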

Why is this true?


Proof by induction: a sketch (all that is necessary here) is:
1. Certainly true for d = 1, when 0 → 0 and 1 → 1
2. For d ≥ 1, assume successive node addresses in any d-cube ring mapping differ in only one bit
3. Hence the same applies in each half of the RGC for a (d+1)-cube
4. But because of the reflection, the same holds for adjacent nodes in different halves.

Mapping: mesh → hypercube

The mapping for an m-dimensional mesh is obtained by concatenating the RGCs for each individual dimension
Thus node (i1, . . . , im) in a 2^r1 × . . . × 2^rm mesh maps to node G(i1, r1) <> . . . <> G(im, rm), where <> denotes concatenation of bit strings
E.g. in an 8 × 8 square mesh, the node at coordinate (2, 7) maps to hypercube node (0, 1, 1, 1, 0, 0).
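
The per-dimension concatenation is easy to sketch. The snippet below is illustrative (the function name is not from the notes) and uses the closed form G(i, r) = i XOR (i >> 1) of the RGC so that it stands alone:

```python
# Sketch: map a mesh coordinate to a hypercube node label by concatenating
# per-dimension reflected Gray codes.

def gray(i):
    return i ^ (i >> 1)                     # closed form of the RGC

def mesh_to_hypercube(coord, radix_bits):
    """coord = (i1, ..., im); radix_bits = (r1, ..., rm) for a 2^r1 x ... x 2^rm mesh."""
    label = 0
    for i, r in zip(coord, radix_bits):
        label = (label << r) | gray(i)      # concatenate the G(i, r) bit strings
    return label

# The slide's example: node (2, 7) in an 8 x 8 mesh -> 011100 in binary
print(format(mesh_to_hypercube((2, 7), (3, 3)), "06b"))  # '011100'
```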

Mapping: tree → hypercube

Consider a (complete) binary tree of depth d with processors at the leaves only
This embeds into a d-hypercube as follows, via a many-to-one mapping that maps every tree node:
1. map the root (level 0) to any node, e.g. (0, . . . , 0)
2. for each node at level j, if mapped to hypercube node k, map the left child to k and the right child to k with bit j inverted
3. repeat for j = 1, . . . , d

Monotonicity of the mapping


Distance between two tree-nodes is 2n for some n ≥ 1 (n is the difference between d and the level of the lowest common ancestor)
The corresponding distance in the hypercube is ≤ n (think of bit-changes)
Nodes further apart in the hypercube must be further apart in the tree, but the converse may not hold: because of the richer hypercube connectivity some bits might flip back
distant tree-nodes might happen to be closer in the hypercube, e.g. some tree-nodes at the maximum tree distance 2d are adjacent

Communication Costs
Time spent sending data between processors in a parallel algorithm is a significant overhead
communication latency is defined by the switching mechanism and the parameters:
1. Startup time, ts: message preparation, route initialisation etc. Incurred once per message.
2. Per-hop time, or node latency, th: time for the header to pass between directly connected processors. Incurred for every link in a path.
3. Per-word transfer time, tw: tw = 1/r for channel bandwidth r words per second. Relates message length to latency.

Switching Mechanisms
1. Store-and-forward routing
Each intermediate processor on a communication path receives an entire message and only then sends it on to the next node on the path
For a message of size m words, the communication latency on a path of l links is:
tcomm = ts + (m tw + th) l
Typically th is small and so we often approximate
tcomm = ts + m tw l

Switching Mechanisms (2)


2. Cut-through routing
Reduce idle time of resources by pipelining messages along a path in pieces
Messages are advanced to the out-link of a node as they arrive at the in-link
Wormhole routing splits messages into flits (flow-control digits) which are then pipelined

Wormhole Routing
As soon as a flit is completely received, it is sent on to the next node in the message's path (same path for all flits)
No need for buffers for whole messages, unless asynchronous multiple inputs are allowed for the same out-link
Hence more time-efficient and more memory-efficient
But in a bufferless system, messages may become blocked (waiting for a processor already transmitting another message): possible deadlock

Wormhole Routing (2)


On an l-link path, header flit latency = l th
The whole m-word message arrives m tw after the header
For a message of size m words, the communication latency on a path of l links is therefore:
tcomm = ts + m tw + l th
Θ(m + l) for cut-through vs. Θ(ml) for store-and-forward
similar for small l (identical for l = 1)
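
To get a feel for the two formulas, the short sketch below (not from the course; the values of ts, tw and th are purely illustrative) evaluates both latencies over a range of message sizes and path lengths:

```python
# Sketch: compare store-and-forward and cut-through latency.

def t_store_and_forward(m, l, ts=50.0, tw=1.0, th=0.5):
    """tcomm = ts + (m*tw + th) * l"""
    return ts + (m * tw + th) * l

def t_cut_through(m, l, ts=50.0, tw=1.0, th=0.5):
    """tcomm = ts + m*tw + l*th"""
    return ts + m * tw + l * th

for m in (1, 10, 1000):              # message sizes (words)
    for l in (1, 4, 16):             # path lengths (hops)
        print(m, l, t_store_and_forward(m, l), t_cut_through(m, l))
# For large m and l > 1, cut-through is far cheaper; for l = 1 the two coincide.
```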

Communication Operations
Certain types of computation occur in many parallel algorithms
Some are implemented naturally by particular communication patterns
We consider the following patterns of communication, where the dual operations, with the direction of the communication reversed, are shown in brackets . . .

Communication Patterns
simple message transfer between two processors (same for dual)
one-to-all broadcast (single-node accumulation)
all-to-all broadcast (multi-node accumulation)
one-to-all personalised (single-node gather)
all-to-all personalised, or scatter (multi-node gather)
more exotic patterns, e.g. permutations

Simple Message Transfer


Most basic type of communication
Dual operation is of the same type
Latency for a single message is:
Tsmt-sf = ts + tw m l + th l for store-and-forward routing
Tsmt-ct = ts + tw m + th l for cut-through routing
where l is the number of hops . . .

Number of hops, l
This depends on the network topology
l is at most:
⌊p/2⌋ for a ring
2⌊√p/2⌋ for a wrap-around square mesh of p processors (⌊a/2⌋ + ⌊b/2⌋ for an a × b mesh)
log p for a hypercube
So for a hypercube with cut-through, Tsmt-ct-h = ts + tw m + th log p
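
The worst-case hop counts above are easy to tabulate; the sketch below is illustrative only and assumes p is a power of two (and a perfect square for the mesh).

```python
# Sketch: worst-case hop counts for ring, wrap-around square mesh and hypercube.
import math

def max_hops_ring(p):
    return p // 2

def max_hops_mesh(p):
    side = math.isqrt(p)                 # sqrt(p) processors per side
    return 2 * (side // 2)

def max_hops_hypercube(p):
    return int(math.log2(p))

for p in (16, 64, 256):
    print(p, max_hops_ring(p), max_hops_mesh(p), max_hops_hypercube(p))
```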

Comparison of SF and CT
If message size m is very small, latency is similar for SF and CT
If message size is large, i.e. m >> l, CT becomes asymptotically independent of path length l
CT much faster than SF: Tsmt-ct ≈ tw m ≈ single-hop latency under SF

One-to-All Broadcast (OTA)


A single processor sends data to all, or a subset of, the other processors
E.g. matrix-vector multiplication: broadcast each element of the vector over its corresponding column
In the dual operation, single-node accumulation, data may not only be collected but also combined by an associative operator
e.g. sum a list of elements initially distributed over processors
cf. concurrent write in PRAM

All-to-All Broadcast (ATA)


Each processor performs (simultaneously) a one-to-all broadcast with its own data
Used in matrix operations, e.g. matrix multiplication, reduction and parallel prefix
In the dual operation, multi-node accumulation, each processor receives a single-node accumulation
Could implement ATA by sequentially performing p OTAs
Far better to proceed in parallel and concatenate incoming data

Reduction
To broadcast the reduction of the data held in all processors with an associative operator, we can:
1. ATA broadcast the data and then reduce locally in every node . . . inefficient
2. Single-node accumulation at one node followed by OTA broadcast . . . better
3. Modify the ATA broadcast so that instead of concatenating messages, the incoming data and the current accumulated value are combined by the associative operator (e.g. summed), the result overwriting the accumulated value . . . the most efficient
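
On a hypercube, the third scheme amounts to recursive doubling: in step i every node combines with the neighbour differing in bit i. The sketch below is a sequential simulation (not the lecturers' code, and not real message-passing):

```python
# Sketch: all-to-all reduction on a d-hypercube by recursive doubling.
# After d steps every node holds the reduction of all p = 2^d initial values.
from operator import add

def hypercube_allreduce(values, op=add):
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    vals = list(values)
    for bit in range(d):                   # one exchange per hypercube dimension
        nxt = vals[:]
        for node in range(p):
            partner = node ^ (1 << bit)    # neighbour differing in this bit
            nxt[node] = op(vals[node], vals[partner])
        vals = nxt
    return vals

print(hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))  # [36, 36, ..., 36]
```

Each of the d = log p steps does one single-hop exchange and one application of the operator per node, matching the Θ(log p) combining time used later in the notes.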

Parallel Prefix

The parallel prefix of a function f over a non-null list [x1, . . . , xn] is the list of reductions of f over all sublists [x1, . . . , xi] for 1 ≤ i ≤ n, where reduce f [x1] = x1 for all f
Could implement as n reductions
Better to modify the third reduction method by only updating the accumulator at each node when data comes in from the appropriate nodes (otherwise it is just passed on)
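
The corresponding modification only folds incoming data into the result when it comes from the appropriate (here: lower-numbered) nodes. A simulated sketch under the same assumptions as the reduction example (p a power of two, sums as the operator):

```python
# Sketch: prefix sum on a hypercube. Node i ends up holding x_0 + ... + x_i.

def hypercube_prefix_sum(values):
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    result = list(values)          # running prefix at each node
    total = list(values)           # running reduction of each node's subcube
    for bit in range(d):
        incoming = [total[node ^ (1 << bit)] for node in range(p)]
        for node in range(p):
            partner = node ^ (1 << bit)
            total[node] += incoming[node]
            if partner < node:     # only accumulate data from lower-numbered nodes
                result[node] += incoming[node]
    return result

print(hypercube_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```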

All-to-All Personalised
Every processor sends a distinct message of size m to every other processor: total exchange
E.g. in matrix transpose, FFT, database join
Communication patterns identical to ATA
Label messages by pairs (x, y) where x is the source processor and y is the destination processor: this uniquely determines the message contents
A list of n messages is denoted [(x1, y1), . . . , (xn, yn)]

Performance Metrics
1. Run time, Tp
A parallel algorithm is hard to justify without improved run-time
Tp = elapsed time on p processors: from when the first processor starts computing to when the last processor finishes

Performance Metrics (2)


2. Speed-up, Sp
Sp = (serial run-time of the best sequential algorithm) / Tp
the best algorithm is the optimal one for the problem, if known, or the fastest known, if not
often in practice (always in this course) the serial time used is T1
Sp ≥ 1 . . . usually!

Example: addition on a hypercube

Add up p = 2^d numbers on a d-hypercube
Use single-node accumulation
Each single-hop communication is combined with one addition operation, so Tp = Θ(log p)
Sp = Θ(p / log p)

Performance Metrics (2)


3. Efficiency, Ep
Ep = Sp / p
Fraction of time for which a processor is doing useful work
Ep = Θ(1 / log p) in the above example

Performance Metrics (3)


4. Cost, Cp
Cp = p Tp
so that Ep = (best serial run-time) / Cp

A parallel algorithm is cost-optimal if Cp = Θ(best serial run time)
Equivalently, if Ep = Θ(1)
The above example is not cost-optimal since the best serial run time is Θ(p)
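
For concreteness, the metrics for the hypercube-addition example can be computed with illustrative unit costs (one time unit per addition and per single-hop communication); the numbers below are not from the slides.

```python
# Sketch: speed-up, efficiency and cost for hypercube addition of p numbers.
import math

def metrics(p, t_serial, t_parallel):
    speedup = t_serial / t_parallel          # Sp
    efficiency = speedup / p                 # Ep = Sp / p
    cost = p * t_parallel                    # Cp = p * Tp
    return speedup, efficiency, cost

p = 1024                                     # p = 2^d processors
t_serial = p - 1                             # p - 1 additions serially
t_parallel = 2 * math.log2(p)                # log p additions + log p hops
print(metrics(p, t_serial, t_parallel))      # Ep ~ 1/(2 log p) = 0.05: not cost-optimal
```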

Granularity
Amount of work allocated to each processor
Few processors, relatively large processing load on each: coarse-grained parallel algorithm
Many processors, relatively small processing load on each: fine-grained parallel algorithm, e.g. our hypercube algorithm to add up p numbers
typically many small communications, often in parallel

Increasing the granularity


Let each processor simulate k processors of a finer-grained parallel algorithm
Computation at each processor increases by a factor k
Communication time increases by a factor ≤ k (typically << k), but messages may be much larger, e.g. k parallel communications may map to a single communication k times bigger
Hence Tp/k ≤ k Tp and so Cp/k ≤ Cp
Cost-optimality is preserved; may it even be created?

Addition on Hypercube Again


Add n numbers on a d-hypercube of p = 2^d processors
Let each processor simulate k = n/p processes (assuming p | n)

Each processor locally adds k numbers in Θ(k) time
The p partial sums are added in Θ(log p) time
Tp = Θ(k + log p) and Cp = Θ(n + p log p)
Cost-optimal if n = Ω(p log p)
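
A sketch of this coarse-grained scheme: Θ(k) local additions per processor followed by Θ(log p) combining steps over the hypercube dimensions. The function name and the sequential simulation are illustrative, not the course's code.

```python
# Sketch: add n numbers on a p-processor hypercube, p a power of two, p | n.

def add_n_on_hypercube(numbers, p):
    assert len(numbers) % p == 0, "assume p | n"
    k = len(numbers) // p
    # Theta(k) local work: each processor sums its own block of k numbers
    partial = [sum(numbers[i * k:(i + 1) * k]) for i in range(p)]
    d = p.bit_length() - 1
    assert 1 << d == p
    # Theta(log p) combining steps, one per hypercube dimension
    for bit in range(d):
        partial = [partial[node] + partial[node ^ (1 << bit)] for node in range(p)]
    return partial[0]

print(add_n_on_hypercube(list(range(1, 65)), p=8))  # 2080
```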

Addition on Hypercube Again (2)


Alternatively, try communication in the first log p steps, followed by local addition of n/p numbers
Tp = Θ((n/p) log p)
So Cp = Θ(n log p) = log p · Θ(C1)
never cost-optimal

Scalability
Efficiency decreases as the number of processors increases
Consequence of Amdahl's Law:
Sp ≤ (problem size) / (size of the serial part of the problem)
where size is the number of basic computation steps in the best serial algorithm

Scalability (2)
A parallel system is scalable if it can maintain the efficiency of a parallel algorithm by simultaneously increasing the number of processors and the problem size
E.g. in the above example, efficiency remains at 80% if n is increased with p as 8p log p
But you can't tell me why yet!

The Isoefciency Metric


Measure of the extent to which a parallel system is scalable
Define the overhead, Op, to be the amount of computation not performed in the best serial algorithm:
Op = Cp − W
where W is the problem size
Op includes setup overheads and possible changes to an algorithm to make it parallel, but usually (100% in this course) comprises the communication latency

Overhead in Hypercube-Addition
For the above addition-on-a-hypercube example, at granularity k = n/p:
Tp = n/p + 2 log p
assuming time 1 for an addition and for a single-hop communication. Then
Op = Cp − W = p(n/p + 2 log p) − n = 2p log p

Isoefciency
For a scalable system, the isoefficiency function I determines W in terms of p and E such that the efficiency, Ep, is fixed at some specified constant value E
E = Sp/p = W/Cp = W/(W + Op(W)) = 1/(1 + Op(W)/W)

Isoefciency (2)
Rearranging, 1 + Op(W)/W = 1/E, and so
W = (E/(1 − E)) Op(W)
This is the Isoefficiency Equation
Setting K = E/(1 − E) for our given E, let the solution of this equation (assuming it exists, i.e. for a scalable system) be W = I(p, K), the Isoefficiency function

Back to Hypercube-Addition
For the addition-on-a-hypercube example it's easy: I(p, K) = 2Kp log p
More generally, Op varies with W and the isoefficiency equation is non-trivial, e.g. non-linear
Plenty of examples in the rest of the course!
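
A quick numerical check (again with illustrative unit costs, and taking the serial time as n) that growing the problem as W = 2Kp log p holds the efficiency of hypercube addition at E, consistent with the 80% figure quoted earlier:

```python
# Sketch: efficiency stays fixed when n grows along the isoefficiency function.
import math

def efficiency(n, p):
    tp = n / p + 2 * math.log2(p)        # Tp = n/p + 2 log p (unit costs)
    return (n / tp) / p                  # Ep = Sp / p, serial time ~ n

E = 0.8
K = E / (1 - E)                          # K = 4
for p in (8, 64, 1024):
    n = 2 * K * p * math.log2(p)         # W = I(p, K) = 2 K p log p = 8 p log p
    print(p, round(efficiency(n, p), 3)) # stays at 0.8
```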

Cost-optimality and Isoefciency


A parallel system is cost-optimal if and only if Cp = Θ(W), i.e. its cost is asymptotically the same as the cost of the serial algorithm
This implies the upper bound on the overhead Op(W) = O(W), or the lower bound on the problem size W = Ω(Op(W))
Not surprising: you don't want a bigger overhead than the computation of the solution itself!

Cost-optimality and Isoefciency (2)


For the above example W = Θ(n) and Op(W) = 2p log p, so that the system cannot be cost-optimal unless n = Ω(p log p)
the condition for cost-optimality already derived
the system is then scalable
its isoefficiency function is Θ(p log p)

Minimum Run-Time

Assuming differentiability of the expression for Tp, find p = p0 such that
dTp/dp = 0
giving Tp^min = Tp0

For the above example, Tp = n/p + 2 log p
p0 = n/2 and Tp^min = 2 log n

Not cost-optimal
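
A brute-force check (illustrative only; natural log is used so the minimum matches the differentiation step, which treats d(log p)/dp as 1/p) that Tp = n/p + 2 log p is minimised near p0 = n/2:

```python
# Sketch: locate the minimiser of Tp = n/p + 2 log p by exhaustive search.
import math

def t_p(n, p):
    return n / p + 2 * math.log(p)        # natural log, matching dTp/dp = -n/p^2 + 2/p

n = 1 << 16
best_p = min(range(1, n + 1), key=lambda p: t_p(n, p))
print(best_p, n // 2)                      # both print 32768
```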

Minimum Cost-Optimal Run-Time

For an isoefficiency function Θ(f(p)) (at any efficiency), W = Θ(f(p)), or p = O(f^(-1)(W))
Then the minimum cost-optimal run time is
Tp^min-cost-opt = Θ(W / f^(-1)(W))

For our example, n = f(p) = p log p and we find p = f^(-1)(n) = n/log p ≈ n/log n, so that
Tp^min-cost-opt ≈ 3 log n − 2 log log n

here, the same asymptotic complexity as Tp^min
