Parallel Algorithms: Peter Harrison and William Knottenbelt
Course Structure
18 lectures
6 regular tutorials
2 lab-tutorials
1 revision lecture-tutorial (optional)
Course Assessment
Exam (answer 3 out of 4 questions)
One assessed coursework
One laboratory exercise
Recommended Books
Kumar, Grama, Gupta, Karypis. Introduction to Parallel Computing. Benjamin/Cummings. Second Edition, 2002. First Edition, 1994, is OK.
Main course text
Other Books
Cosnard, Trystram. Parallel Algorithms and Architectures. International Thomson Computer Press, 1995.
Foster. Designing and Building Parallel Programs. Addison-Wesley, 1994.
Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
An old classic
Course Outline
Topic                                    No. of lectures
Architectures & communication networks   4
Parallel performance metrics             2
Dense matrix algorithms                  4
Message Passing Interface (MPI)          2
Sparse matrix algorithms                 2
Dynamic search algorithms                4
TOTAL                                    18
Computer Architectures
1. Sequential
John von Neumann model: CPU + Memory
Single Instruction stream, Single Data stream (SISD)
Predictable performance of (sequential) algorithms with respect to the von Neumann machine
Computer Architectures
2. Parallel
Multiple cooperating processors, classified by control mechanism, memory organisation and interconnection network (IN)
Performance of a parallel algorithm depends on the target architecture and how the algorithm is mapped onto it
Control Mechanisms
Single Instruction stream, Multiple Data stream (SIMD): all processors execute the same instructions synchronously; good for data parallelism
Multiple Instruction stream, Multiple Data stream (MIMD): processors execute their own programs asynchronously; more general, e.g. process networks (static), divide-and-conquer algorithms (dynamic)
Memory Organization
1. Message-passing architecture
Several processors, each with its own (local) memory, interact only by message passing over the IN
Distributed memory architecture
MIMD message-passing architecture = multicomputer
2. Shared address space architecture
Single address space shared by all processors
Interconnection Network
1. Static (or direct) networks
Point-to-point communication amongst processors
Typical in message-passing architectures
Examples are ring, mesh, hypercube
Topology critically affects parallel algorithm performance (see coming lectures)
PRAM role
Natural extension of the von Neumann model with zero-cost communication (via shared memory)
We will use the PRAM to assess the complexity of some parallel algorithms
Gives an upper bound on performance, e.g. minimum achievable latency
Cube Networks
In a k-ary d-cube topology of dimension d and radix k, each processor is connected to 2d others (with wrap-around; d others when k = 2) and there are k processors along each dimension
Regular d-dimensional mesh with k^d processors
Processors labelled by d-digit numbers with radix k
Ring of p processors is a p-ary 1-cube
Wrap-around square mesh of p processors is a √p-ary 2-cube
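As an illustration (our own sketch, not from the slides; the radix K and dimension D values are assumptions), the following C fragment labels a node by its D radix-K digits and lists its wrap-around neighbours:

#include <stdio.h>

#define K 4   /* radix: processors per dimension (assumed value)  */
#define D 2   /* dimension; K = 4, D = 2 is a 4x4 wrap-around mesh */

/* Print the 2*D wrap-around neighbours of the node whose coordinate in
   dimension i is digit[i]. (For K = 2 each neighbour appears twice,
   since the +1 and -1 neighbours coincide.)                           */
void neighbours(const int digit[D])
{
    for (int i = 0; i < D; i++) {
        for (int delta = -1; delta <= 1; delta += 2) {
            int nbr[D];
            for (int j = 0; j < D; j++) nbr[j] = digit[j];
            nbr[i] = (digit[i] + delta + K) % K;          /* wrap-around */
            for (int j = D - 1; j >= 0; j--) printf("%d", nbr[j]);
            printf(" ");
        }
    }
    printf("\n");
}

int main(void)
{
    int node[D] = {0, 0};   /* node 00 of the 4-ary 2-cube           */
    neighbours(node);       /* prints its four neighbours: 03 01 30 10 */
    return 0;
}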
Hypercubes
A k-ary d-cube can be formed from k k-ary (d − 1)-cubes by connecting corresponding nodes into rings, e.g. composition of rings to form a wrap-around mesh
Hypercube = binary (2-ary) d-cube
nodes labelled by binary numbers of d digits
each node connected directly to d others
adjacent nodes differ in exactly one bit
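A minimal sketch (not from the slides): since adjacent hypercube nodes differ in exactly one bit, the d neighbours of a node are obtained by flipping each bit of its label in turn.

#include <stdio.h>

/* Print the d neighbours of hypercube node 'label' (0 <= label < 2^d):
   each neighbour is label XOR (1 << i) for i = 0..d-1.                */
void hypercube_neighbours(unsigned label, int d)
{
    for (int i = 0; i < d; i++)
        printf("%u ", label ^ (1u << i));
    printf("\n");
}

int main(void)
{
    hypercube_neighbours(5, 3);   /* node 101 in a 3-cube: prints 4 7 1 */
    return 0;
}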
This is easily seen recursively, by concatenating the Reflected Gray Code (RGC) mapping for a (d − 1)-hypercube with its reverse and prepending (or appending) a 0 onto one copy and a 1 onto the other ...
2. For d ≥ 1, assume successive node addresses in any d-cube ring mapping differ in only one bit
3. Hence the same applies in each half of the RGC for a (d + 1)-cube
4. But because of the reflection, the same holds for adjacent nodes in different halves.
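The reflect-and-prefix construction has the standard closed form g(i) = i XOR (i >> 1); the following C sketch (our own illustration) builds the d-bit RGC and checks that successive ring positions, including the wrap-around pair, land on adjacent hypercube nodes.

#include <stdio.h>
#include <stdlib.h>

/* i-th entry of the d-bit reflected Gray code. */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* Number of set bits: 1 means the two labels are hypercube neighbours. */
int popcount(unsigned x) { int c = 0; while (x) { c += x & 1u; x >>= 1; } return c; }

int main(void)
{
    const int d = 4, p = 1 << d;               /* ring of 16 nodes in a 4-cube */
    for (int i = 0; i < p; i++) {
        unsigned a = gray(i), b = gray((i + 1) % p);   /* consecutive ring nodes */
        printf("ring position %2d -> cube node %2u\n", i, a);
        if (popcount(a ^ b) != 1) {                    /* must differ in one bit */
            printf("not adjacent!\n");
            return EXIT_FAILURE;
        }
    }
    return 0;
}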
Communication Costs
Time spent sending data between processors in a parallel algorithm is a significant overhead
Communication latency is defined by the switching mechanism and the parameters:
1. Startup time, ts: message preparation, route initialisation etc. Incurred once per message.
2. Per-hop time, or node latency, th: time for the header to pass between directly connected processors. Incurred for every link in a path.
3. Per-word transfer time, tw: tw = 1/r for channel bandwidth r words per second. Relates message length to latency.
Switching Mechanisms
1. Store-and-forward routing
Each intermediate processor on a communication path receives an entire message and only then sends it on to the next node on the path
For a message of size m words, the communication latency on a path of l links is:
tcomm = ts + (m tw + th) l
Typically th is small and so we often approximate
tcomm = ts + m tw l
Wormhole Routing
As soon as a flit (flow-control unit of a message) is completely received, it is sent on to the next node in the message's path (same path for all flits)
No need for buffers for whole messages, unless asynchronous multiple inputs are allowed for the same out-link
Hence more time-efficient and more memory-efficient
But in a bufferless system, messages may become blocked (waiting for a processor already transmitting another message): possible deadlock
Communication Operations
Certain types of computation occur in many parallel algorithms
Some are implemented naturally by particular communication patterns
We consider the following patterns of communication, where the dual operations, with the direction of the communication reversed, are shown in brackets ...
Communication Patterns
simple message transfer between two processors (same for dual)
one-to-all broadcast (single node accumulation)
all-to-all broadcast (multi-node accumulation)
one-to-all personalised, or scatter (single node gather)
all-to-all personalised, or total exchange (multi-node gather)
more exotic patterns, e.g. permutations
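For reference (our own addition, anticipating the MPI lectures), each of these patterns has a direct counterpart among the standard MPI collectives; the sketch below shows one-to-all broadcast and its dual, single-node accumulation.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* One-to-all broadcast: root 0 sends the same word to everyone. */
    int word = (rank == 0) ? 42 : 0;
    MPI_Bcast(&word, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Dual: single-node accumulation -- every node contributes a value,
       combined with an associative operator (here +) at root 0.        */
    int contribution = rank, total = 0;
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value %d, accumulated sum %d\n", word, total);

    /* Similarly: MPI_Allgather ~ all-to-all broadcast, MPI_Scatter /
       MPI_Gather ~ one-to-all personalised and its dual, and
       MPI_Alltoall ~ all-to-all personalised (total exchange).        */
    MPI_Finalize();
    return 0;
}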
Number of hops, l
This depends on the network topology
l is at most:
⌊p/2⌋ for a ring
2⌊√p/2⌋ for a wrap-around square mesh of p processors (⌊a/2⌋ + ⌊b/2⌋ for an a × b mesh)
log p for a hypercube
So for a hypercube with cut-through, Tsmt-ct-h = ts + tw m + th log p
Comparison of SF and CT
If message size m is very small, latency is similar for SF and CT
If message size is large, i.e. m >> l, CT becomes asymptotically independent of path length l
CT much faster than SF
Tsmt-ct ≈ ts + tw m ≈ the single-hop latency under SF
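A small numerical sketch (with assumed parameter values, not from the slides) of the two latency models: T_SF = ts + (m tw + th) l multiplies message size by path length, while T_CT = ts + m tw + th l adds them, so for large m the path length l hardly matters.

#include <stdio.h>

/* Assumed illustrative parameters (seconds): startup, per-word, per-hop. */
static const double ts = 50e-6, tw = 1e-6, th = 2e-6;

double t_sf(double m, double l) { return ts + (m * tw + th) * l; } /* store-and-forward */
double t_ct(double m, double l) { return ts + m * tw + th * l; }   /* cut-through       */

int main(void)
{
    double l = 10.0;                         /* 10-hop path */
    for (double m = 1; m <= 1e6; m *= 100)   /* message sizes in words */
        printf("m = %8.0f  SF = %10.6f s  CT = %10.6f s\n", m, t_sf(m, l), t_ct(m, l));
    return 0;
}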
Reduction
To broadcast the reduction of the data held in all processors with an associative operator, we can:
1. ATA broadcast the data and then reduce locally in every node ... inefficient
2. Single node accumulation at one node followed by OTA broadcast ... better
3. Modify ATA broadcast so that instead of catenating messages, the incoming data and the current accumulated value are operated on by the associative operator (e.g. summed), the result overwriting the accumulated value ... the most efficient
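In MPI terms (a sketch of our own, not from the slides), the three strategies correspond roughly to MPI_Allgather followed by a local reduction, MPI_Reduce followed by MPI_Bcast, and a single MPI_Allreduce, which combines incoming data with the accumulator as it arrives.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int x = rank + 1;                        /* each node's local datum */

    /* 1. ATA broadcast then reduce locally (inefficient). */
    int *all = malloc(p * sizeof(int)), sum1 = 0;
    MPI_Allgather(&x, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
    for (int i = 0; i < p; i++) sum1 += all[i];

    /* 2. Single-node accumulation then OTA broadcast (better). */
    int sum2 = 0;
    MPI_Reduce(&x, &sum2, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&sum2, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* 3. Combine-as-you-go: a single all-reduce (the most efficient). */
    int sum3 = 0;
    MPI_Allreduce(&x, &sum3, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("%d %d %d\n", sum1, sum2, sum3);
    free(all);
    MPI_Finalize();
    return 0;
}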
Parallel Prefix
The Parallel Prefix of a function f over a non-null list [x1, ..., xn] is the list of reductions of f over all sublists [x1, ..., xi] for 1 ≤ i ≤ n, where reduce f [x1] = x1 for all f
Could implement as n reductions
Better to modify the third reduction method by only updating the accumulator at each node when data comes in from the appropriate nodes (otherwise it is just passed on)
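A minimal MPI sketch (our own illustration): MPI_Scan computes exactly this prefix reduction, giving node i the reduction of the data held on nodes 0..i.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Node i holds x_{i+1} = i+1; after the scan it holds x_1 + ... + x_{i+1}. */
    int x = rank + 1, prefix = 0;
    MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("node %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}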
All-to-All Personalised
Every processor sends a distinct message of size m to every other processor: total exchange
E.g. in matrix transpose, FFT, database join
Communication patterns identical to ATA broadcast
Label messages by pairs (x, y) where x is the source processor and y is the destination processor: uniquely determines the message contents
List of n messages denoted [(x1, y1), ..., (xn, yn)]
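A sketch using MPI's total-exchange collective (our own example, not from the slides): MPI_Alltoall delivers to node y the block that node x prepared for it, i.e. exactly the messages labelled (x, y) above.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* sendbuf[y] is the (here single-word) message (rank, y). */
    int *sendbuf = malloc(p * sizeof(int));
    int *recvbuf = malloc(p * sizeof(int));
    for (int y = 0; y < p; y++) sendbuf[y] = 100 * rank + y;

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    /* recvbuf[x] now holds the message (x, rank) sent by node x. */
    printf("node %d received:", rank);
    for (int x = 0; x < p; x++) printf(" %d", recvbuf[x]);
    printf("\n");

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}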
Performance Metrics
1. Run Time, Tp
A parallel algorithm is hard to justify without improved run-time
Tp = elapsed time on p processors: from the start of computation on the first processor to start, to the termination of computation on the last processor to finish
A parallel algorithm is cost-optimal if Cp = Θ(best serial run time)
Equivalently, if Ep = Θ(1)
The above example is not cost-optimal since the best serial run time is Θ(p)
Granularity
Amount of work allocated to each processor
Few processors, relatively large processing load on each: coarse-grained parallel algorithm
Many processors, relatively small processing load on each: fine-grained parallel algorithm, e.g. our hypercube algorithm to add up p numbers; typically many small communications, often in parallel
Each processor adds k numbers locally in Θ(k) time
The p partial sums are added in Θ(log p) time
Tp = Θ(k + log p) and Cp = Θ(n + p log p)
Cost-optimal if n = Ω(p log p)
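A sketch of the algorithm just analysed (our own, assuming p is a power of two and n divisible by p): each node sums its k = n/p local numbers, then the partial sums are combined in log p exchange-and-add steps along the hypercube dimensions. In practice a single MPI_Allreduce would do the same job; the explicit loop mirrors the hypercube algorithm.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);        /* assume p is a power of two */

    const long long n = 1 << 20;              /* assumed problem size          */
    const long long k = n / p;                /* granularity: numbers per node */

    /* Theta(k): local sum of this node's block of k numbers (here 1..n). */
    long long local = 0;
    for (long long i = rank * k + 1; i <= (rank + 1) * k; i++) local += i;

    /* Theta(log p): exchange-and-add along each hypercube dimension;
       after the loop every node holds the global sum (an all-reduce).  */
    for (int bit = 1; bit < p; bit <<= 1) {
        int partner = rank ^ bit;             /* neighbour across this dimension */
        long long theirs = 0;
        MPI_Sendrecv(&local, 1, MPI_LONG_LONG, partner, 0,
                     &theirs, 1, MPI_LONG_LONG, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local += theirs;
    }

    if (rank == 0) printf("sum = %lld (expected %lld)\n", local, n * (n + 1) / 2);
    MPI_Finalize();
    return 0;
}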
Scalability
Efficiency decreases as the number of processors increases
Consequence of Amdahl's Law:
Sp ≤ problem size / size of serial part of problem
where size is the number of basic computation steps in the best serial algorithm
Scalability (2)
A parallel system is scalable if it can maintain the efficiency of a parallel algorithm by simultaneously increasing the number of processors and the problem size
E.g. in the above example, efficiency remains at 80% if n is increased with p as 8p log p
But you can't tell me why yet!
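A small numerical check of this claim (our own sketch, using Tp = n/p + 2 log p from the next slide, so that E = n/(p Tp) = n/(n + 2p log p)): with n = 8p log p the efficiency is 8/(8 + 2) = 80% for every p.

#include <stdio.h>

int main(void)
{
    for (int d = 2; d <= 16; d += 2) {
        int p = 1 << d;                       /* p = 2^d processors        */
        double n = 8.0 * p * d;               /* n = 8 p log2 p            */
        double E = n / (n + 2.0 * p * d);     /* E = n / (n + 2 p log2 p)  */
        printf("p = %6d  n = %10.0f  E = %.2f\n", p, n, E);
    }
    return 0;
}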
Overhead in Hypercube-Addition
For the above addition on a hypercube example, at granularity k = n/p,
Tp = n/p + 2 log p
assuming time 1 for an addition and for a single-hop communication. Then
Op = p Tp − W = 2p log p
Isoefficiency
For a scalable system, the isoefficiency function I determines W in terms of p and E such that the efficiency, Ep, is fixed at some specified constant value E:
E = Sp / p = W / Cp
where Cp = W + Op(W), so that
1/E = 1 + Op(W) / W
Isoefficiency (2)
Rearranging, 1 + Op(W)/W = 1/E, and so
W = (E / (1 − E)) Op(W)
This is the Isoefficiency Equation
Setting K = E/(1 − E) for our given E, let the solution of this equation (assuming it exists, i.e. for a scalable system) be W = I(p, K), the Isoefficiency function
Back to Hypercube-Addition
For the addition on a hypercube example it's easy:
I(p, K) = 2K p log p
More generally, Op varies with W and the isoefficiency equation is non-trivial, e.g. non-linear
Plenty of examples in the rest of the course!
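A small check (our own sketch): with K = E/(1 − E), the workload W = I(p, K) = 2Kp log p keeps the efficiency W/(W + Op(W)) at E as p grows; for E = 0.8 this gives K = 4 and W = 8p log p, matching the 80% scalability example earlier.

#include <stdio.h>

int main(void)
{
    const double E = 0.8;                    /* target efficiency            */
    const double K = E / (1.0 - E);          /* K = E/(1-E) = 4              */

    for (int d = 2; d <= 20; d += 3) {
        double p  = (double)(1 << d);        /* p = 2^d processors           */
        double W  = 2.0 * K * p * d;         /* W  = I(p,K) = 2 K p log2 p   */
        double Op = 2.0 * p * d;             /* Op(W) = 2 p log2 p           */
        printf("p = %8.0f  W = %12.0f  efficiency = %.3f\n", p, W, W / (W + Op));
    }
    return 0;
}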
Minimum Run-Time
Assuming differentiability of the expression for Tp, find p = p0 such that
dTp/dp = 0
giving Tp^min = Tp0
For our example, n = f(p) = p log p and we find p = f^-1(n) = n/log p ≈ n/log n, so that
Tp^min-cost-opt = n/p + 2 log p = Θ(log n)
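A numerical sanity check (our own sketch, with an assumed n): sweeping p for Tp = n/p + 2 log2 p shows the minimiser is proportional to n (p0 = n/2 if the derivative is taken with natural logs), while restricting to the cost-optimal p ≈ n/log n still gives a run time of order log n.

#include <stdio.h>
#include <math.h>

/* Run time of the hypercube addition with p processors on n numbers. */
double Tp(double n, double p) { return n / p + 2.0 * log2(p); }

int main(void)
{
    const double n = 1 << 16;                 /* assumed problem size */

    /* Sweep p over powers of two to locate the minimum of Tp. */
    double best_p = 1, best_t = Tp(n, 1);
    for (double p = 2; p <= n; p *= 2)
        if (Tp(n, p) < best_t) { best_t = Tp(n, p); best_p = p; }

    double p_opt = n / log2(n);               /* largest cost-optimal p (approx.) */
    printf("minimiser ~ p = %.0f (order n; n/2 = %.0f), Tp = %.1f\n", best_p, n / 2, best_t);
    printf("cost-optimal p ~ %.0f, Tp = %.1f (log2 n = %.1f)\n", p_opt, Tp(n, p_opt), log2(n));
    return 0;
}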