Parallel Algorithms: Peter Harrison and William Knottenbelt

This document outlines the structure and content of a course on parallel algorithms. It comprises 18 lectures covering topics in parallel computing including architectures, performance metrics, dense and sparse matrix algorithms, and message passing. Assessment includes an exam and coursework. It recommends textbooks on parallel computing and provides an outline of lecture topics, including computer architectures, memory organization, interconnection networks, and embeddings of networks into hypercubes.


Parallel Algorithms

Peter Harrison and William Knottenbelt


Email: {pgh,wjk}@doc.ic.ac.uk

Department of Computing, Imperial College London


Course Structure
18 lectures
6 regular tutorials
2 lab-tutorials
1 revision lecture-tutorial (optional)

Course Assessment
Exam (answer 3 out of 4 questions)
One assessed coursework
One laboratory exercise

Recommended Books
Kumar, Grama, Gupta, Karypis. Introduction to Parallel Computing. Benjamin/Cummings. Second Edition, 2002. (The First Edition, 1994, is OK.)
Main course text

Freeman and Phillips. Parallel Numerical Algorithms. Prentice-Hall, 1992.
Main text for the material on differential equations

Other Books
Cosnard, Trystram. Parallel Algorithms and Architectures. International Thomson Computer Press, 1995.
Foster. Designing and Building Parallel Programs. Addison-Wesley, 1994.
Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
An old classic

Course Outline
Topic                                    No. of lectures
Architectures & communication networks   4
Parallel performance metrics             2
Dense matrix algorithms                  4
Message Passing Interface (MPI)          2
Sparse matrix algorithms                 2
Dynamic search algorithms                4
TOTAL                                    18

Computer Architectures
1. Sequential
John von Neumann model: CPU + Memory
Single Instruction stream, Single Data stream (SISD)
Predictable performance of (sequential) algorithms with respect to the von Neumann machine

Computer Architectures
2. Parallel
Multiple cooperating processors, classified by control mechanism, memory organisation, interconnection network (IN)
Performance of a parallel algorithm depends on the target architecture and how it is mapped

Control Mechanisms
Single Instruction stream, Multiple Data stream (SIMD): all processors execute the same instructions synchronously
good for data parallelism
Multiple Instruction stream, Multiple Data stream (MIMD): processors execute their own programs asynchronously
more general: process networks (static), divide-and-conquer algorithms (dynamic)

Control Mechanisms: hybrid

Single Program, Multiple Data stream (SPMD): all processors run the same program asynchronously
Hybrid SIMD / MIMD
also suitable for data-parallelism, but needs explicit synchronisation

Memory Organization
1. Message-passing architecture
Several processors, each with their own (local) memory, interact only by message passing over the IN
Distributed memory architecture
MIMD message-passing architecture: a multicomputer
2. Shared address space architecture
Single address space shared by all processors

Memory Organization (2)


2. Shared address space architecture (cont.)
Multiprocessor architecture
Uniform memory access (UMA): (average) access time is the same for all memory blocks, e.g. single memory bank (or hierarchy)
Otherwise non-uniform memory access (NUMA): e.g. the global address space is distributed across the processors' local memories (distributed shared memory multiprocessor)
Also cache hierarchies imply less uniformity

Interconnection Network
1. Static (or direct) networks
Point-to-point communication amongst processors
Typical in message-passing architectures
Examples are ring, mesh, hypercube
Topology critically affects parallel algorithm performance (see coming lectures)

Interconnection Network (2)


2. Dynamic (or indirect) networks
Connections between processors are constructed dynamically during execution using switches, e.g. crossbars, or networks of these such as multistage banyan (or delta, or omega, or butterfly) networks
Typically used to implement shared address space architectures
But also in some message-passing algorithms, e.g. the FFT on a butterfly (see textbooks)

Parallel Random Access Machine


The PRAM is an idealised model of computation on a shared-memory MIMD computer
Fixed number p of processors
Unbounded UMA global memory
All instructions last one cycle
Synchronous operation (common clock), but different instructions are allowed in different processors on the same cycle

PRAM memory access modes


Four modes of simultaneous memory access (2 types of access, 2 modes):
EREW: Exclusive read, exclusive write. Weakest PRAM model, minimum concurrency.
CREW: Concurrent read, exclusive write. Better.
CRCW: Concurrent read, concurrent write. Maximum concurrency. Can be simulated on an EREW PRAM (exercise).
ERCW: Exclusive read, concurrent write. Unusual?

Concurrent Write Semantics


Arbitration is needed to define a unique semantics of concurrent write in CRCW and ERCW PRAMs
Common: All values to be written are the same
Arbitrary: Pick one writer at random
Priority: All processors have a preassigned priority
Reduce: Write the (generalised) sum of all values attempting to be written. The sum can be any associative and commutative operator, cf. reduce or fold in functional languages.
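
As an illustration, the four arbitration policies can be simulated in a few lines. This sketch is not from the course notes; the function name and the assumption that a lower processor id means higher priority are purely illustrative.

```python
# Sketch: resolve a set of concurrent writes to one PRAM memory cell under the
# four arbitration policies named above.
import random

def resolve_concurrent_write(writes, policy, op=lambda a, b: a + b):
    """writes: list of (processor_id, value) pairs attempting to write one cell."""
    if policy == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common policy requires all written values to agree"
        return values.pop()
    if policy == "arbitrary":
        return random.choice(writes)[1]
    if policy == "priority":
        return min(writes)[1]              # assumption: lower processor id = higher priority
    if policy == "reduce":
        result = writes[0][1]
        for _, v in writes[1:]:
            result = op(result, v)         # op must be associative and commutative
        return result
    raise ValueError(policy)

writes = [(3, 5), (1, 2), (7, 9)]
print(resolve_concurrent_write(writes, "priority"))  # 2
print(resolve_concurrent_write(writes, "reduce"))    # 16
```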

PRAM role
Natural extension of the von Neumann model, with zero-cost communication (via shared memory)
We will use the PRAM to assess the complexity of some parallel algorithms
Gives an upper bound on performance, e.g. minimum achievable latency

Static Interconnection Networks


1. Completely connected
direct link between every pair of processors
ideal performance but complex and expensive
2. Star
all communication through a special central processor
central processor liable to become a bottleneck
logically equivalent to a bus
associated with shared memory machines (dynamic network)

Static Interconnection Networks (2)


3. Linear array and ring
connect processors in tandem; with wrap-around this gives a ring
communication via multiple hops over links through intermediate processors
basis for quantitative analysis of many other common networks

Static Interconnection Networks (3)


4. Mesh
generalisation of the linear array (or, with wrap-around, the ring) to more than one dimension
processors labelled by rectilinear coordinates
links between adjacent processors on each coordinate axis (i.e. in each dimension)
multiple paths between source and destination processors

Static Interconnection Networks (4)


5. Tree
unique path between any pair of processors
processors reside at the leaves of the tree
internal nodes may be processors (typical in static networks) or switches (typical in dynamic networks)
bottlenecks higher up the tree
can alleviate by increasing bandwidth at higher levels: fat tree (e.g. in the CM5)

Cube Networks
In a k-ary d-cube topology of dimension d and radix k, each processor is connected to d others (with wrap-around) and there are k processors along each dimension
Regular d-dimensional mesh with k^d processors
Processors labelled by a d-digit number in radix k
Ring of p processors is a p-ary 1-cube
Wrap-around square mesh of p processors is a √p-ary 2-cube
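
A quick way to see the labelling is to convert a node index into its d radix-k digits and enumerate its wrap-around neighbours one dimension at a time. The sketch below is illustrative only (the helper names are not from the notes); note that for k = 2 the +1 and -1 wrap-around neighbours in a dimension coincide.

```python
# Sketch: radix-k labels and wrap-around neighbours in a k-ary d-cube.

def digits(node, k, d):
    """d-digit radix-k label of a node, least significant digit first."""
    return [(node // k**i) % k for i in range(d)]

def neighbours(node, k, d):
    result = set()
    for i in range(d):                      # one dimension at a time
        digit = (node // k**i) % k
        for delta in (+1, -1):              # wrap-around along this dimension
            new_digit = (digit + delta) % k
            result.add(node + (new_digit - digit) * k**i)
    return sorted(result)

print(digits(23, 4, 3))        # 23 = 1*16 + 1*4 + 3 -> [3, 1, 1]
print(neighbours(0, 2, 3))     # binary 3-cube: [1, 2, 4]
```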

Hypercubes
A k-ary d-cube can be formed from k k-ary (d−1)-cubes by connecting corresponding nodes into rings
e.g. composition of rings to form a wrap-around mesh
Hypercube: a binary (2-ary) d-cube
nodes labelled by binary numbers of d digits
each node connected directly to d others
adjacent nodes differ in exactly one bit

Embeddings into Hypercubes


The hypercube is the most richly connected topology we have considered (apart from completely connected), so can we consider other topologies as embedded subnetworks?
1. Ring of 2^d nodes
Need to find a sequence of adjacent nodes, with wraparound, in a d-hypercube
Adjacent node labels differ in exactly one bit position

Mapping: ring → hypercube

Assign processor i in the ring to node G(i, d) in the hypercube, where G is the binary reflected Gray code (RGC) defined by:
G(0, 1) = 0, G(1, 1) = 1
G(i, n+1) = G(i, n) if i < 2^n
G(i, n+1) = 2^n + G(2^(n+1) − 1 − i, n) if i ≥ 2^n

This is easily seen recursively, by concatenating the mapping for a (d−1)-hypercube with its reverse and prepending (or appending) a 0 onto one copy and a 1 onto the other .....
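
A minimal sketch (not part of the slides) of the recursive RGC definition above, with a check that consecutive ring positions land on adjacent hypercube nodes:

```python
# Sketch: the reflected Gray code mapping G(i, d) and a check that ring
# neighbours (including the wrap-around pair) map to hypercube neighbours.

def gray(i, d):
    """Binary reflected Gray code of i on d bits, following the recursive definition."""
    if d == 1:
        return i                        # G(0,1) = 0, G(1,1) = 1
    half = 1 << (d - 1)                 # 2^(d-1)
    if i < half:
        return gray(i, d - 1)
    return half + gray(2 * half - 1 - i, d - 1)

def is_hypercube_edge(a, b):
    """Adjacent hypercube nodes differ in exactly one bit."""
    return bin(a ^ b).count("1") == 1

d = 4
p = 1 << d
ring = [gray(i, d) for i in range(p)]
assert all(is_hypercube_edge(ring[i], ring[(i + 1) % p]) for i in range(p))
print(ring)  # [0, 1, 3, 2, 6, 7, 5, 4, 12, 13, 15, 14, 10, 11, 9, 8]
```

The closed form of the same code is G(i, d) = i XOR (i >> 1), which the recursion above reproduces.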

Why is this true?


Proof by induction: a sketch (all that is necessary here) is:
1. Certainly true for d = 1, when 0 → 0 and 1 → 1
2. For d ≥ 1, assume successive node addresses in any d-cube ring mapping differ in only one bit
3. Hence the same applies in each half of the RGC for a (d+1)-cube
4. But because of the reflection, the same holds for adjacent nodes in different halves.

Mapping: mesh → hypercube

The mapping for an m-dimensional mesh is obtained by concatenating the RGCs for each individual dimension
Thus node (i1, . . . , im) in a 2^r1 × . . . × 2^rm mesh maps to node G(i1, r1) <> . . . <> G(im, rm), where <> denotes concatenation of bit strings
E.g. in an 8 × 8 square mesh, the node at coordinate (2, 7) maps to hypercube node (0, 1, 1, 1, 0, 0).
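
The per-dimension concatenation is easy to sketch. The snippet below is illustrative (the function name is not from the notes) and uses the closed form G(i, r) = i XOR (i >> 1) of the RGC so that it stands alone:

```python
# Sketch: map a mesh coordinate to a hypercube node label by concatenating
# per-dimension reflected Gray codes.

def gray(i):
    return i ^ (i >> 1)                     # closed form of the RGC

def mesh_to_hypercube(coord, radix_bits):
    """coord = (i1, ..., im); radix_bits = (r1, ..., rm) for a 2^r1 x ... x 2^rm mesh."""
    label = 0
    for i, r in zip(coord, radix_bits):
        label = (label << r) | gray(i)      # concatenate the G(i, r) bit strings
    return label

# The slide's example: node (2, 7) in an 8 x 8 mesh -> 011100 in binary
print(format(mesh_to_hypercube((2, 7), (3, 3)), "06b"))  # '011100'
```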

Mapping: tree → hypercube

Consider a (complete) binary tree of depth d with processors at the leaves only
This embeds into a d-hypercube as follows, via a many-to-one mapping that maps every tree node:
1. map the root (level 0) to any node, e.g. (0, . . . , 0)
2. for each node at level j, if mapped to hypercube node k, map the left child to k and the right child to k with bit j inverted
3. repeat for j = 1, . . . , d

Monotonicity of the mapping


Distance between two tree-nodes is 2n for some n ≥ 1 (n is the difference between d and the level of the lowest common ancestor)
The corresponding distance in the hypercube is ≤ n (think of bit-changes)
Nodes further apart in the hypercube must be further apart in the tree, but the converse may not hold: because of the richer hypercube connectivity some bits might flip back
distant tree-nodes might happen to be closer in the hypercube, e.g. some tree-nodes at the maximum tree distance 2d are adjacent

Communication Costs
Time spent sending data between processors in a parallel algorithm is a significant overhead
communication latency is defined by the switching mechanism and the parameters:
1. Startup time, ts: message preparation, route initialisation etc. Incurred once per message.
2. Per-hop time, or node latency, th: time for the header to pass between directly connected processors. Incurred for every link in a path.
3. Per-word transfer time, tw: tw = 1/r for channel bandwidth r words per second. Relates message length to latency.

Switching Mechanisms
1. Store-and-forward routing
Each intermediate processor on a communication path receives an entire message and only then sends it on to the next node on the path
For a message of size m words, the communication latency on a path of l links is:
tcomm = ts + (m tw + th) l
Typically th is small and so we often approximate
tcomm = ts + m tw l

Switching Mechanisms (2)


2. Cut-through routing
Reduce idle time of resources by pipelining messages along a path in pieces
Messages are advanced to the out-link of a node as they arrive at the in-link
Wormhole routing splits messages into flits (flow-control digits) which are then pipelined

Wormhole Routing
As soon as a flit is completely received, it is sent on to the next node in the message's path (same path for all flits)
No need for buffers for whole messages, unless asynchronous multiple inputs are allowed for the same out-link
Hence more time-efficient and more memory-efficient
But in a bufferless system, messages may become blocked (waiting for a processor already transmitting another message): possible deadlock

Wormhole Routing (2)


On an l-link path, header flit latency = l th
The whole m-word message arrives m tw after the header
For a message of size m words, the communication latency on a path of l links is therefore:
tcomm = ts + m tw + l th
Θ(m + l) for cut-through vs. Θ(ml) for store-and-forward
similar for small l (identical for l = 1)
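
To get a feel for the two formulas, the short sketch below (not from the course; the values of ts, tw and th are purely illustrative) evaluates both latencies over a range of message sizes and path lengths:

```python
# Sketch: compare store-and-forward and cut-through latency.

def t_store_and_forward(m, l, ts=50.0, tw=1.0, th=0.5):
    """tcomm = ts + (m*tw + th) * l"""
    return ts + (m * tw + th) * l

def t_cut_through(m, l, ts=50.0, tw=1.0, th=0.5):
    """tcomm = ts + m*tw + l*th"""
    return ts + m * tw + l * th

for m in (1, 10, 1000):              # message sizes (words)
    for l in (1, 4, 16):             # path lengths (hops)
        print(m, l, t_store_and_forward(m, l), t_cut_through(m, l))
# For large m and l > 1, cut-through is far cheaper; for l = 1 the two coincide.
```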

Communication Operations
Certain types of computation occur in many parallel algorithms
Some are implemented naturally by particular communication patterns
We consider the following patterns of communication, where the dual operations, with the direction of the communication reversed, are shown in brackets . . .

Communication Patterns
simple message transfer between two processors (same for dual)
one-to-all broadcast (single-node accumulation)
all-to-all broadcast (multi-node accumulation)
one-to-all personalised (single-node gather)
all-to-all personalised, or scatter (multi-node gather)
more exotic patterns, e.g. permutations

Simple Message Transfer


Most basic type of communication
Dual operation is of the same type
Latency for a single message is:
Tsmt-sf = ts + tw m l + th l for store-and-forward routing
Tsmt-ct = ts + tw m + th l for cut-through routing
where l is the number of hops . . .

Number of hops, l
This depends on the network topology
l is at most:
⌊p/2⌋ for a ring
2⌊√p/2⌋ for a wrap-around square mesh of p processors (⌊a/2⌋ + ⌊b/2⌋ for an a × b mesh)
log p for a hypercube
So for a hypercube with cut-through, Tsmt-ct-h = ts + tw m + th log p
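
The worst-case hop counts above are easy to tabulate; the sketch below is illustrative only and assumes p is a power of two (and a perfect square for the mesh).

```python
# Sketch: worst-case hop counts for ring, wrap-around square mesh and hypercube.
import math

def max_hops_ring(p):
    return p // 2

def max_hops_mesh(p):
    side = math.isqrt(p)                 # sqrt(p) processors per side
    return 2 * (side // 2)

def max_hops_hypercube(p):
    return int(math.log2(p))

for p in (16, 64, 256):
    print(p, max_hops_ring(p), max_hops_mesh(p), max_hops_hypercube(p))
```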

Comparison of SF and CT
If message size m is very small, latency is similar for SF and CT
If message size is large, i.e. m >> l, CT becomes asymptotically independent of path length l
CT much faster than SF: Tsmt-ct ≈ tw m ≈ single-hop latency under SF

One-to-All Broadcast (OTA)


A single processor sends data to all, or a subset of, the other processors
E.g. matrix-vector multiplication: broadcast each element of the vector over its corresponding column
In the dual operation, single-node accumulation, data may not only be collected but also combined by an associative operator
e.g. sum a list of elements initially distributed over processors
cf. concurrent write in PRAM

All-to-All Broadcast (ATA)


Each processor performs (simultaneously) a one-to-all broadcast with its own data
Used in matrix operations, e.g. matrix multiplication, reduction and parallel prefix
In the dual operation, multi-node accumulation, each processor receives a single-node accumulation
Could implement ATA by sequentially performing p OTAs
Far better to proceed in parallel and concatenate incoming data

Reduction
To broadcast the reduction of the data held in all processors with an associative operator, we can:
1. ATA broadcast the data and then reduce locally in every node . . . inefficient
2. Single-node accumulation at one node followed by OTA broadcast . . . better
3. Modify the ATA broadcast so that instead of concatenating messages, the incoming data and the current accumulated value are combined by the associative operator (e.g. summed), the result overwriting the accumulated value . . . the most efficient
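
On a hypercube, the third scheme amounts to recursive doubling: in step i every node combines with the neighbour differing in bit i. The sketch below is a sequential simulation (not the lecturers' code, and not real message-passing):

```python
# Sketch: all-to-all reduction on a d-hypercube by recursive doubling.
# After d steps every node holds the reduction of all p = 2^d initial values.
from operator import add

def hypercube_allreduce(values, op=add):
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    vals = list(values)
    for bit in range(d):                   # one exchange per hypercube dimension
        nxt = vals[:]
        for node in range(p):
            partner = node ^ (1 << bit)    # neighbour differing in this bit
            nxt[node] = op(vals[node], vals[partner])
        vals = nxt
    return vals

print(hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))  # [36, 36, ..., 36]
```

Each of the d = log p steps does one single-hop exchange and one application of the operator per node, matching the Θ(log p) combining time used later in the notes.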

Parallel Prefix

The parallel prefix of a function f over a non-null list [x1, . . . , xn] is the list of reductions of f over all sublists [x1, . . . , xi] for 1 ≤ i ≤ n, where reduce f [x1] = x1 for all f
Could implement as n reductions
Better to modify the third reduction method by only updating the accumulator at each node when data comes in from the appropriate nodes (otherwise it is just passed on)
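
The corresponding modification only folds incoming data into the result when it comes from the appropriate (here: lower-numbered) nodes. A simulated sketch under the same assumptions as the reduction example (p a power of two, sums as the operator):

```python
# Sketch: prefix sum on a hypercube. Node i ends up holding x_0 + ... + x_i.

def hypercube_prefix_sum(values):
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    result = list(values)          # running prefix at each node
    total = list(values)           # running reduction of each node's subcube
    for bit in range(d):
        incoming = [total[node ^ (1 << bit)] for node in range(p)]
        for node in range(p):
            partner = node ^ (1 << bit)
            total[node] += incoming[node]
            if partner < node:     # only accumulate data from lower-numbered nodes
                result[node] += incoming[node]
    return result

print(hypercube_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```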

All-to-All Personalised
Every processor sends a distinct message of size m to every other processor: total exchange
E.g. in matrix transpose, FFT, database join
Communication patterns identical to ATA
Label messages by pairs (x, y) where x is the source processor and y is the destination processor: this uniquely determines the message contents
A list of n messages is denoted [(x1, y1), . . . , (xn, yn)]

Performance Metrics
1. Run time, Tp
A parallel algorithm is hard to justify without improved run-time
Tp = elapsed time on p processors: from when the first processor starts computing to when the last processor finishes

Performance Metrics (2)


2. Speed-up, Sp
Sp = (serial run-time of the best sequential algorithm) / Tp
the best algorithm is the optimal one for the problem, if known, or the fastest known, if not
often in practice (always in this course) the serial time used is T1
Sp ≥ 1 . . . usually!

Example: addition on a hypercube

Add up p = 2^d numbers on a d-hypercube
Use single-node accumulation
Each single-hop communication is combined with one addition operation, so Tp = Θ(log p)
Sp = Θ(p / log p)

Performance Metrics (2)


3. Efficiency, Ep
Ep = Sp / p
Fraction of time for which a processor is doing useful work
Ep = Θ(1 / log p) in the above example

Performance Metrics (3)


4. Cost, Cp
Cp = p Tp
so that Ep = (best serial run-time) / Cp

A parallel algorithm is cost-optimal if Cp = Θ(best serial run time)
Equivalently, if Ep = Θ(1)
The above example is not cost-optimal since the best serial run time is Θ(p)
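
For concreteness, the metrics for the hypercube-addition example can be computed with illustrative unit costs (one time unit per addition and per single-hop communication); the numbers below are not from the slides.

```python
# Sketch: speed-up, efficiency and cost for hypercube addition of p numbers.
import math

def metrics(p, t_serial, t_parallel):
    speedup = t_serial / t_parallel          # Sp
    efficiency = speedup / p                 # Ep = Sp / p
    cost = p * t_parallel                    # Cp = p * Tp
    return speedup, efficiency, cost

p = 1024                                     # p = 2^d processors
t_serial = p - 1                             # p - 1 additions serially
t_parallel = 2 * math.log2(p)                # log p additions + log p hops
print(metrics(p, t_serial, t_parallel))      # Ep ~ 1/(2 log p) = 0.05: not cost-optimal
```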

Granularity
Amount of work allocated to each processor
Few processors, relatively large processing load on each: coarse-grained parallel algorithm
Many processors, relatively small processing load on each: fine-grained parallel algorithm, e.g. our hypercube algorithm to add up p numbers
typically many small communications, often in parallel

Increasing the granularity


Let each processor simulate k processors of a finer-grained parallel algorithm
Computation at each processor increases by a factor k
Communication time increases by a factor ≤ k (typically << k), but messages may be much larger, e.g. k parallel communications may map to a single communication k times bigger
Hence Tp/k ≤ k Tp and so Cp/k ≤ Cp
Cost-optimality is preserved; may it even be created?

Addition on Hypercube Again


Add n numbers on a d-hypercube of p = 2^d processors
Let each processor simulate k = n/p processes (assuming p | n)

Each processor locally adds k numbers in Θ(k) time
The p partial sums are added in Θ(log p) time
Tp = Θ(k + log p) and Cp = Θ(n + p log p)
Cost-optimal if n = Ω(p log p)
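
A sketch of this coarse-grained scheme: Θ(k) local additions per processor followed by Θ(log p) combining steps over the hypercube dimensions. The function name and the sequential simulation are illustrative, not the course's code.

```python
# Sketch: add n numbers on a p-processor hypercube, p a power of two, p | n.

def add_n_on_hypercube(numbers, p):
    assert len(numbers) % p == 0, "assume p | n"
    k = len(numbers) // p
    # Theta(k) local work: each processor sums its own block of k numbers
    partial = [sum(numbers[i * k:(i + 1) * k]) for i in range(p)]
    d = p.bit_length() - 1
    assert 1 << d == p
    # Theta(log p) combining steps, one per hypercube dimension
    for bit in range(d):
        partial = [partial[node] + partial[node ^ (1 << bit)] for node in range(p)]
    return partial[0]

print(add_n_on_hypercube(list(range(1, 65)), p=8))  # 2080
```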

Addition on Hypercube Again (2)


Alternatively, try communication in the first log p steps, followed by local addition of n/p numbers
Tp = Θ((n/p) log p)
So Cp = Θ(n log p) = log p · Θ(C1)
never cost-optimal

Scalability
Efficiency decreases as the number of processors increases
Consequence of Amdahl's Law:
Sp ≤ (problem size) / (size of the serial part of the problem)
where size is the number of basic computation steps in the best serial algorithm

Scalability (2)
A parallel system is scalable if it can maintain the efficiency of a parallel algorithm by simultaneously increasing the number of processors and the problem size
E.g. in the above example, efficiency remains at 80% if n is increased with p as 8p log p
But you can't tell me why yet!

The Isoefciency Metric


Measure of the extent to which a parallel system is scalable
Define the overhead, Op, to be the amount of computation not performed in the best serial algorithm:
Op = Cp − W
where W is the problem size
Op includes setup overheads and possible changes to an algorithm to make it parallel, but usually (100% in this course) comprises the communication latency

Overhead in Hypercube-Addition
For the above addition-on-a-hypercube example, at granularity k = n/p:
Tp = n/p + 2 log p
assuming time 1 for an addition and for a single-hop communication. Then
Op = Cp − W = p(n/p + 2 log p) − n = 2p log p

Isoefciency
For a scalable system, the isoefficiency function I determines W in terms of p and E such that the efficiency, Ep, is fixed at some specified constant value E
E = Sp/p = W/Cp = W/(W + Op(W)) = 1/(1 + Op(W)/W)

Isoefciency (2)
Rearranging, 1 + Op(W)/W = 1/E, and so
W = (E/(1 − E)) Op(W)
This is the Isoefficiency Equation
Setting K = E/(1 − E) for our given E, let the solution of this equation (assuming it exists, i.e. for a scalable system) be W = I(p, K), the Isoefficiency function

Back to Hypercube-Addition
For the addition-on-a-hypercube example it's easy: I(p, K) = 2Kp log p
More generally, Op varies with W and the isoefficiency equation is non-trivial, e.g. non-linear
Plenty of examples in the rest of the course!
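
A quick numerical check (again with illustrative unit costs, and taking the serial time as n) that growing the problem as W = 2Kp log p holds the efficiency of hypercube addition at E, consistent with the 80% figure quoted earlier:

```python
# Sketch: efficiency stays fixed when n grows along the isoefficiency function.
import math

def efficiency(n, p):
    tp = n / p + 2 * math.log2(p)        # Tp = n/p + 2 log p (unit costs)
    return (n / tp) / p                  # Ep = Sp / p, serial time ~ n

E = 0.8
K = E / (1 - E)                          # K = 4
for p in (8, 64, 1024):
    n = 2 * K * p * math.log2(p)         # W = I(p, K) = 2 K p log p = 8 p log p
    print(p, round(efficiency(n, p), 3)) # stays at 0.8
```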

Cost-optimality and Isoefciency


A parallel system is cost-optimal if and only if Cp = Θ(W), i.e. its cost is asymptotically the same as the cost of the serial algorithm
This implies the upper bound on the overhead Op(W) = O(W), or the lower bound on the problem size W = Ω(Op(W))
Not surprising: you don't want a bigger overhead than the computation of the solution itself!

Cost-optimality and Isoefciency (2)


For the above example W = Θ(n) and Op(W) = 2p log p, so that the system cannot be cost-optimal unless n = Ω(p log p)
the condition for cost-optimality already derived
the system is then scalable
its isoefficiency function is Θ(p log p)

Minimum Run-Time

Assuming differentiability of the expression for Tp, find p = p0 such that
dTp/dp = 0
giving Tp^min = Tp0

For the above example, Tp = n/p + 2 log p
p0 = n/2 and Tp^min = 2 log n

Not cost-optimal
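
A brute-force check (illustrative only; natural log is used so the minimum matches the differentiation step, which treats d(log p)/dp as 1/p) that Tp = n/p + 2 log p is minimised near p0 = n/2:

```python
# Sketch: locate the minimiser of Tp = n/p + 2 log p by exhaustive search.
import math

def t_p(n, p):
    return n / p + 2 * math.log(p)        # natural log, matching dTp/dp = -n/p^2 + 2/p

n = 1 << 16
best_p = min(range(1, n + 1), key=lambda p: t_p(n, p))
print(best_p, n // 2)                      # both print 32768
```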

Minimum Cost-Optimal Run-Time

For an isoefficiency function Θ(f(p)) (at any efficiency), W = Θ(f(p)), or p = O(f^(-1)(W))
Then the minimum cost-optimal run time is
Tp^min-cost-opt = Θ(W / f^(-1)(W))

For our example, n = f(p) = p log p and we find p = f^(-1)(n) = n/log p ≈ n/log n, so that
Tp^min-cost-opt ≈ 3 log n − 2 log log n

here, the same asymptotic complexity as Tp^min
