
IV. Physical organization and models

March 9th, 2009


Content

„ Physical organization:
‰ radius-based classification,
‰ multicore,
‰ clusters,
‰ grids,
‰ trends;
„ Models:
‰ early models,
‰ PRAM
Physical organization
Radius-based Classification
„ Monoprocessor Machines (MM): mainly representing the SISD
computers available in the mass-market (PCs, workstations).
‰ They also include SIMD machines containing one vector processor.

„ Parallel Machines (PM): built as a single machine containing
several processing units.
‰ They include SIMD and MIMD architectures and potential
combinations of them.
„ Local Clusters (LC): collections of independent computers gathered
in the same place and connected via a local network
‰ Although they are intrinsically MIMD oriented, SIMD machines can be
used at the node level.
„ Distributed Clusters (DC): collections of local clusters scattered all
around the world and linked together via the Internet.
‰ Those systems mainly follow the MIMD model but they can include
SIMD parts.
Links between Flynn’s and radius-based
classification
Examples of recent architectures
„ Intel Pentium D
‰ introduced in 2005,
‰ Intel’s first dual-core processor;
‰ cores have their own caches and access the common memory via the frontside bus;
‰ limited memory bandwidth in the case of memory-intensive computations.
‰ its long pipelines allow for high clock frequencies (at the time of writing up to 3.73 GHz with
the Pentium D 965), but may cause poor performance in the case of branches.
‰ is not dual-CPU-capable. This capability is reserved for the rather expensive Xeon CPU.
„ Intel Core 2 Duo
‰ successor to the Pentium D
‰ It abandons high clock frequencies in favour of more efficient computation.
‰ Like the Pentium D, it uses the frontside bus for memory access by both CPUs.
„ AMD Athlon 64 X2 & Opteron
‰ AMD’s dual-core CPUs: Athlon 64 X2 (single-CPU only) and
‰ Opteron (depending on model, up to 8 CPUs in one system possible)
‰ very popular CPUs for Linux clusters.
‰ Each core has its own HyperTransport channel for memory access, making these CPUs well
suited for memory-intensive applications.
„ IBM pSeries
‰ IBM’s server- and workstation line based on the POWER processor.
‰ The newer POWER processors are multi-core designs and feature large caches. IBM builds
shared memory systems with up to 32 CPUs.
Examples of recent architectures
„ IBM BlueGene
‰ MPP (massively parallel processing) architecture by IBM.
‰ It uses rather slow 700 MHz PowerPC processors.
‰ these processors form very large, highly integrated distributed memory
systems, with fast communication networks (a 3D-Torus, like the Cray T3E).
‰ BlueGene/L consists of 131,072 CPUs, and delivers a performance of up to
360 TeraFLOPS.
„ NEC SX-8
‰ one of the few vector supercomputers in production at the moment.
‰ It performs vector operations at a speed of 2 GHz, with eight operations per
clock cycle.
‰ One SX-8 node consists of eight CPUs, up to 512 nodes can be connected.
„ Cray XT3
‰ a massively-parallel system using AMD’s Opteron CPU.
„ SGI Altix 3700
‰ ccNUMA system using Intel’s Itanium 2 processor.
‰ Itanium 2 has large caches and good floating point performance.
‰ being ccNUMA, the Altix 3700 is easy to program.
Local Clusters
„ A practical way of building efficient low cost distributed memory MIMDs.
„ Clusters made high-performance parallel computing available to those with
much smaller budgets.
„ The idea is to combine commodity-off-the-shelf (COTS) components to
create a parallel computer.
„ Example: PCs running the Linux operating system.
„ Communication network connecting the PCs together may vary from Gigabit
Ethernet to Myrinet and Infiniband, which can broadcast messages at a rate of
several Gigabits per second (Gb/s).
‰ Gigabit Ethernet has ~ 100 MB/s & is cheaper than Myrinet, but the latency (travel
time of a data package) is higher: ~100 µs for Gigabit Ethernet vs. 10 - 20 µs for Myrinet.
„ At a clock speed of 2 GHz, one cycle takes 0.5 ns. A latency of 10 µs amounts to
20,000 cycles of travel time before the data package reaches its target.
„ Possibility of combining higher-end shared-memory PCs & servers into clusters.
„ Limitations:
‰ Low-throughput switches can result in imbalanced systems & become a major
performance bottleneck,
„ especially when more powerful nodes are used in the cluster.
The Beowulf project: the first PC cluster
„ The first PC cluster was designed in 1994 at NASA Goddard Space Flight
Center to achieve one Gigaflop.
‰ 16 PCs were connected together using a standard Ethernet network.
‰ Each PC had an Intel 486 microproc with sustained performance of ~ 70 Mflops.
‰ Was built for only $40,000 compared to $1 million, which was the cost for a
commercial equivalent supercomputer at that time.
‰ Named Beowulf after the hero of medieval times who defeated the giant Grendel.
„ In 1997 researchers at the Oak Ridge National Laboratory built a Beowulf
cluster from many obsolete PCs of various types;
‰ for example, in one version it included 75 PCs with Intel 486 microprocessors, 53
Intel Pentium PCs and five fast Alpha workstations.
‰ simulations producing detailed national maps of ecoregions based on almost 100
million degrees of freedom.
„ When the computers of clusters are PCs running the Linux operating
system, these clusters are called Beowulf clusters.
„ combination of several desktop computers - known as a network of
workstations (NOW) or cluster of workstations (COW)
„ vendor solution: consists of providing specific integration facilities (racks and
cabinets) with an optimized network and software environment.
Advantages & disadvantages
„ Adv:
‰ Flexibility: processing units, network, RAM can be
easily added to or removed from the system
‰ Cheaper than closed parallel machines

„ Dis:
‰ interconnection network which is often
far slower than the fully integrated ones
in parallel machines.
‰ the connection of each node to the
network is done through the connection
bus of that node which often has
restricted bandwidths and/or latencies
Distributed Clusters - Grids
„ the cluster idea has been
expanded to connecting computers
all over the world via the Internet
„ a logical result of
‰ the great improvements in distant
networks during the last few years
‰ the stronger and stronger demand
for more and more powerful
computational systems.
„ Adv:
‰ gathering very large no. of
machines => larger computational
power & larger memory capacity.
„ Drawbacks:
‰ communications between clusters
are generally much slower than
those inside local clusters
‰ security issues
Trends of used configurations
„ Hierarchical parallel systems, mixing shared and distributed memory
‰ E.g. IBM BlueGene

„ Some vendors continue to develop specific processors to put in
supercomputers, such as vector, massively multi-threaded or VLIW
(Very Long Instruction Word) processors.
„ Re-use the parallel concepts usually taking place at the processor
level at the scale of small groups of processors in order to design yet
more powerful virtual processors.
‰ E.g. IBM’s Virtual Vector Architecture and Cell processor projects.
Trends of used configurations
„ Networks:
‰ Gigabit Ethernet has been intensively used in clusters, but has high
latencies => other networks: SCI, Infiniband or Myrinet.
‰ Bandwidths on the order of Gb/s.
‰ Latencies on the order of a microsecond.
‰ Large flexibility in the possible topologies.

„ Inclusion of heterogeneity at different levels of the parallel
architectures.
‰ Vendors, such as Cray or SGI, working on systems combining
several kinds of processors (vector, scalar)
‰ the networks are already heterogeneous in all the hierarchical
architectures.
„ The frontiers between the different parts of a parallel system are
becoming less and less obvious!
‰ Processors tend to become mini multi-processor systems

‰ Clustering tends to be used at all the levels of multi-layer systems


„ the term cluster alone becomes more and more inaccurate outside
any specific context.
Models of parallel computers
Why?
„ Parallel processors come in many different varieties.
‰ Not possible to discuss all varieties, including their distinguishing
features, strong points within application contexts, and drawbacks.
⇒ we often deal with abstract models of real machines.
„ Benefits of using abstract models:
‰ technology-independent theories and algorithmic techniques that
are applicable to a large number of existing and future machines.
‰ The conceptual simplicity of such models makes the development of
algorithms and the analysis of various trade-offs more manageable.
‰ If automatic translation of these abstract algorithms into efficient
programs for real machines is possible through the use of intelligent
or optimizing compilers, then these models can indeed be
enormously helpful.
„ Disadvantages include
‰ the inability to predict the actual performance accurately and
‰ a tendency to simplify the models too much, so that they no longer
represent any real machine.
Early models – Associative processing AP
„ Associative or content-addressable memories (AMs, CAMs),
‰ allow memory cells to be accessed based on contents rather than their
physical locations within the memory
‰ came in the 1950s when advances in magnetic and cryogenic memory
technologies allowed the construction of reasonably sized prototypes.
„ Based on incorporating simple processing logic into the memory array
⇒ remove the need for transferring large volumes of data through the
limited-bandwidth interface between the memory and the processor (the
von Neumann bottleneck).
„ Early associative memories provided two basic capabilities:
1. masked search, or looking for a particular bit pattern in selected fields of
all memory words and marking those for which a match is indicated, and
2. parallel write, or storing a given bit pattern into selected fields of all
memory words previously marked.
‰ These capabilities + logical operations on mark vectors (e.g., ORing
them together) suffice for the programming of searches or even parallel
arithmetic ops.
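A minimal Python sketch of the two basic capabilities above, simulated in software; the word width, the field encodings and the function names masked_search / parallel_write are illustrative assumptions, not the interface of any actual CAM.

# Software simulation of the two associative-memory primitives described above.
# Words are plain integers; a field is selected by a bit mask.

def masked_search(memory, mask, pattern):
    # Mark every word whose bits under 'mask' equal 'pattern' (masked search).
    return [(word & mask) == (pattern & mask) for word in memory]

def parallel_write(memory, marks, mask, pattern):
    # Store 'pattern' into the masked field of every previously marked word (parallel write).
    return [(word & ~mask) | (pattern & mask) if marked else word
            for word, marked in zip(memory, marks)]

# Example: mark all words whose low byte is 0x2A, then set their high byte to 0x01.
mem = [0x012A, 0x0F2A, 0x0300, 0x002A]
marks = masked_search(mem, 0x00FF, 0x002A)
mem = parallel_write(mem, marks, 0xFF00, 0x0100)
print(marks)                      # [True, True, False, True]
print([hex(w) for w in mem])      # ['0x12a', '0x12a', '0x300', '0x12a']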
AM/AP model
„ Over the past half-century, the AM/AP model has evolved
through the incorporation of additional capabilities, so that it is in
essence converging with SIMD-type array processors.
„ Early examples: Goodyear STARAN processor, commercial
product of the 1970s, whose design was motivated by the
computation-intensive problem of aircraft conflict detection;
‰ O(n2) pairwise checks required to avoid collisions & near misses
for n aircraft in the vicinity of a busy airport.
„ Modern incarnations of this model are seen in processor-in-
memory (PIM) chips,
‰ basically standard DRAM chips with a large no. of very simple
processors added on their data access paths,
‰ and intelligent RAM (IRAM) architectures,
‰ advantages in both performance and power consumption.


Early models–neural networks & cellular automata
„ Neural networks
‰ introduced in the 1950s,

‰ dealt with parallel processing for image understanding applications

‰ the development of perceptrons (a neuron-like device in charge of
processing a single pixel in a digital image) in the 1940s.
‰ in the 1960s a flurry of research activities laid the foundation for the
modern field of neural networks.
‰ introduction of the back propagation learning algorithm put neural
networks on the fast track so that today they are used for the solution of
complex decision problems in a wide class of appls.
„ Cellular automata
‰ a collection of identical finite-state automata that are interconnected,
through their input–output links, in a regular fashion, with the state
transitions of each automaton controlled by its own state, the states of the
neighbors to which it is connected, and its primary inputs, if any.
‰ Systolic arrays, which form the basis of high-performance VLSI-based
designs in some application areas, can be viewed as cellular automata.
‰ Recent years: a resurgence of interest in CA as theoretical models of
massively parallel systems& tools for modeling physical phenomena.
PRAM model - abstraction
„ The theoretical model used for conventional or sequential computers
(SISD class) is known as the random-access machine (RAM)
‰ not to be confused with random-access mem, which has the same acronym
„ The parallel version of RAM, PRAM (pea-ram), constitutes an abstract
model of the class of global-memory parallel processors.
„ The abstraction consists of
‰ ignoring the details of the processor-to-memory interconnection network and
‰ taking the view that each processor can access any memory location in each
machine cycle, independent of what other processors are doing.
„ Example:
‰ PRAM algorithms might involve statements like
“for 0 <= i < p, Processor i adds the contents of memory location 2i + 1 to the
memory location 2i” (different locations accessed by the various processors)
“each processor loads the contents of memory location x into its Register 2”
(the same location accessed by all processors).
„ Problem of multiple processors attempting to write into a common
memory location must be resolved in some way.
‰ various inhibition, priority, or combining schemes can be employed when
concurrent write operations to a common location are attempted.
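A minimal Python sketch of how the first example statement above can be read as one synchronous PRAM step; the variable names and the read-all-then-write-all split are illustrative assumptions used to model the lockstep machine cycle.

# "for 0 <= i < p, Processor i adds the contents of memory location 2i+1
#  to memory location 2i", simulated as one synchronous PRAM step.
p = 4
M = list(range(2 * p))                                  # shared memory: [0, 1, 2, ..., 7]

reads = [(M[2 * i], M[2 * i + 1]) for i in range(p)]    # all reads of the cycle happen first
for i, (a, b) in enumerate(reads):                      # then all writes of the cycle
    M[2 * i] = a + b

print(M)    # [1, 1, 5, 3, 9, 5, 13, 7]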
PRAM – functionality?
„ a single processor is assumed to be active initially.
‰ In each computation step, each active proc can read from and write into
the shared memory and can also activate another processor.
„ the abstract PRAM model can be SIMD or MIMD.
‰ SIMD variant: all procs obey the same instruction in each machine cycle;

‰ because of indexed and indirect (register-based) addressing, they often
execute the operation that is broadcast to them on different data.
„ Processors share a common clock but may execute different instructions
in each cycle
„ the PRAM model is highly theoretical.
‰ If one were to build a physical PRAM, the processor-to-memory
connectivity would have to be realized by an interconnection network
‰ Because memory locations are too numerous to be assigned individual
ports on an interconnection network, blocks of memory locations (or
modules) would have to share a single network port.
„ Each instruction cycle would have to consume O(log p) real time.
PRAM as an ideal model
„ Usually it is assumed that the PRAM has a global
memory of unbounded size that is uniformly accessible
to all processors.
„ Suppose a PRAM is a shared-memory computer with p
processors and a global memory of m words.
‰ The processors are connected to the memory through
a set of switches.
‰ These switches determine the memory word being
accessed by each processor.
‰ Each of the p processors in the ensemble can access
any of the memory words, provided that a word is not
accessed by more than one processor simultaneously.
‰ To ensure such connectivity, the total number of
switches must be Θ(mp).
„ For a reasonable memory size, constructing a switching
network of this complexity is very expensive.
⇒ PRAM models of computation are impossible to realize in
practice.
Memory access in PRAM model
„ All the processors have read and write access to a shared global mem.
‰ In the PRAM the access can be simultaneous.

‰ Each of the theoretical processors can access the global shared memory
in one uninterruptible unit of time
‰ Each processor can perform various arithmetic& logical ops in parallel.

„ The PRAM model has both concurrent and exclusive read algorithms.
‰ Concurrent read algorithms are allowed to read the same piece of
memory simultaneously with no data corruption.
‰ Exclusive read algorithms are used to ensure that no two processors ever
read the same memory location at the same time.
„ The PRAM model also has both concurrent and exclusive write algs.
‰ Concurrent write algorithms allow multiple processors to write to the
same memory location simultaneously.
‰ Exclusive write algorithms ensure that no two processors write to the
same memory location at the same time.
Submodels of PRAM
1. Exclusive-read, exclusive-write (EREW) PRAM.
‰ Access to a memory location is exclusive.
‰ No concurrent read or write operations are allowed.
‰ The weakest PRAM model: minimum concurrency in memory access.
‰ The most realistic of the four submodels
2. Concurrent-read, exclusive-write (CREW) PRAM.
‰ Multiple read accesses to a memory location are allowed.
‰ Multiple write accesses to a memory location are serialized.
‰ Default submodel: assumed when nothing is said about the submodel,
3. Exclusive-read, concurrent-write (ERCW) PRAM.
‰ Multiple write accesses are allowed to a memory location,
‰ Multiple read accesses are serialized.
4. Concurrent-read, concurrent-write (CRCW) PRAM.
‰ Multiple read and write accesses to a common memory location.
‰ This is the most powerful PRAM model.
‰ The least restrictive submodel,
‰ Requires a conflict resolution mechanism to define the result of concurrent writes
Protocols for concurrent write
„ Common:
‰ the concurrent write is allowed if all the values that the
processors are attempting to write are identical.
„ Arbitrary:
‰ an arbitrary processor is allowed to proceed with the write
operation and the rest fail.
„ Priority:
‰ all processors are organized into a predefined prioritized list, and
the processor with the highest priority succeeds and the rest fail.
„ Sum:
‰ the sum of all the quantities is written (the sum-based write
conflict resolution model can be extended to any associative
operator defined on the quantities being written).
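A possible software rendering of these four resolution protocols, in Python; the function name resolve and the (processor id, value) request representation are illustrative assumptions.

# Resolve one cycle's worth of conflicting writes to a single location.
def resolve(requests, protocol):
    values = [v for _, v in requests]
    if protocol == "common":
        # allowed only if all processors attempt to write the same value
        if len(set(values)) != 1:
            raise ValueError("common-write conflict: values differ")
        return values[0]
    if protocol == "arbitrary":
        return values[0]                 # any one writer may succeed
    if protocol == "priority":
        return min(requests)[1]          # lowest processor index = highest priority
    if protocol == "sum":
        return sum(values)               # any associative operator would do
    raise ValueError("unknown protocol")

requests = [(3, 5), (0, 2), (7, 5)]      # (processor id, value) pairs
print(resolve(requests, "priority"))     # 2  (processor 0 wins)
print(resolve(requests, "sum"))          # 12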
CRCW PRAM is further classified
„ Undefined (CRCW-U) :
‰ In case of multiple writes, the value written is undefined.

„ Detecting (CRCW-D):
‰ A special code representing “detected collision” is written.

„ Common (CRCW-C):
‰ Multiple writes allowed only if all store the same value.

‰ This is sometimes called the consistent-write submodel.

„ Random (CRCW-R):
‰ The value written is randomly chosen from among those offered.

‰ This is sometimes called the arbitrary-write submodel.

„ Priority (CRCW-P):
‰ The processor with the lowest index succeeds in writing its value.

„ Max/Min (CRCW-M):
‰ The largest/smallest of the multiple values is written.

„ Reduction:
‰ The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR
(CRCW-X), or some other combination of the multiple values is written.
Order the submodels
„ One way to order these submodels is by their computational power.
„ Two PRAM submodels are equally powerful if each can emulate the other
with a constant-factor slowdown.
„ A PRAM submodel is (strictly) less powerful than another submodel
(denoted by the “<” symbol) if there exist problems for which the former
requires significantly more computational steps than the latter.
„ Example:
‰ CRCW-D PRAM submodel < CRCW-M (max),
„ CRCW-M can find the largest number in a vector A of size p in a single step:
‰ Processor i reads A[i] and writes it to an agreed-upon location x, which will then hold
the maximum value for all processors to see
„ CRCW-D needs at least O(log p) steps.
„ The “less powerful or equal” relationship “≤” between submodels can be
similarly defined.
„ EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
„ EREW can simulate any CRCW submodel with at most logarithmic slowdown.
‰ A p-processor CRCW-P (priority) PRAM can be simulated by a p-
processor EREW PRAM with a slowdown factor of O(log p).
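For intuition, a small Python sketch contrasting the single-step CRCW-M maximum from the example above with an EREW-style pairwise tournament that needs ceil(log2 p) rounds; the loop structure and names are illustrative, not a formal simulation of the general slowdown result.

import math

A = [7, 3, 9, 1, 4, 9, 2, 6]          # one value per processor, p = 8
p = len(A)

x = max(A)                             # CRCW-M: one concurrent write resolves to the maximum

B = A[:]                               # EREW: pairwise tournament with exclusive reads/writes
s = 1
rounds = 0
while s < p:                           # each round, a processor combines B[j] and B[j + s]
    for j in range(0, p - s, 2 * s):
        B[j] = max(B[j], B[j + s])
    s *= 2
    rounds += 1

assert B[0] == x and rounds == math.ceil(math.log2(p))   # 3 rounds for p = 8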
Alg.1: Data broadcasting – how?
„ One processor needs to send a data value to all other processors.
‰ In the CREW/CRCW models, broadcasting is trivial: the sending proc writes the data value into a
memory location, with all processors reading that data value in the following cycle.
„ => simple broadcasting is done in O(1) steps.
„ Multicasting within groups is equally simple if each processor knows its group
membership(s) & only members of each group read the multicast data
‰ All-to-all broadcasting: each of the p procs needs to send a data value to all other
processors - can be done through p separate broadcast operations in O(p) steps.
„ The above scheme is clearly inapplicable to broadcasting in the EREW model.
„ The simplest scheme for EREW broadcasting is
‰ make p copies of the data value, say in a broadcast vector B of length p,
‰ and then let each processor read its own copy by accessing B[j].
‰ Initially, Processor i writes its data value into B[0].
‰ Recursive doubling is used to copy B[0] into all elements of B in O(log2 p) steps.
‰ Finally, Processor j, 0 ≤ j < p, reads B[j] to get the data value broadcast by Processor i.

Making p copies of B[0] by recursive doubling

for k = 0 to log2 p – 1 Processor j, 0 ≤ j < p, do
    Copy B[j] into B[j + 2^k]
endfor
Alg.1: Data broadcasting in PRAM
„ Note: in Step k of the above recursive doubling process, only the
first 2^k processors need to be active.
„ The complete EREW broadcast algorithm with this provision
EREW PRAM algorithm for broadcasting by Processor i

Processor i write the data value into B[0]
s := 1
while s < p Processor j, 0 ≤ j < min(s, p – s), do
    Copy B[j] into B[j + s]
    s := 2s
endwhile
Processor j, 0 ≤ j < p, read the data value in B[j]

„ The parameter s can be interpreted as the “span” of elements
already modified or the “step” for the copying operation.
„ The O(log p)-step broadcasting alg. is optimal for the EREW PRAM.
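A software rendering of the broadcast algorithm above, in Python; each pass of the while loop stands for one synchronous PRAM step, and the function name erew_broadcast is an illustrative assumption.

# Recursive doubling over the broadcast vector B, as in the pseudocode above.
def erew_broadcast(value, p):
    B = [None] * p
    B[0] = value
    s = 1
    while s < p:
        for j in range(min(s, p - s)):       # active processors in this step
            B[j + s] = B[j]                  # exclusive read of B[j], exclusive write of B[j + s]
        s *= 2
    return B                                 # every processor now reads its own B[j]

print(erew_broadcast(42, 5))                 # [42, 42, 42, 42, 42] after ceil(log2 5) = 3 steps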
Alg.2: All-to-all broadcasting in PRAM
„ To perform all-to-all broadcasting,
‰ each processor broadcasts a value that it holds to each of the other p –
1 processors
‰ Processor j writes its value into B[j], rather than into B[0].
‰ In one memory access step, all of the values to be broadcast are
written into the broadcast vector B.
‰ Each processor then reads the other p – 1 values in p – 1 memory
accesses.
‰ To ensure that all reads are exclusive: Processor j begins reading the values
starting with B[j + 1], wrapping around to B[0] after reading B[p – 1].

EREW PRAM data broadcasting without redundant copying.

EREW PRAM algorithm for all-to-all broadcasting

Processor j, 0 ≤ j < p, write own data value into B[j]
for k = 1 to p – 1 Processor j, 0 ≤ j < p, do
    Read the data value in B[(j + k) mod p]
endfor
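A corresponding Python sketch of the all-to-all algorithm; the sequential loops only emulate the parallel steps, and the names erew_all_to_all / received are illustrative.

# In round k, processor j reads B[(j + k) mod p], so in every round the p reads
# hit p distinct locations and stay exclusive.
def erew_all_to_all(values):
    p = len(values)
    B = list(values)                           # step 0: processor j writes its value into B[j]
    received = [[B[j]] for j in range(p)]
    for k in range(1, p):                      # p - 1 exclusive read steps
        for j in range(p):
            received[j].append(B[(j + k) % p])
    return received

print(erew_all_to_all([10, 20, 30]))
# processor 0 sees [10, 20, 30], processor 1 sees [20, 30, 10], processor 2 sees [30, 10, 20]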
Alg. 3: Naïve sorting algorithm
„ Given a data vector S of length p,
„ Let Proc j compute the rank R[j] of the data element S[j] and then store S[j] into S[R[j]].
„ The rank R[j] of S[j]
‰ = total no. data elements that are smaller than S[j],
‰ computed by each proc examining all other data elements & counting the no. of elements S[l] < S[j].
„ Each data element must be given a unique rank => ties broken by using the proc ID.
‰ If Processors i and j (i < j) hold equal data values, the value in Processor i is “smaller” for
ranking purposes.

Naive EREW PRAM sorting algorithm using all-to-all broadcasting

Processor j, 0 ≤ j < p, write 0 into R[j]
for k = 1 to p – 1 Processor j, 0 ≤ j < p, do
    l := (j + k) mod p
    if S[l] < S[j] or (S[l] = S[j] and l < j)
    then R[j] := R[j] + 1
    endif
endfor
Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

„ Not optimal in that the O(p²) computational work involved in it is significantly greater
than the O(p log p) work required for sorting p elements on a single processor.
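A Python sketch of this naive rank-based sort; the nested loops emulate the p – 1 parallel rounds, and the function name naive_pram_sort is an illustrative assumption.

# Processor j counts the elements ranking below S[j] (ties broken by index),
# then writes S[j] into position R[j].
def naive_pram_sort(S):
    p = len(S)
    R = [0] * p
    for k in range(1, p):                      # p - 1 all-to-all rounds
        for j in range(p):                     # conceptually, all processors in parallel
            l = (j + k) % p
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                         # final exclusive writes: S[R[j]] := S[j]
        out[R[j]] = S[j]
    return out

print(naive_pram_sort([5, 1, 4, 1, 3]))        # [1, 1, 3, 4, 5]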
Semigroup and prefix comp.
„ Semigroup computation or fan-in computation:
‰ is defined based on an associative binary operator o.
‰ trivial for a CRCW PRAM of the “reduction” variety.
‰ Example: computing the arithmetic sum (logical AND, logical XOR) of p
values, one per processor,
„ trivial for the CRCW-S (CRCW-A, CRCW-X) PRAM;
„ it can be done in a single cycle by each proc writing its corresponding
value into a common location that will then hold the arithmetic sum of
all of the values.
‰ The recursive doubling scheme can be used on an EREW PRAM,
„ the only difference appearing in the final broadcasting step.
„ Parallel prefix computations:
‰ consists of the first phase of the semigroup computation.
‰ The divide-and-conquer paradigm:
„ The problem is composed of two subproblems:
1. computing the odd-indexed results s1, s3, s5, ...
2. computing the even-indexed results s0, s2, s4, ...
„ The first subproblem is solved as follows:
‰ Pairs of consecutive elements in the input list (x0 and x1, x2 and x3,
x4 and x5, and so on) are combined to obtain a list of half the size.
‰ Performing parallel prefix computation on this list yields values for all
odd-indexed results.
„ The even-indexed results are then found in a single PRAM step by
combining each even-indexed input with the immediately preceding
odd-indexed result.
„ The total computation time is given by the recurrence T(p) = T(p/2) + 2,
whose solution is T(p) = 2 log2 p.
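A Python sketch of the divide-and-conquer prefix scheme just described, written recursively for clarity; it assumes p is a power of two and that op is associative, and the function name parallel_prefix is illustrative. Each level of recursion corresponds to the two PRAM steps counted in T(p) = T(p/2) + 2.

def parallel_prefix(x, op):
    p = len(x)
    if p == 1:
        return x[:]
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(p // 2)]   # step 1: combine pairs
    odd = parallel_prefix(pairs, op)                              # prefixes s1, s3, s5, ...
    result = [None] * p
    result[0] = x[0]
    for i in range(p // 2):
        result[2 * i + 1] = odd[i]                                # odd-indexed results
        if 2 * i + 2 < p:
            result[2 * i + 2] = op(odd[i], x[2 * i + 2])          # step 2: even-indexed results
    return result

print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# [1, 3, 6, 10, 15, 21, 28, 36]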
Matrix multiplication
„ Given m x m matrices A and B, with elements a[i,j] and b[i,j],, their
product C can be obtained with a O(m³)-step sequential algorithm.
„ If the PRAM has p = m³ processors, then matrix multiplication can be
done in O(log m) time
‰ one processor computes each a[i,k] x b[k,j] and then groups of m procs
perform m-input additions (semigroup comp.) in O(log m) time.
‰ Because we are usually not interested in parallel processing for matrix
multiplication unless m is fairly large, this is not a practical solution!
„ Assume that the PRAM has p = m² processors.
‰ In this case, matrix multiplication can be done in O(m) time by using
one processor to compute each element c[i,j] of the product matrix C.
‰ Processor responsible for computing c[i,j]
„ reads the elements of Row i in A and the elements of Column j in B,
„ multiplies their corresponding kth elements, and
„ adds each of the products thus obtained to a running total t.
‰ Parallelize the i and j loops in the sequential algorithm.
‰ Label the m² processors with two indices (i, j), each ranging from 0 to
m – 1, rather than with a single index ranging from 0 to m² – 1.
CREW implementation of A x B
PRAM matrix multiplication algorithm using m² processors

Processor (i, j), 0 ≤ i, j < m, do
begin
    t := 0
    for k = 0 to m – 1 do
        t := t + a[i,k] b[k,j]
    endfor
    c[i,j] := t
end
„ In a given iteration of the k loop,
‰ all processors (i, y), 0 ≤ y < m, access the same element a[i,k] of A,
‰ all processors (x, j), 0 ≤ x < m, access the same element b[k,j] of B.
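A sequential Python sketch of the m²-processor algorithm above; the two outer loops emulate the (i, j) processors that would run in parallel, and the function name crew_matmul is an illustrative assumption.

# One "processor" per output element c[i,j]; the k loop is the O(m) parallel time.
# In the PRAM, the m processors of a row read a[i][k] concurrently, hence CREW.
def crew_matmul(a, b):
    m = len(a)
    c = [[0] * m for _ in range(m)]
    for i in range(m):                 # conceptually, all (i, j) pairs run in parallel
        for j in range(m):
            t = 0
            for k in range(m):
                t += a[i][k] * b[k][j]
            c[i][j] = t
    return c

print(crew_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]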


A x B using m processors
„ Matrix multiplication can be done in O(m²) time
‰ Processor i to compute the m elements in Row i of the product matrix C in turn.
‰ Processor i will
„ read the elements of Row i in A and the elements of all columns in B,
„ multiply their corresponding kth elements, and
„ add each of the products thus obtained to a running total t.
„ Parallelize the i loop in the sequential algorithm.
PRAM matrix multiplication algorithm using m processors

for j = 0 to m – 1 Processor i, 0 ≤ i < m, do
    t := 0
    for k = 0 to m – 1 do
        t := t + a[i,k] b[k,j]
    endfor
    c[i,j] := t
endfor
„ Each processor reads a different row of A => no concurrent reads are attempted
„ B: all m processors access the same element b[k,j] at the same time.
„ For both p = m² and p = m procs: efficient algorithms with linear speed-ups.
A x B using less than m processors
„ Naïve alg:
‰ We can let Processor i compute a set of m/p rows in the result matrix C;
say Rows i, i + p, i + 2p, . . . , i + (m/p – 1)p.
‰ Parallelizing the i loop in this way is preferable:
„ the k loop has data dependencies,
„ the j loop would imply m synchronizations of the processors, once at the end of each i
iteration, assuming the SPMD model.
‰ Drawback: each element of B is fetched m/p times, with only two
arithmetic operations (+,x) performed for each such element.
„ Block matrix multiplication:
‰ increases the computation to memory access ratio

‰ divide the m x m matrices A, B, and C into p blocks of size q x q, where q = m/√p.

‰ multiply the m x m matrices block-wise, where the
terms in the algorithm statement t := t + a[i,k]b[k,j] are now q x q matrices.
‰ Processor (i, j) computes Block (i, j) of the result matrix C.

‰ Each multiply–add computation on q x q blocks needs 2q² = 2m²/p
memory accesses to read the blocks and 2q³ arithmetic operations.
‰ q arithmetic operations are performed for each memory access and better
performance will be achieved as a result of improved locality.
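A Python sketch of this block scheme; the helper names (block, add_mul, block_matmul) and the assumption that √p divides m are illustrative, and the loops over (i, j) stand for the p processors that each own one block of C.

def block(M, bi, bj, q):
    # Extract the q x q block at block coordinates (bi, bj).
    return [row[bj*q:(bj+1)*q] for row in M[bi*q:(bi+1)*q]]

def add_mul(t, A, B, q):
    # t := t + A x B for q x q blocks, in place.
    for r in range(q):
        for c in range(q):
            t[r][c] += sum(A[r][k] * B[k][c] for k in range(q))

def block_matmul(A, B, p):
    m = len(A)
    g = int(round(p ** 0.5))               # sqrt(p) blocks per dimension
    q = m // g
    C = [[0] * m for _ in range(m)]
    for i in range(g):                     # conceptually, processor (i, j) owns block (i, j) of C
        for j in range(g):
            t = [[0] * q for _ in range(q)]
            for k in range(g):
                add_mul(t, block(A, i, k, q), block(B, k, j, q), q)
            for r in range(q):             # write the finished block back into C
                for c in range(q):
                    C[i*q + r][j*q + c] = t[r][c]
    return C

I4 = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
A = [[r * 4 + c for c in range(4)] for r in range(4)]
print(block_matmul(A, I4, 4) == A)          # True: multiplying by the identity returns A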
