Introduction to Parallel Processing Algorithms and Architectures 1st Edition by Behrooz Parhami ISBN 9780306469640 0306469642 download
Introduction to
Parallel Processing
Algorithms and Architectures
PLENUM SERIES IN COMPUTER SCIENCE
Series Editor: Rami G. Melhem
University of Pittsburgh
Pittsburgh, Pennsylvania
FUNDAMENTALS OF X PROGRAMMING
Graphical User Interfaces and Beyond
Theo Pavlidis
INTRODUCTION TO PARALLEL PROCESSING
Algorithms and Architectures
Behrooz Parhami
Introduction to
Parallel Processing
Algorithms and Architectures
Behrooz Parhami
University of California at Santa Barbara
Santa Barbara, California
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher
The field of digital computer architecture has grown explosively in the past two decades.
Through a steady stream of experimental research, tool-building efforts, and theoretical
studies, the design of an instruction-set architecture, once considered an art, has been
transformed into one of the most quantitative branches of computer technology. At the same
time, better understanding of various forms of concurrency, from standard pipelining to
massive parallelism, and invention of architectural structures to support a reasonably efficient
and user-friendly programming model for such systems, have allowed hardware performance
to continue its exponential growth. This trend is expected to continue in the near future.
This explosive growth, linked with the expectation that performance will continue its
exponential rise with each new generation of hardware and that (in stark contrast to software)
computer hardware will function correctly as soon as it comes off the assembly line, has its
down side. It has led to unprecedented hardware complexity and almost intolerable devel-
opment costs. The challenge facing current and future computer designers is to institute
simplicity where we now have complexity; to use fundamental theories being developed in
this area to gain performance and ease-of-use benefits from simpler circuits; to understand
the interplay between technological capabilities and limitations, on the one hand, and design
decisions based on user and application requirements on the other.
In computer designers’ quest for user-friendliness, compactness, simplicity, high per-
formance, low cost, and low power, parallel processing plays a key role. High-performance
uniprocessors are becoming increasingly complex, expensive, and power-hungry. A basic
trade-off thus exists between the use of one or a small number of such complex processors,
at one extreme, and a moderate to very large number of simpler processors, at the other.
When combined with a high-bandwidth, but logically simple, interprocessor communication
facility, the latter approach leads to significant simplification of the design process. However,
two major roadblocks have thus far prevented the widespread adoption of such moderately
to massively parallel architectures: the interprocessor communication bottleneck and the
difficulty, and thus high cost, of algorithm/software development.
The above context is changing because of several factors. First, at very high clock rates,
the link between the processor and memory becomes very critical. CPUs can no longer be
designed and verified in isolation. Rather, an integrated processor/memory design optimiza-
tion is required, which makes the development even more complex and costly. VLSI
technology now allows us to put more transistors on a chip than required by even the most
advanced superscalar processor. The bulk of these transistors are now being used to provide
additional on-chip memory. However, they can just as easily be used to build multiple
processors on a single chip. Emergence of multiple-processor microchips, along with
currently available methods for glueless combination of several chips into a larger system
and maturing standards for parallel machine models, holds the promise for making parallel
processing more practical.
This is the reason parallel processing occupies such a prominent place in computer
architecture education and research. New parallel architectures appear with amazing regu-
larity in technical publications, while older architectures are studied and analyzed in novel
and insightful ways. The wealth of published theoretical and practical results on parallel
architectures and algorithms is truly awe-inspiring. The emergence of standard programming
and communication models has removed some of the concerns with compatibility and
software design issues in parallel processing, thus resulting in new designs and products with
mass-market appeal. Given the computation-intensive nature of many application areas (such
as encryption, physical modeling, and multimedia), parallel processing will continue to
thrive for years to come.
Perhaps, as parallel processing matures further, it will start to become invisible. Packing
many processors in a computer might constitute as much a part of a future computer
architect’s toolbox as pipelining, cache memories, and multiple instruction issue do today.
In this scenario, even though the multiplicity of processors will not affect the end user or
even the professional programmer (other than of course boosting the system performance),
the number might be mentioned in sales literature to lure customers in the same way that
clock frequency and cache size are now used. The challenge will then shift from making
parallel processing work to incorporating a larger number of processors, more economically
and in a truly seamless fashion.
The field of parallel processing has matured to the point that scores of texts and reference
books have been published. Some of these books that cover parallel processing in general
(as opposed to some special aspects of the field or advanced/unconventional parallel systems)
are listed at the end of this preface. Each of these books has its unique strengths and has
contributed to the formation and fruition of the field. The current text, Introduction to Parallel
Processing: Algorithms and Architectures, is an outgrowth of lecture notes that the author
has developed and refined over many years, beginning in the mid-1980s. Here are the most
important features of this text in comparison to the listed books:
the notation and terminology from the reference source. Such an approach has the
advantage of making the transition between reading the text and the original
reference source easier, but it is utterly confusing to the majority of the students
who rely on the text and do not consult the original references except, perhaps, to
write a research paper.
SUMMARY OF TOPICS
The six parts of this book, each composed of four chapters, have been written with the
following goals:
Part I sets the stage, gives a taste of what is to come, and provides the needed
perspective, taxonomy, and analysis tools for the rest of the book.
Part II delimits the models of parallel processing from above (the abstract PRAM
model) and from below (the concrete circuit model), preparing the reader for everything
else that falls in the middle.
Part III presents the scalable, and conceptually simple, mesh model of parallel process-
ing, which has become quite important in recent years, and also covers some of its
derivatives.
Part IV covers low-diameter parallel architectures and their algorithms, including the
hypercube, hypercube derivatives, and a host of other interesting interconnection
topologies.
Part V includes broad (architecture-independent) topics that are relevant to a wide range
of systems and form the stepping stones to effective and reliable parallel processing.
Part VI deals with implementation aspects and properties of various classes of parallel
processors, presenting many case studies and projecting a view of the past and future
of the field.
For classroom use, the topics in each chapter of this text can be covered in a lecture
spanning 1–2 hours. In my own teaching, I have used the chapters primarily for 1-1/2-hour
lectures, twice a week, in a 10-week quarter, omitting or combining some chapters to fit the
material into 18–20 lectures. But the modular structure of the text lends itself to other lecture
formats, self-study, or review of the field by practitioners. In the latter two cases, the readers
can view each chapter as a study unit (for 1 week, say) rather than as a lecture. Ideally, all
topics in each chapter should be covered before moving to the next chapter. However, if fewer
lecture hours are available, then some of the subsections located at the end of chapters can
be omitted or introduced only in terms of motivations and key results.
Problems of varying complexities, from straightforward numerical examples or exercises
to more demanding studies or miniprojects, have been supplied for each chapter. These problems
form an integral part of the book and have not been added as afterthoughts to make the book
more attractive for use as a text. A total of 358 problems are included (13–16 per chapter).
Assuming that two lectures are given per week, either weekly or biweekly homework can
be assigned, with each assignment having the specific coverage of the respective half-part
(two chapters) or full part (four chapters) as its “title.” In this format, the half-parts, shown
above, provide a focus for the weekly lecture and/or homework schedule.
An instructor’s manual, with problem solutions and enlarged versions of the diagrams
and tables, suitable for reproduction as transparencies, is planned. The author’s detailed
syllabus for the course ECE 254B at UCSB is available at
http://www.ece.ucsb.edu/courses/syllabi/ece254b.html.
References to important or state-of-the-art research contributions and designs are
provided at the end of each chapter. These references provide good starting points for doing
in-depth studies or for preparing term papers/projects.
New ideas in the field of parallel processing appear in papers presented at several annual
conferences, known as FMPC, ICPP, IPPS, SPAA, SPDP (now merged with IPPS), and in
archival journals such as IEEE Transactions on Computers [TCom], IEEE Transactions on
Parallel and Distributed Systems [TPDS], Journal of Parallel and Distributed Computing
[JPDC], Parallel Computing [ParC], and Parallel Processing Letters [PPL]. Tutorial and
survey papers of wide scope appear in IEEE Concurrency [Conc] and, occasionally, in IEEE
Computer [Comp]. The articles in IEEE Computer provide excellent starting points for
research projects and term papers.
ACKNOWLEDGMENTS
GENERAL REFERENCES
[Akl89] Akl, S. G., The Design and Analysis of Parallel Algorithms, Prentice–Hall, 1989.
[Akl97] Akl, S. G., Parallel Computation: Models and Methods, Prentice–Hall, 1997.
[Alma94] Almasi, G. S., and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 2nd ed., 1994.
[Bert89] Bertsekas, D. P., and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods,
Prentice–Hall, 1989.
[Code93] Codenotti, B., and M. Leoncini, Introduction to Parallel Processing, Addison–Wesley, 1993.
[Comp] IEEE Computer, journal published by IEEE Computer Society: has occasional special issues on
parallel/distributed processing (February 1982, June 1985, August 1986, June 1987, March 1988,
August 1991, February 1992, November 1994, November 1995, December 1996).
[Conc] IEEE Concurrency, formerly IEEE Parallel and Distributed Technology, magazine published by
IEEE Computer Society.
[Cric88] Crichlow, J. M., Introduction to Distributed and Parallel Computing, Prentice–Hall, 1988.
[DeCe89] DeCegama, A. L., Parallel Processing Architectures and VLSI Hardware, Prentice–Hall, 1989.
[Desr87] Desrochers, G. R., Principles of Parallel and Multiprocessing, McGraw-Hill, 1987.
[Duat97] Duato, J., S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, IEEE
Computer Society Press, 1997.
[Flyn95] Flynn, M. J., Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett,
1995.
[FMPC] Proc. Symp. Frontiers of Massively Parallel Computation, sponsored by IEEE Computer Society and
NASA. Held every 1 1/2–2 years since 1986. The 6th FMPC was held in Annapolis, MD, October
27–31, 1996, and the 7th is planned for February 20–25, 1999.
[Foun94] Fountain, T. J., Parallel Computing: Principles and Practice, Cambridge University Press, 1994.
[Hock81] Hockney, R. W., and C. R. Jesshope, Parallel Computers, Adam Hilger, 1981.
[Hord90] Hord, R. M., Parallel Supercomputing in SIMD Architectures, CRC Press, 1990.
[Hord93] Hord, R. M., Parallel Supercomputing in MIMD Architectures, CRC Press, 1993.
[Hwan84] Hwang, K., and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
[Hwan93] Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-
Hill, 1993.
[Hwan98] Hwang, K., and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming,
McGraw-Hill, 1998.
[ICPP] Proc. Int. Conference Parallel Processing, sponsored by The Ohio State University (and in recent
years, also by the International Association for Computers and Communications). Held annually since
1972.*
[IPPS] Proc. Int. Parallel Processing Symp., sponsored by IEEE Computer Society. Held annually since
1987. The 11th IPPS was held in Geneva, Switzerland, April 1–5, 1997. Beginning with the 1998
symposium in Orlando, FL, March 30–April 3, IPPS was merged with SPDP. **
[JaJa92] JaJa, J., An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
[JPDC] Journal of Parallel and Distributed Computing, journal published by Academic Press.
[Kris89] Krishnamurthy, E. V., Parallel Processing: Principles and Practice, Addison–Wesley, 1989.
[Kuma94] Kumar, V., A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and
Analysis of Algorithms, Benjamin/Cummings, 1994.
[Laks90] Lakshmivarahan, S., and S. K. Dhall, Analysis and Design of Parallel Algorithms: Arithmetic and
Matrix Problems, McGraw-Hill, 1990.
[Leig92] Leighton, F. T., Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes,
Morgan Kaufmann, 1992.
[Lerm94] Lerman, G., and L. Rudolph, Parallel Evolution of Parallel Processors, Plenum, 1994.
[Lipo87] Lipovski, G. J., and M. Malek, Parallel Computing: Theory and Comparisons, Wiley, 1987.
[Mold93] Moldovan, D. I., Parallel Processing: From Applications to Systems, Morgan Kaufmann, 1993.
[ParC] Parallel Computing, journal published by North-Holland.
[PPL] Parallel Processing Letters, journal published by World Scientific.
[Quin87] Quinn, M. J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, 1987.
[Quin94] Quinn, M. J., Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
[Reif93] Reif, J. H. (ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993.
[Sanz89] Sanz, J. L. C. (ed.), Opportunities and Constraints of Parallel Computing (IBM/NSF Workshop, San
Jose, CA, December 1988), Springer-Verlag, 1989.
[Shar87] Sharp, J. A., An Introduction to Distributed and Parallel Processing, Blackwell Scientific Publica-
tions, 1987.
[Sieg85] Siegel, H. J., Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, 1985.
[SPAA] Proc. Symp. Parallel Algorithms and Architectures, sponsored by the Association for Computing
Machinery (ACM). Held annually since 1989. The 10th SPAA was held in Puerto Vallarta, Mexico,
June 28–July 2, 1998.
[SPDP] Proc. Int. Symp. Parallel and Distributed Processing, sponsored by IEEE Computer Society. Held
annually since 1989, except for 1997. The 8th SPDP was held in New Orleans, LA, October 23–26,
1996. Beginning with the 1998 symposium in Orlando, FL, March 30–April 3, SPDP was merged
with IPPS.
[Ston93] Stone, H. S., High-Performance Computer Architecture, Addison–Wesley, 1993.
[TCom] IEEE Trans. Computers, journal published by IEEE Computer Society; has occasional special issues
on parallel and distributed processing (April 1987, December 1988, August 1989, December 1991,
April 1997, April 1998).
[TPDS] IEEE Trans. Parallel and Distributed Systems, journal published by IEEE Computer Society.
[Varm94] Varma, A., and C. S. Raghavendra, Interconnection Networks for Multiprocessors and Multicomput-
ers: Theory and Practice, IEEE Computer Society Press, 1994.
[Zoma96] Zomaya, A. Y. (ed.), Parallel and Distributed Computing Handbook, McGraw-Hill, 1996.
*The 27th ICPP was held in Minneapolis, MN, August 10–15, 1998, and the 28th is scheduled for September
21–24, 1999, in Aizu, Japan.
**The next joint IPPS/SPDP is scheduled for April 12–16, 1999, in San Juan, Puerto Rico.
Contents
1. Introduction to Parallelism
1.1. Why Parallel Processing?
1.2. A Motivating Example
1.3. Parallel Processing Ups and Downs
1.4. Types of Parallelism: A Taxonomy
1.5. Roadblocks to Parallel Processing
1.6. Effectiveness of Parallel Processing
Problems
References and Suggested Reading
Problems
References and Suggested Reading
Index
I
Fundamental Concepts
1
Introduction to Parallelism
This chapter sets the context in which the material in the rest of the book will
be presented and reviews some of the challenges facing the designers and users
of parallel computers. The chapter ends with the introduction of useful metrics
for evaluating the effectiveness of parallel systems. Chapter topics are
1. Increase in complexity (related both to higher device density and to larger size) of
VLSI chips, projected to rise to around 10 M transistors per chip for microproces-
sors, and 1B for dynamic random-access memories (DRAMs), by the year 2000
[SIA94]
2. Introduction of, and improvements in, architectural features such as on-chip cache
memories, large instruction buffers, multiple instruction issue per cycle, multi-
threading, deep pipelines, out-of-order instruction execution, and branch prediction
Moore’s law was originally formulated in 1965 in terms of the doubling of chip complexity
every year (later revised to every 18 months) based only on a small number of data points
[Scha97]. Moore’s revised prediction matches almost perfectly the actual increases in the
number of transistors in DRAM and microprocessor chips.
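To get a feel for what doubling every 18 months implies, the following Python sketch computes the cumulative growth factor over a span of years. The function name `growth_factor` and its parameters are illustrative assumptions, not notation from the text.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Cumulative growth when a quantity doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)

# Growth implied by the revised (18-month) and original (12-month) formulations.
for span in (1.5, 10, 20):
    revised = growth_factor(span)
    original = growth_factor(span, doubling_period_years=1.0)
    print(f"{span:>4} years: x{revised:,.0f} (18-month), x{original:,.0f} (12-month)")
```

Under the 18-month assumption, a decade corresponds to roughly a hundredfold increase and two decades to roughly a ten-thousandfold increase.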
Moore’s law seems to hold regardless of how one measures processor performance:
counting the number of executed instructions per second (IPS), counting the number of
floating-point operations per second (FLOPS), or using sophisticated benchmark suites
that attempt to measure the processor's performance on real applications. This is because
all of these measures, though numerically different, tend to rise at roughly the same rate.
Figure 1.1 shows that the performance of actual processors has in fact followed Moore’s
law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10⁹
IPS) milestone.
Even though it is expected that Moore's law will continue to hold for the near future,
there is a limit that will eventually be reached. That some previous predictions about when
the limit will be reached have proven wrong does not alter the fact that a limit, dictated by
physical laws, does exist. The most easily understood physical limit is that imposed by the
finite speed of signal propagation along a wire. This is sometimes referred to as the
speed-of-light argument (or limit), explained as follows.
The Speed-of-Light Argument. The speed of light is about 30 cm/ns. Signals travel
on a wire at a fraction of the speed of light. If the chip diameter is 3 cm, say, any computation
that involves signal transmission from one end of the chip to another cannot be executed
faster than 10¹⁰ times per second. Reducing distances by a factor of 10 or even 100 will only
increase the limit by these factors; we still cannot go beyond 10¹² computations per second.
To relate the above limit to the instruction execution rate (MIPS or FLOPS), we need to
estimate the distance that signals must travel within an instruction cycle. This is not easy to
do, given the extensive use of pipelining and memory-latency-hiding techniques in modern
high-performance processors. Despite this difficulty, it should be clear that we are in fact not
very far from limits imposed by the speed of signal propagation and several other physical
laws.
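The arithmetic behind the speed-of-light argument can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the `speed_fraction` parameter (the fraction of the speed of light at which a signal travels on a wire) are assumptions introduced here, and the default of 1.0 gives the most optimistic bound.

```python
SPEED_OF_LIGHT_CM_PER_NS = 30.0  # figure quoted in the text

def max_rate_per_second(distance_cm, speed_fraction=1.0):
    """Upper bound on end-to-end signal traversals per second."""
    speed_cm_per_ns = SPEED_OF_LIGHT_CM_PER_NS * speed_fraction
    traversal_time_ns = distance_cm / speed_cm_per_ns
    return 1e9 / traversal_time_ns  # 1e9 nanoseconds in one second

chip_limit = max_rate_per_second(3.0)            # 3 cm chip diameter
shrunk_limit = max_rate_per_second(3.0 / 100.0)  # distances reduced 100-fold
print(f"3 cm chip: about {chip_limit:.0e} per second")
print(f"100x smaller: about {shrunk_limit:.0e} per second")
```

Passing a `speed_fraction` below 1.0 lowers both bounds proportionally, which is why the figures in the text are best read as optimistic ceilings.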
Figure 1.1. The exponential growth of microprocessor performance, known as Moore’s law,
shown over the past two decades.
The speed-of-light argument suggests that once the above limit has been reached, the
only path to improved performance is the use of multiple processors. Of course, the same
argument can be invoked to conclude that any parallel processor will also be limited by the
speed at which the various processors can communicate with each other. However, because
such communication does not have to occur for every low-level computation, the limit is less
serious here. In fact, for many applications, a large number of computation steps can be
performed between two successive communication steps, thus amortizing the communica-
tion overhead.
Here is another way to show the need for parallel processing. Figure 1.2 depicts the
improvement in performance for the most advanced high-end supercomputers in the same
20-year period covered by Fig. 1.1. Two classes of computers have been included: (1)
Cray-type pipelined vector supercomputers, represented by the lower straight line, and (2)
massively parallel processors (MPPs) corresponding to the shorter upper lines [Bell92].
We see from Fig. 1.2 that the first class will reach the TFLOPS performance benchmark
around the turn of the century. Even assuming that the performance of such machines will
continue to improve at this rate beyond the year 2000, the next milestone, i.e., PFLOPS (peta
FLOPS = 10¹⁵ FLOPS) performance, will not be reached until the year 2015. With massively
parallel computers, TFLOPS performance is already at hand, albeit at a relatively high cost.
PFLOPS performance within this class should be achievable in the 2000–2005 time frame,
again assuming continuation of the current trends. In fact, we already know of one serious
roadblock to continued progress at this rate: Research in the area of massively parallel
computing is not being funded at the levels it enjoyed in the 1980s.
But who needs supercomputers with TFLOPS or PFLOPS performance? Applications
of state-of-the-art high-performance computers in military, space research, and climate
modeling are conventional wisdom. Lesser known are applications in auto crash or engine
combustion simulation, design of pharmaceuticals, design and evaluation of complex ICs,
scientific visualization, and multimedia. In addition to these areas, whose current
computational needs are met by existing supercomputers, there are unmet computational needs in
aerodynamic simulation of an entire aircraft, modeling of global climate over decades, and
investigating the atomic structures of advanced materials.
Figure 1.2. The exponential growth in supercomputer performance over the past two decades
[Bell92].
Let us consider a few specific applications, in the area of numerical simulation for
validating scientific hypotheses or for developing behavioral models, where TFLOPS
performance is required and PFLOPS performance would be highly desirable [Quin94].
To learn how the southern oceans transport heat to the South Pole, the following model
has been developed at Oregon State University. The ocean is divided into 4096 regions E–W,
1024 regions N–S, and 12 layers in depth (50 M 3D cells). A single iteration of the model
simulates ocean circulation for 10 minutes and involves about 30B floating-point operations.
To carry out the simulation for 1 year, about 50,000 iterations are required. Simulation for
6 years would involve 10¹⁶ floating-point operations.
In the field of fluid dynamics, the volume under study may be modeled by a 10³ × 10³
× 10³ lattice, with about 10³ floating-point operations needed per point over 10⁴ time steps.
This too translates to 10¹⁶ floating-point operations.
As a final example, in Monte Carlo simulation of a nuclear reactor, about 10¹¹ particles
must be tracked, as about 1 in 10⁸ particles escape from a nuclear reactor and, for accuracy,
we need at least 10³ escapes in the simulation. With 10⁴ floating-point operations needed per
particle tracked, the total computation constitutes about 10¹⁵ floating-point operations.
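The three operation counts above follow from simple multiplication, as the following Python sketch shows. The variable names are illustrative assumptions; all numeric inputs are the figures quoted in the text.

```python
# Ocean circulation model: grid size and total work for a 6-year simulation.
ocean_cells = 4096 * 1024 * 12          # about 50 M three-dimensional cells
ocean_flops = 30e9 * 50_000 * 6         # flops/iteration x iterations/year x years

# Fluid dynamics: lattice points x flops per point x time steps.
fluid_flops = (10**3) ** 3 * 10**3 * 10**4

# Monte Carlo reactor simulation: particles needed for 10^3 escapes
# when about 1 particle in 10^8 escapes, times flops per particle.
particles = 10**3 * 10**8
reactor_flops = particles * 10**4

print(f"ocean cells:   {ocean_cells:,}")
print(f"ocean flops:   {ocean_flops:.1e}")
print(f"fluid flops:   {fluid_flops:.1e}")
print(f"reactor flops: {reactor_flops:.1e}")
```

The ocean model comes to 9 × 10¹⁵ operations, which the text rounds to 10¹⁶.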
From the above, we see that 10¹⁵–10¹⁶ floating-point operations are required for many
applications. If we consider 10³–10⁴ seconds a reasonable running time for such computations,
the need for TFLOPS performance is evident. In fact, researchers have already begun
working toward the next milestone of PFLOPS performance, which would be needed to run
the above models with higher accuracy (e.g., 10 times finer subdivisions in each of three
dimensions) or for longer durations (more steps).
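The required computation rate is simply the operation count divided by the acceptable running time. The Python sketch below works this out for the ranges quoted above. The names are illustrative, and the refined case assumes that the work grows in proportion to the number of cells (10 times finer in each of three dimensions gives 10³ times as many cells), which ignores any accompanying change in the number of time steps.

```python
TFLOPS = 1e12
PFLOPS = 1e15

def required_rate(total_operations, running_time_s):
    """Sustained floating-point rate needed to finish in the given time."""
    return total_operations / running_time_s

lowest = required_rate(1e15, 1e4)    # smallest job, longest acceptable run
highest = required_rate(1e16, 1e3)   # largest job, shortest acceptable run
refined = required_rate(1e16 * 10**3, 1e4)  # assumed 1000x more cells

print(f"baseline: {lowest / TFLOPS:.1f} to {highest / TFLOPS:.1f} TFLOPS")
print(f"refined model: {refined / PFLOPS:.1f} PFLOPS")
```

The baseline range brackets 1 TFLOPS, while the refined model reaches the PFLOPS level even with the longest acceptable running time.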
The motivations for parallel processing can be summarized as follows:
1. Higher speed, or solving problems faster. This is important when applications have
“hard” or “soft” deadlines. For example, we have at most a few hours of computation
time to do 24-hour weather forecasting or to produce timely tornado warnings.
2. Higher throughput, or solving more instances of given problems. This is important
when many similar tasks must be performed. For example, banks and airlines,
among others, use transaction processing systems that handle large volumes of data.
3. Higher computational power, or solving larger problems. This would allow us to
use very detailed, and thus more accurate, models or to carry out simulation runs
for longer periods of time (e.g., 5-day, as opposed to 24-hour, weather forecasting).
All three aspects above are captured by a figure-of-merit often used in connection with
parallel processors: the computation speed-up factor with respect to a uniprocessor. The
ultimate efficiency in parallel systems is to achieve a computation speed-up factor of p with
p processors. Although in many cases this ideal cannot be achieved, some speed-up is
generally possible. The actual gain in speed depends on the architecture used for the system
and the algorithm run on it. Of course, for a task that is (virtually) impossible to perform on
a single processor in view of its excessive running time, the computation speed-up factor can
rightly be taken to be larger than p or even infinite. This situation, which is the analogue of
several men moving a heavy piece of machinery or furniture in a few minutes, whereas one
of them could not move it at all, is sometimes referred to as parallel synergy.
This book focuses on the interplay of architectural and algorithmic speed-up tech-
niques. More specifically, the problem of algorithm design for general-purpose parallel
systems and its “converse,” the incorporation of architectural features to help improve
algorithm efficiency and, in the extreme, the design of algorithm-based special-purpose
parallel architectures, are considered.
A major issue in devising a parallel algorithm for a given problem is the way in which
the computational load is divided between the multiple processors. The most efficient scheme
often depends both on the problem and on the parallel machine’s architecture. This section
exposes some of the key issues in parallel processing through a simple example [Quin94].
Consider the problem of constructing the list of all prime numbers in the interval [1, n]
for a given integer n > 0. A simple algorithm that can be used for this computation is the
sieve of Eratosthenes. Start with the list of numbers 1, 2, 3, 4, . . . , n represented as a “mark”
bit-vector initialized to 1000 . . . 00. In each step, the next unmarked number m (associated
with a 0 in element m of the mark bit-vector) is a prime. Find this element m and mark all
multiples of m beginning with m². When m² > n, the computation stops and all unmarked
elements are prime numbers. The computation steps for n = 30 are shown in Fig. 1.3.
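The marking procedure described above can be sketched in Python as follows. This is a minimal sequential version for illustration; the function name and the use of a Boolean list in place of a bit-vector are assumptions made here.

```python
def sieve_of_eratosthenes(n):
    """Return the primes in [1, n] using a mark vector (True = marked)."""
    marked = [False] * (n + 1)   # index 0 is unused
    marked[1] = True             # 1 is not prime: the leading 1 in 1000...00
    m = 2
    while m * m <= n:            # stop once m squared exceeds n
        if not marked[m]:        # the next unmarked number is a prime
            for multiple in range(m * m, n + 1, m):
                marked[multiple] = True
        m += 1
    return [k for k in range(1, n + 1) if not marked[k]]

print(sieve_of_eratosthenes(30))
# prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Marking starts at m² because every smaller multiple of m has a prime factor below m and was therefore marked in an earlier step.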
Figure 1.4. Schematic representation of single-processor solution for the sieve of Eratosthenes.
Figure 1.5. Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
Finally, consider the data-parallel solution, but with data I/O time also included in the
total solution time. Assuming for simplicity that the I/O time is constant and ignoring
communication time, the I/O time will constitute a larger fraction of the overall solution time
as the computation part is speeded up by adding more and more processors. If I/O takes 100
seconds, say, then there is little difference between doing the computation part in 1 second
or in 0.01 second. We will later see that such “sequential” or “unparallelizable” portions of
computations severely limit the speed-up that can be achieved with parallel processing.
Figure 1.9 shows the effect of I/O on the total solution time and the attainable speed-up.
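The argument can be made concrete with a small sketch under the simplifying model of this paragraph: constant I/O time, a computation part that divides perfectly among p processors, and no communication time. The function names and the sample figures (1000 seconds of computation, 100 seconds of I/O) are hypothetical and are not taken from Fig. 1.9.

```python
def total_time(compute_time, io_time, p):
    """Total solution time when only the computation part is divided by p."""
    return io_time + compute_time / p


def speed_up(compute_time, io_time, p):
    """Speed-up over the single-processor solution, with I/O time included."""
    return total_time(compute_time, io_time, 1) / total_time(compute_time, io_time, p)


# Hypothetical workload: 1000 s of computation, 100 s of constant I/O.
for p in (1, 10, 100, 1000):
    print(p, round(total_time(1000, 100, p), 2), round(speed_up(1000, 100, p), 2))
```

With these assumed figures the speed-up can never reach (1000 + 100)/100 = 11, however many processors are added, because the I/O time is paid in full on every run.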
Figure 1.8. Trade-off between communication time and computation time in the data-parallel
realization of the sieve of Eratosthenes.
Figure 1.9. Effect of a constant I/O time on the data-parallel realization of the sieve of
Eratosthenes.
Imagine a large hall like a theater. . . . The walls of this chamber are painted to form a
map of the globe. . . . A myriad of computers are at work upon the weather on the part
of the map where each sits, but each computer attends to only one equation or part of an
equation. The work of each region is coordinated by an official of higher rank. Numerous
little ‘night signs’ display the instantaneous values so that neighbouring computers can
read them. . . . One of [the conductor’s] duties is to maintain a uniform speed of progress
in all parts of the globe. . . . But instead of waving a baton, he turns a beam of rosy light
upon any region that is running ahead of the rest, and a beam of blue light upon those
that are behindhand. [See Fig. 1.10.]
Parallel processing, in the literal sense of the term, is used in virtually every modern
computer. For example, overlapping I/O with computation is a form of parallel processing,
as is the overlap between instruction preparation and execution in a pipelined processor.
Other forms of parallelism or concurrency that are widely used include the use of multiple
functional units (e.g., separate integer and floating-point ALUs or two floating-point
multipliers in one ALU) and multitasking (which allows overlap between computation and
memory load necessitated by a page fault). Horizontal microprogramming, and its
higher-level incarnation in very-long-instruction-word (VLIW) computers, also allows some
parallelism. However, in this book, the term parallel processing is used in a restricted sense of
having multiple (usually identical) processors for the main computation and not for the I/O
or other peripheral activities.
The history of parallel processing has had its ups and downs (read company formations
and bankruptcies!) with what appears to be a 20-year cycle. Serious interest in parallel
processing started in the 1960s. ILLIAC IV, designed at the University of Illinois and later
built and operated by Burroughs Corporation, was the first large-scale parallel computer
implemented; its 2D-mesh architecture with a common control unit for all processors was
based on theories developed in the late 1950s. It was to scale to 256 processors (four
quadrants of 64 processors each). Only one 64-processor quadrant was eventually built, but
it clearly demonstrated the feasibility of highly parallel computers and also revealed some
of the difficulties in their use.
Commercial interest in parallel processing resurfaced in the 1980s. Driven primarily by
contracts from the defense establishment and other federal agencies in the United States,
numerous companies were formed to develop parallel systems. Established computer
vendors also initiated or expanded their parallel processing divisions. However, three factors led
to another recess:
1. Government funding in the United States and other countries dried up, in part related
to the end of the cold war between the NATO allies and the Soviet bloc.
2. Commercial users in banking and other data-intensive industries were either
saturated or disappointed by application difficulties.
3. Microprocessors developed so fast in terms of performance/cost ratio that
custom-designed parallel machines always lagged in cost-effectiveness.
Many of the newly formed companies went bankrupt or shifted their focus to developing
software for distributed (workstation cluster) applications.
Driven by the Internet revolution and its associated “information providers,” a third
resurgence of parallel architectures is imminent. Centralized, high-performance machines
may be needed to satisfy the information processing/access needs of some of these providers.
Parallel computers can be divided into two main categories: control flow and data
flow. Control-flow parallel computers are essentially based on the same principles as the
sequential or von Neumann computer, except that multiple instructions can be executed at
any given time. Data-flow parallel computers, sometimes referred to as “non-von Neumann,”
are completely different in that they have no pointer to active instruction(s) or a locus of
control. The control is totally distributed, with the availability of operands triggering the
activation of instructions. In what follows, we will focus exclusively on control-flow parallel
computers.
In 1966, M. J. Flynn proposed a four-way classification of computer systems based on
the notions of instruction streams and data streams. Flynn’s classification has become
standard and is widely used. Flynn coined the abbreviations SISD, SIMD, MISD, and MIMD
(pronounced “sis-dee,” “sim-dee,” and so forth) for the four classes of computers shown in
Fig. 1.11, based on the number of instruction streams (single or multiple) and data streams
(single or multiple) [Flyn96]. The SISD class represents ordinary “uniprocessor” machines.
Computers in the SIMD class, with several processors directed by instructions issued from
a central control unit, are sometimes characterized as “array processors.” Machines in the
MISD category have not found widespread application, but one can view them as generalized
pipelines in which each stage performs a relatively complex operation (as opposed to
ordinary pipelines found in modern processors where each stage does a very simple
instruction-level operation).
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E.
Johnson proposed a further classification of such machines based on their memory structure
(global or distributed) and the mechanism used for communication/synchronization (shared
variables or message passing). Again, one of the four categories (GMMP) is not widely used.
The GMSV class is what is loosely referred to as (shared-memory) multiprocessors. At the
Over the years, the enthusiasm of parallel computer designers and researchers has been
counteracted by many objections and cautionary statements. The most important of these are
listed in this section [Quin87]. The list begins with the less serious, or obsolete, objections
and ends with Amdahl’s law, which perhaps constitutes the most important challenge facing
parallel computer designers and users.
random. Most applications have a pleasant amount of data access regularity and
locality that help improve the performance. One might say that the log p speed-up
rule is one side of the coin that has the perfect speed-up p on the flip side. Depending
on the application, real speed-up can range from log p to p (p/log p being a
reasonable middle ground).
3. The tyranny of IC technology (because hardware becomes about 10 times faster
every 5 years, by the time a parallel machine with 10-fold performance is designed
and implemented, uniprocessors will be just as fast). This objection might be valid
for some special-purpose systems that must be built from scratch with “old”
technology. Recent experience in parallel machine design has shown that off-the-
shelf components can be used in synthesizing massively parallel computers. If the
design of the parallel processor is such that faster microprocessors can simply be
plugged in as they become available, they too benefit from advancements in IC
technology. Besides, why restrict our attention to parallel systems that are designed
to be only 10 times faster rather than 100 or 1000 times?
4. The tyranny of vector supercomputers (vector supercomputers, built by Cray,
Fujitsu, and other companies, are rapidly improving in performance and additionally
offer a familiar programming model and excellent vectorizing compilers; why
bother with parallel processors?). Figure 1.2 contains a possible answer to this
objection. Besides, not all computationally intensive applications deal with vectors
or matrices; some are in fact quite irregular. Note, also, that vector and parallel
processing are complementary approaches. Most current vector supercomputers do
in fact come in multiprocessor configurations for increased performance.
5. The software inertia (billions of dollars worth of existing software makes it hard to
switch to parallel systems; the cost of converting the “dusty decks” to parallel
programs and retraining the programmers is prohibitive). This objection is valid in
the short term; however, not all programs needed in the future have already been
written. New applications will be developed and many new problems will become
solvable with increased performance. Students are already being trained to think
parallel. Additionally, tools are being developed to transform sequential code into
parallel code automatically. In fact, it has been argued that it might be prudent to
develop programs in parallel languages even if they are to be run on sequential
computers. The added information about concurrency and data dependencies would
allow the sequential computer to improve its performance by instruction prefetching,
data caching, and so forth.
6. Amdahl’s law (speed-up ≤ 1/[ƒ + (1 – ƒ)/p] = p/[1 + ƒ(p – 1)]; a small fraction ƒ of
inherently sequential or unparallelizable computation severely limits the speed-up
that can be achieved with p processors). This is by far the most important of the six
objections/warnings. A unit-time task, for which the fraction ƒ is unparallelizable
(so it takes the same time ƒ on both sequential and parallel machines) and the
remaining 1 – ƒ is fully parallelizable [so it runs in time (1 – ƒ)/p on a p-processor
machine], has a running time of ƒ + (1 – ƒ)/p on the parallel machine, hence
Amdahl’s speed-up formula.
Figure 1.12 plots the speed-up as a function of the number of processors for different values
of the inherently sequential fraction ƒ. The speed-up can never exceed 1/ƒ, no matter how
many processors are used. Thus, for ƒ = 0.1, speed-up has an upper bound of 10. Fortunately,
there exist applications for which the sequential overhead is very small. Furthermore, the
sequential overhead need not be a constant fraction of the job independent of problem size.
In fact, the existence of applications for which the sequential overhead, as a fraction of the
overall computational work, diminishes has been demonstrated.
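Amdahl's formula is easy to evaluate numerically. The following sketch (the function name amdahl_speed_up is an assumption made for illustration) computes the bound 1/[ƒ + (1 – ƒ)/p] and shows how quickly it saturates toward 1/ƒ.

```python
def amdahl_speed_up(f, p):
    """Amdahl's bound on speed-up for sequential fraction f and p processors."""
    # A unit-time task runs in f + (1 - f)/p on the p-processor machine.
    return 1.0 / (f + (1.0 - f) / p)


# For f = 0.1 the bound approaches, but never reaches, 1/f = 10.
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speed_up(0.1, p), 3))
```

With ƒ = 0.1, ten processors yield a bound of only about 5.26, and going from 1000 to a million processors gains almost nothing; this is the saturation behavior that the plot of speed-up versus p is meant to convey.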
Closely related to Amdahl’s law is the observation that some applications lack inherent
parallelism, thus limiting the speed-up that is achievable when multiple processors are used.
Figure 1.13 depicts a task graph characterizing a computation. Each of the numbered nodes
in the graph is a unit-time computation and the arrows represent data dependencies or the
prerequisite structure of the graph. A single processor can execute the 13-node task graph
shown in Fig. 1.13 in 13 time units. Because the critical path from input node 1 to output
node 13 goes through 8 nodes, a parallel processor cannot do much better, as it needs at least
8 time units to execute the task graph. So, the speed-up associated with this particular task
graph can never exceed 1.625, no matter how many processors are used.
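The same bound can be computed for any task graph of unit-time nodes: divide the number of nodes by the number of nodes on the longest (critical) path, which gives 13/8 = 1.625 for the graph discussed above. The sketch below illustrates the calculation on a small hypothetical 6-node graph, since the graph of Fig. 1.13 is not reproduced here; the function names and the dictionary representation are assumptions made for the example.

```python
from functools import lru_cache


def critical_path_length(successors):
    """Return the number of nodes on the longest path of a unit-time task DAG.

    `successors` maps each node to a tuple of the nodes that depend on it.
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        # A node costs one time unit plus the longest chain that follows it.
        following = successors.get(node, ())
        return 1 + max((longest_from(nxt) for nxt in following), default=0)

    return max(longest_from(node) for node in successors)


def speed_up_bound(successors):
    """Upper bound on speed-up: total unit tasks divided by critical path length."""
    nodes = set(successors)
    for following in successors.values():
        nodes.update(following)
    return len(nodes) / critical_path_length(successors)


# Hypothetical 6-node task graph (NOT the graph of Fig. 1.13).
example_graph = {1: (2, 3), 2: (4,), 3: (4, 5), 4: (6,), 5: (6,), 6: ()}

print(critical_path_length(example_graph))  # 4 (for example, 1 -> 2 -> 4 -> 6)
print(speed_up_bound(example_graph))        # 1.5
```

The bound holds no matter how many processors are available, because the nodes on the critical path must be executed one after another.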
Throughout the book, we will be using certain measures to compare the effectiveness
of various parallel algorithms or architectures for solving desired problems. The following
definitions and notations are applicable [Lee80]:
p      Number of processors
W(p)   Total number of unit operations performed by the p processors; this is often
       referred to as computational work or energy
T(p)   Execution time with p processors; clearly, T(1) = W(1) and T(p) ≤ W(p)
S(p)   Speed-up = T(1)/T(p)
E(p)   Efficiency = T(1)/[pT(p)]
R(p)   Redundancy = W(p)/W(1)
U(p)   Utilization = W(p)/[pT(p)]
Q(p)   Quality = T³(1)/[pT²(p)W(p)]
The significance of each measure is self-evident from its name and defining equation given
above. It is not difficult to establish the following relationships between these parameters.
The proof is left as an exercise.
1 ≤ S(p) ≤ p
U(p) = R(p)E(p)
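These measures are simple to compute once p, T(1), T(p), and W(p) are known. The sketch below assumes the customary definitions S(p) = T(1)/T(p), E(p) = T(1)/[pT(p)], R(p) = W(p)/W(1), U(p) = W(p)/[pT(p)], and Q(p) = T³(1)/[pT²(p)W(p)]; the function name and the sample values are hypothetical and do not come from an example in the text.

```python
def performance_measures(p, t1, tp, wp):
    """Compute S, E, R, U, and Q from p, T(1), T(p), and W(p)."""
    w1 = t1  # T(1) = W(1): one processor does all the work sequentially
    return {
        "S": t1 / tp,                       # speed-up
        "E": t1 / (p * tp),                 # efficiency
        "R": wp / w1,                       # redundancy
        "U": wp / (p * tp),                 # utilization
        "Q": t1 ** 3 / (p * tp ** 2 * wp),  # quality
    }


# Hypothetical case: 4 processors, T(1) = 16, T(4) = 8, W(4) = 20.
m = performance_measures(p=4, t1=16, tp=8, wp=20)
print(m)

# Spot-check the relationships 1 <= S(p) <= p and U(p) = R(p)E(p).
print(1 <= m["S"] <= 4, abs(m["U"] - m["R"] * m["E"]) < 1e-12)
```

In this assumed case the four processors perform 20 unit operations where 16 would suffice sequentially, so the redundancy is 1.25 and the utilization (0.625) exceeds the efficiency (0.5).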
The efficiency in this latter case is even lower, primarily because the interprocessor transfers
constitute overhead rather than useful operations.
PROBLEMS
a. Draw the task graph for this image processing application problem.
b. What is the maximum speed-up that can be achieved for this application with two
processors?
c. What is an upper bound on the speed-up with parallel processing?
d. How many processors are sufficient to achieve the maximum speed-up derived in part (c)?
e. What is the maximum speed-up in solving five independent instances of the problem on
two processors?
f. What is an upper bound on the speed-up in parallel solution of 100 independent instances
of the problem?
g. How many processors are sufficient to achieve the maximum speed-up derived in part (f)?
h. What is an upper bound on the speed-up, given a steady stream of independent problem
instances?
Although no notes on the matter exist, it is nevertheless certain that
Kihlman followed attentively and with great interest those phenomena
within the Awakening movement that gradually led or contributed to
the split that took place in 1852. [Rosendal, op. cit., III, Chapter XI.]
In January 1849 he was present when Malmberg, F. O. Durchman,
L. Stenbäck, A. O. Törnudd, J. Grönberg, C. G. von Essen, and others
gathered with their families at Frans Bergroth's home in Keuruu, in
order to continue from there to the Kuopio fair. Of the first-mentioned,
however, it is said that, having drunk too much and lost his temper
while arguing with the others, he stayed behind in Keuruu, although
all the others, his wife included, traveled on to Kuopio. On this
journey, and especially at Julius Bergh's home, there had been
discussion of Wilhelm Niskanen's doctrine and influence along the
Kalajoki river, and also of Malmberg, who was on good terms with
Niskanen and whom the friends seem in other respects, too, to have
begun to view critically. Nevertheless, their relationship with
Malmberg remained as before, and whenever an occasion arose (for
example, Edv. Svan's funeral in Purmo on 30 January 1849, Essen's
installation in office in Ylihärmä on 1 July of the same year, the
wedding of one of Bergroth's servants in Keuruu in October 1850, and
so on), they came together in the old way in all friendship. That
Kihlman himself, despite all sorts of doubts of which he speaks later,
still held to the original position of the Awakening during these years
is probably also shown by the fact that in 1850 he went to meet old
Paavo on two separate occasions. In that year he noted that he set out
on 1 March for Pyhäjärvi (Jonas Lagus), arrived on the 3rd at "the
island" (that is, Aholansaari in Lake Syväri, where Paavo lived, in
Nilsiä), and reached Kuopio on the evening of the 4th. In whose
company he traveled is uncertain, but the plural form ("kommo")
shows that he was not alone. Of the second trip we learn from a letter
he addressed to Gela (8 September). In it Kihlman says that he
arrived safely in Haapajärvi, but that when he came, Reinhold
Helander was just about to leave for Nilsiä and could not postpone his
journey, because he had to be home by a set date. "So that the purpose
of my journey should not go unfulfilled, and so that I might keep
company with Reinhold and at the same time meet the famous old
Paavo, reviled by many and praised by many, [The letter was written
to be shown to his father; hence the words about Paavo, from whose
general form one might suppose that Alfred had never met the old man
before.] I have, after much anguish, decided to join the journey. The
hardest and greatest obstacle has been that I do not have Father's
permission, and I would certainly not travel if I knew that Father
would be offended by the journey. But hoping that Father will forgive
me, I venture to go." He tells Gela to go at once to the parsonage to
inform his father of the matter, but otherwise the trip is to be kept
secret, "for people gossip, and the bishop is nearby (i farvattnet)."
That relations between father and son were still not good is probably
shown by a note (13 October 1850): "I was ready to preach in my turn;
on Sunday morning the dean comes to the church and says that he will
preach himself."
*****
Preparations for departure; in Sweden.
Before Kihlman had learned that he would have to wait who knows
how long even for a passport, he had returned once more to
Kruununkylä to break up his home. On 14 July he sent four-year-old
Hanna, in the care of Miss Emma Candelin, to Keuruu, where his
sister-in-law, Hilda Bergroth, was to take the child into her motherly
care. He himself stayed on for another five days, so that his day of
departure was 20 July. His travel diary also begins from that date,
although weeks still passed before he got beyond the borders of his
homeland.
*****