
14013204-3 - PARALLEL COMPUTING

Lecture 1 (1/6/25)
14013204-3 - Parallel Computing (3 credits)
n Course Description
n This course examines the theory and practice of parallel computing.
n Topics covered:
n Introduction to Parallel computing.
n Parallel architectures.
n Designing parallel algorithms and managing the different kinds of parallel programming overhead, e.g., synchronization, communication, etc.
n Measuring and tuning parallel performance.
n Programming for shared and distributed parallel architectures.
n Prerequisites
n 14012203-4 Operating Systems,
n 14012401-3 Data Structures



14013204-3 - Parallel Computing (3 credits)
n Course Weekly Hours
n (2 lec + 2 lab)/week
n Textbook/References
n An Introduction to Parallel Programming, 2nd ed., Peter Pacheco and Matthew Malensek, Morgan Kaufmann, 2022.
n Introduction to Parallel Computing: From Algorithms to Programming on State-of-the-Art Platforms, Roman Trobec, Boštjan Slivnik, Patricio Bulić, and Borut Robič, Springer, 2018.



14013204-3 - Parallel Computing (3 credits)
n Assessment Methods
n Quizzes & Participation: 15 %
n Lab: 25 %
n Midterm: 25 %
n Final: 35 %

n These assessment methods are subject to change.



An Introduction to Parallel Programming
Peter Pacheco

Chapter 1

Why Parallel Computing?



Roadmap
n What is parallel computing?
n How is performance achieved?
n Why do we need ever-increasing performance?
n Why do we need to write parallel programs?
n How do we write parallel programs?
n Types of parallel systems.
n What we’ll be doing.
n Concurrent, parallel, distributed!



What is Parallel Programming?

§ In general, most software will eventually have to make use of parallelism, because performance matters.

How is performance achieved?
n All processors are made of transistors.
n Transistors are the fundamental components of a CPU and play a crucial role in its operation.
n Smaller transistors change state faster: they enable higher speeds.
n Manufacturers also added advanced hardware features that made code run faster automatically.

n Moore's Law: the number of transistors integrated into a single chip doubles roughly every two years, leading to an enormous increase in CPU performance and, consequently, in application performance.

How is performance achieved?
n Smaller transistors → more transistors on chips (an increase in transistor density).
n More transistors on chips → more computational power (faster processors) → higher application performance.
n Each new generation of processors provides more transistors and offers higher speed.
n From 1986 to 2003, microprocessor performance took off like a rocket, increasing by an average of 50% per year.
n This unprecedented increase meant that users and software developers could often simply wait for the next generation of microprocessors to obtain increased performance from their applications.
n BUT this free performance gain ended around 2003–2004!

Changing times
n Since 2003, however, single-processor performance improvement has
slowed to the point that in the period from 2015 to 2017, it increased at
less than 4% per year.
n Conventional processors have reached the point where their performance and speed can no longer be improved simply by adding more transistors.
n Why?



Changing times
n However, as the speed of transistors increases, their power consumption also increases.
n Most of this power is dissipated as heat, and when an integrated circuit gets too hot, it becomes unreliable.
n Faster processors → increased power consumption.
n Increased power consumption → increased heat.
n Increased heat → unreliable processors.
n Dissipating (removing) the heat requires more and more sophisticated equipment; ordinary heat sinks can no longer do the job.
n In the first decade of the twenty-first century, air-cooled integrated circuits reached the limits of their ability to dissipate heat. Therefore, it is becoming impossible to continue to increase the speed of integrated circuits, and in the last few years the increase in transistor density has slowed dramatically.
Changing times
n Let's look at some heatsinks:
n Intel 386 (25 MHz): the 386 had no heatsink!
n It did not generate much heat, because it ran at a very low clock speed.

Changing times
[Figures: heatsinks for the 486 (~50 MHz), Pentium 2/3 (233–733 MHz), Pentium 4 (2–3 GHz), and Core i7 (3–3.5 GHz).]

Why we need ever-increasing performance
n Computational power is increasing, but so are our computation
problems and needs.
n Problems we never dreamed of have been solved because of past increases: decoding the human genome, ever more accurate medical imaging, astonishingly fast and accurate Web searches, and ever more realistic and responsive computer games would all have been impossible without them.
n More complex problems are still waiting to be solved.



Climate modeling
§ To better understand climate change, we need far more accurate computer
models, models that include interactions between the atmosphere, the oceans,
solid land, and the ice caps at the poles. We also need to be able to make detailed
studies of how various interventions might affect the global climate.



Protein folding
§ It’s believed that misfolded proteins may be involved in diseases such as
Huntington’s, Parkinson’s, and Alzheimer’s, but our ability to study configurations of
complex molecules such as proteins is severely limited by our current
computational power.



Drug discovery
§ There are many ways in which increased computational power can be used in
research into new medical treatments. For example, there are many drugs that
are effective in treating a relatively small fraction of those suffering from some
disease. It’s possible that we can devise alternative treatments by careful
analysis of the genomes of the individuals for whom the known treatment is
ineffective. This, however, will involve extensive computational analysis of
genomes.



Energy research
§ Increased computational power will make it possible to program much more
detailed models of technologies, such as wind turbines, solar cells, and
batteries. These programs may provide the information needed to construct far
more efficient clean energy sources.



Data analysis
§ We generate tremendous amounts of data. By some estimates, the quantity of
data stored worldwide doubles every two years, but most of it is largely useless
unless it’s analyzed.
§ As an example, knowing the sequence of nucleotides in human DNA is, by
itself, of little use. Understanding how this sequence affects development and
how it can cause disease requires extensive analysis.
§ In addition to genomics, huge quantities of data are medical imaging,
astronomical research, and Web search engines—to name a few.



Solution
n This difference in performance increase has been associated with a
dramatic change in processor design.
n By 2005, most of the major manufacturers of microprocessors had
decided that the road to rapidly increasing performance lay in the
direction of parallelism.
n Instead of designing and building faster processors, put multiple
complete processors on a single integrated circuit.
n Move away from single-core systems to multicore processors.
n “core” = central processing unit (CPU)
n Desktop and laptop processors typically have 4 to 16 cores, sometimes more.
n Server processors can exceed 64 cores.



Why we need to write parallel programs
n This change has a very important consequence for software
developers: Adding more processors doesn’t help much if
programmers aren’t aware of them… or don’t know how to use them.

n Serial programs (programs that were written to run on a single processor) don't benefit from this approach (in most cases).
n Such programs are unaware of the existence of multiple processors, and the performance of such a program on a system with multiple processors will be effectively the same as its performance on a single processor of the multiprocessor system.



Approaches to the serial problem
n Rewrite serial programs so that they're parallel and can make use of multiple cores.
n Write translation programs that automatically convert
serial programs into parallel programs.
n This is very difficult to do.
n Success has been limited.
n Sometimes the best parallel solution is to step back and
devise an entirely new algorithm.



Example
n Compute n values and add them together.
n Serial solution:
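The serial code on this slide did not survive extraction, so here is a minimal, runnable C sketch in the same spirit (not the book's exact code). Compute_next_value is a hypothetical stand-in that just returns i % 10 so the program compiles and runs.

    #include <stdio.h>

    /* Hypothetical stand-in for the book's Compute_next_value(...). */
    static int Compute_next_value(int i) {
        return i % 10;
    }

    int main(void) {
        int n = 24;
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += Compute_next_value(i);   /* compute a value, add it in */
        printf("sum = %d\n", sum);
        return 0;
    }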



Example (cont.)
n We have p cores, p much smaller than n.
n Each core performs a partial sum of approximately n/p values.

n Each core uses its own private variables and executes the block of code sketched below independently of the other cores.
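A sketch of that block, assuming each core knows its rank my_rank (0 to p-1), the core count p, and n, and reusing the Compute_next_value stand-in from the serial sketch; the function name partial_sum and the block-partitioning arithmetic are illustrative choices, not the slide's exact code.

    /* Partial sum computed independently by the core with rank my_rank. */
    int partial_sum(int my_rank, int p, int n) {
        int q = n / p, r = n % p;
        /* Block partition: the first r cores get one extra value each. */
        int my_n       = q + (my_rank < r ? 1 : 0);
        int my_first_i = my_rank * q + (my_rank < r ? my_rank : r);
        int my_last_i  = my_first_i + my_n;

        int my_sum = 0;
        for (int my_i = my_first_i; my_i < my_last_i; my_i++)
            my_sum += Compute_next_value(my_i);
        return my_sum;
    }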



Example (cont.)
n After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.
n E.g., with 8 cores and n = 24, suppose the calls to Compute_next_value return: 1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9
n Then the values stored in my_sum will be: 8, 19, 7, 15, 7, 13, 12, 14



Example (cont.)
n Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated "master" core, which adds them to produce the final result.
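A deliberately simplified sketch of this naive scheme (my illustration, not the slide's code): pretend the p partial sums have already arrived at the master in an array, so each addition stands in for one receive-plus-add.

    /* Master ("core 0") adds up the p partial sums: p-1 receives + p-1 additions. */
    int naive_global_sum(const int partial[], int p) {
        int total = partial[0];
        for (int core = 1; core < p; core++)
            total += partial[core];
        return total;
    }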



Example (cont.)

Global sum
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95



But wait!
There’s a much better way
to compute the global sum.



Better parallel algorithm
n Don’t make the master core do all the work.
n Share it among the other cores.
n Pair the cores so that:
n Core 0 adds its result with core 1’s result.
n Core 2 adds its result with core 3’s result, etc.

n That is, work with odd- and even-numbered pairs of cores.


n Then we can repeat the process with only the even-ranked cores:
n 0 adds in the result of 2,
n 4 adds in the result of 6, and so on.

n Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result (a sketch follows below).
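A sketch of the tree-structured sum over the same array of partial sums (again, my illustration standing in for real receives): at each step the remaining cores pair up, so the number of partial sums still in play halves.

    /* Tree-structured global sum: after ceil(log2(p)) steps, partial[0]
     * (i.e., core 0) holds the total. Works for any p >= 1. */
    int tree_global_sum(int partial[], int p) {
        for (int gap = 1; gap < p; gap *= 2)
            for (int r = 0; r + gap < p; r += 2 * gap)
                partial[r] += partial[r + gap];  /* core r "receives" from core r+gap */
        return partial[0];
    }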



Multiple cores forming a global sum



Analysis
n In the first example, the master core performs 7 receives and
7 additions.

n In the second example, the master core performs 3 receives and 3 additions.

n The improvement is more than a factor of 2!



Analysis (cont.)
n The difference is more dramatic with a larger number of cores.
n If we have 1000 cores:
n The first example would require the master to perform 999 receives and 999 additions.
n The second example would only require 10 receives and 10 additions.

n That’s an improvement of almost a factor of 100!
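A quick check of these counts (an added note, not on the original slide):

    naive scheme: core 0 performs p - 1 = 1000 - 1 = 999 receives and 999 additions
    tree scheme:  the number of partial sums halves at each step, so core 0 needs
                  only ceil(log2(1000)) = 10 receives and 10 additions
    improvement:  999 / 10 ≈ 100x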


n The first global sum is a fairly obvious generalization of the serial global
sum: divide the work of adding among the cores, and after each core has
computed its part of the sum, the master core simply repeats the basic serial
addition—if there are p cores, then it needs to add p values.
n The second global sum bears little relation to the original serial addition.
n The point here is that it’s unlikely that a translation program would
“discover” the second global sum.



How do we write parallel programs?
n There are a number of possible answers to this question, but
most of them depend on the basic idea of partitioning the
work to be done among the cores.
1. Task parallelism
n Partition the various tasks carried out in solving the problem among the cores.

2. Data parallelism
n Partition the data used in solving the problem among the cores.
n Each core carries out similar operations on its part of the data (see the sketch below).
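To make the two decompositions concrete, here is a small OpenMP sketch (OpenMP is introduced later in the course; this is an added illustration, not the slide's code). The parallel for splits the iterations of the same work among the threads (data parallelism), while the sections give different threads different jobs (task parallelism). Compile with, e.g., gcc -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 24 };
        int x[N];
        int sum = 0;

        /* Data parallelism: every thread runs the same loop body on its own
         * share of the iterations; the reduction combines the partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            x[i] = i % 10;              /* stand-in for Compute_next_value */
            sum += x[i];
        }

        /* Task parallelism: different threads carry out different tasks. */
        int min = x[0], max = x[0];
        #pragma omp parallel sections
        {
            #pragma omp section         /* task 1: find the minimum */
            for (int i = 1; i < N; i++)
                if (x[i] < min) min = x[i];

            #pragma omp section         /* task 2: find the maximum */
            for (int i = 1; i < N; i++)
                if (x[i] > max) max = x[i];
        }

        printf("sum = %d, min = %d, max = %d\n", sum, min, max);
        return 0;
    }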



Professor P

15 questions
300 exams



Professor P’s grading assistants

TA#1, TA#2, TA#3



Division of work – data parallelism

TA#1: 100 exams    TA#2: 100 exams    TA#3: 100 exams



Division of work – task parallelism

TA#1: questions 1-5    TA#2: questions 6-10    TA#3: questions 11-15





Division of work – task parallelism

Tasks (in the global-sum example):
1) Receiving the partial sums
2) Adding them together



Coordination
n Cores usually need to coordinate their work.
n Communication – one or more cores send their current partial sums
to another core.
n Load balancing – share the work evenly among the cores so that no core is overloaded.
n Synchronization – because each core works at its own pace, make sure cores do not get too far ahead of the rest (see the sketch below).
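A minimal pthreads sketch (an added illustration, not from the slides) that touches all three issues: the even split of the index range is the load balancing, adding each thread's my_sum into the shared total is the communication, and the mutex provides the synchronization that keeps those updates from colliding. Compile with, e.g., gcc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define P 4    /* number of threads, standing in for cores */
    #define N 24

    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long rank  = (long)arg;
        long first = rank * N / P;        /* load balancing: even block split */
        long last  = (rank + 1) * N / P;
        long my_sum = 0;
        for (long i = first; i < last; i++)
            my_sum += i % 10;             /* stand-in for Compute_next_value */
        pthread_mutex_lock(&lock);        /* synchronization */
        total += my_sum;                  /* communication of the partial sum */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        for (long r = 0; r < P; r++)
            pthread_create(&t[r], NULL, worker, (void *)r);
        for (long r = 0; r < P; r++)
            pthread_join(&t[r], NULL);
        printf("total = %ld\n", total);
        return 0;
    }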



Explicit Vs. Implicit Parallelism
n Explicit Parallelism: Currently, the most powerful parallel programs are written
using explicit parallel constructs.
n That is, they are written using extensions to languages such as C, C++, and Java.
n These programs include explicit instructions for parallelism:
n E.g., core 0 executes task 0, core 1 executes task 1, ..., all cores synchronize, ..., and so on.
n As a result, such programs are often extremely complex.

n Implicit Parallelism: There are other options for writing parallel programs—for
example, higher level languages.
n They tend to sacrifice performance to make program development somewhat
easier.



Types of parallel systems
n Shared-memory
n The cores can share access to the computer’s memory.
n Coordinate the cores by having them examine and update
shared memory locations.
n Distributed-memory
n Each core has its own private memory.
n The cores must communicate explicitly by sending messages across a network (see the MPI-style sketch below).
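As a preview of the MPI flavor used later in the course, a minimal distributed-memory sketch of the global sum (an added illustration; the cyclic work split and the i % 10 stand-in for Compute_next_value are my choices). Typically built with mpicc and run with mpiexec -n <number of processes>.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, p, n = 24;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes */

        int my_sum = 0;
        for (int i = rank; i < n; i += p)       /* cyclic split of the work */
            my_sum += i % 10;                   /* stand-in for Compute_next_value */

        /* Explicit message passing: combine the private partial sums on rank 0. */
        int total = 0;
        MPI_Reduce(&my_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %d\n", total);

        MPI_Finalize();
        return 0;
    }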



Types of parallel systems

[Figures: a shared-memory system and a distributed-memory system.]



What we’ll be doing
n Learning Parallel HW and SW.
n Parallel architectures.
n Parallel algorithms and coordination details.
n Measuring the performance of parallel algorithms.
n Etc.
n Learn to write programs that are explicitly parallel.
n Will be using the C language.
n Using three different extensions to C.
n Message-Passing Interface (MPI) – distributed memory
n OpenMP – shared memory
n CUDA – GPU programming
Terminology
n Concurrent computing – a program is one in which multiple
tasks can be in progress at any instant.
n Parallel computing – a program is one in which multiple tasks cooperate closely to solve a problem.
n Distributed computing – a program may need to cooperate
with other programs to solve a problem.



Terminology
n So parallel and distributed programs are concurrent, but a program such as a
multitasking operating system is also concurrent, even when it is run on a machine
with only one core since multiple tasks can be in progress at any instant.
n There isn’t a clear-cut distinction between parallel and distributed programs:
n A parallel program usually runs multiple tasks simultaneously on cores that are physically close to each other and that either share the same memory or are connected by a very high-speed network.
n On the other hand, distributed programs tend to be more "loosely coupled": the tasks may be executed by multiple computers that are separated by relatively large distances, and the tasks themselves are often executed by programs that were created independently.
n As examples, our two concurrent addition programs would be considered
parallel by most authors, while a Web search program would be considered
distributed.



Concluding Remarks (1)
n The laws of physics have brought us to the doorstep of
multicore technology.
n Serial programs typically don’t benefit from multiple cores.
n Automatic parallel program generation from serial
program code isn’t the most efficient approach to get high
performance from multicore computers.



Concluding Remarks (2)
n How is performance achieved?
n Before
n Write a sequential (non-parallel) program.
n It becomes faster with newer processors.
n Newer processors offered higher speeds and more advanced hardware.
n Now
n A new processor has more cores, but each core may be slower.
n Sequential programs may even run slower on a new processor.
n They can only use one core.
n What will run faster → a parallel program that can use all the cores!



Concluding Remarks (3)
n Learning to write parallel programs involves learning how to
coordinate the cores.
n Parallel programs are usually very complex and therefore require sound programming techniques and development practices.
n Many factors affect performance.
n It is not easy to find the source of bad performance.
n Doing so usually requires a deeper understanding of processor architectures.
n This is why there is a whole course devoted to it.

