14013204-3 - Parallel Computing - Lecture 1
14013204-3 - Parallel Computing (3 credits)
n Course Description
n This course examines the theory and practice of parallel computing.
n Topics covered:
n Introduction to Parallel computing.
n Parallel architectures.
n Designing parallel algorithms, and managing the different kinds of parallel
programming overhead, e.g., synchronization, communication, etc. Measuring
and tuning parallel performance.
n Programming for shared and distributed parallel architectures.
n Prerequisites
n 14012203-4 Operating Systems,
n 14012401-3 Data Structures
Chapter 1
How is performance achieved?
n All processors are made of transistors.
n Transistors are the fundamental components of a CPU and play a crucial role
in its operation.
n Smaller transistors change state faster, so they enable higher clock speeds.
n Manufacturers also added advanced hardware features that made your code run
faster automatically.
How is performance achieved?
n Smaller transistors → more transistors on a chip (an increase in transistor density).
n More transistors on a chip → more computational power (faster processors)
→ higher application performance.
n Each new generation of processors provides more transistors and offers
higher speed.
n From 1986 to 2003, microprocessor performance was increasing like a rocket, by
an average of 50% per year.
n This unprecedented increase meant that users and software developers could
often simply wait for the next generation of microprocessors to obtain increased
performance from their applications.
n BUT this free performance gain ended around 2003-2004!
Changing times
n Since 2003, however, single-processor performance improvement has
slowed to the point that in the period from 2015 to 2017, it increased at
less than 4% per year.
n Conventional processors have reached the point where their performance and
speed can no longer be improved simply by increasing the number of transistors.
n Why?
Changing times
n However, as the speed of transistors increases, their power consumption also
increases (a standard approximation is given below).
n Most of this power is dissipated as heat, and when an integrated circuit gets too
hot, it becomes unreliable.
n Faster processors → increased power consumption.
n Increased power consumption → increased heat.
n Increased heat → unreliable processors.
n Dissipating (removing) the heat requires more and more sophisticated
equipment; heat sinks can no longer do it on their own.
n In the first decade of the twenty-first century, air-cooled integrated circuits reached the limits of
their ability to dissipate heat. Therefore, it is becoming impossible to continue to increase the
speed of integrated circuits. Indeed, in the last few years, the increase in transistor density has
slowed dramatically.
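n As a rough, standard approximation (not stated on the slides), the dynamic power of a CMOS processor scales as

    P_dynamic ≈ α · C · V² · f

where α is the switching activity factor, C is the switched capacitance, V is the supply voltage, and f is the clock frequency. Pushing the clock frequency up (and the supply voltage needed to sustain it) therefore pushes power up, and nearly all of that power must be removed as heat.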
Changing times
n Let’s look at some heatsinks:
Intel 386 (25 MHz) Heatsink
n The 386 had no heatsink!
n It did not generate much heat because it ran at a very low clock speed.
Changing times
486 (~50 MHz) Heatsink    Pentium 4 (2-3 GHz) Heatsink
Why we need ever-increasing performance
n Computational power is increasing, but so are our computation
problems and needs.
n Problems we never dreamed of have been solved because of past increases in
computational power: decoding the human genome, ever more accurate medical
imaging, astonishingly fast and accurate Web searches, and ever more realistic
and responsive computer games would all have been impossible without these
increases.
n More complex problems are still waiting to be solved.
Approaches to the serial problem
n Rewrite serial programs so that they're parallel and can make use of multiple
cores (a minimal serial starting point is sketched after this list).
n Write translation programs that automatically convert
serial programs into parallel programs.
n This is very difficult to do.
n Success has been limited.
n Sometimes the best parallel solution is to step back and
devise an entirely new algorithm.
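n As a minimal starting point (a sketch, not from the slides), here is a serial sum of the eight example values used in the global-sum slides below; the rest of this section looks at how such a serial computation can be divided among cores.

#include <stdio.h>

int main(void) {
    int a[] = {8, 19, 7, 15, 7, 13, 12, 14};   /* values from the global-sum example */
    int n = sizeof a / sizeof a[0];
    int sum = 0;
    for (int i = 0; i < n; i++)                /* adds the values one at a time */
        sum += a[i];
    printf("sum = %d\n", sum);                 /* prints 95 */
    return 0;
}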
Global sum
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
n First, pairs of cores combine their values: each even-ranked core receives the
partial sum of its odd-ranked neighbor and adds it in. Now cores divisible by 4
repeat the process, and so forth, until core 0 has the final result.
n With 1,000 cores, having core 0 receive and add every other core's partial sum
would require 999 receives and 999 additions.
n The second, tree-structured example would only require 10 receives and 10
additions (a small simulation is sketched below).
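n A small sketch of the tree-structured sum, simulated in plain C on a single machine; the array stands in for the per-core partial sums, and each pass of the outer loop is one round of "receive and add" (a real implementation would use message passing or threads):

#include <stdio.h>

#define P 8   /* number of simulated cores */

int main(void) {
    int partial[P] = {8, 19, 7, 15, 7, 13, 12, 14};  /* each core's partial sum */

    /* Round 1: even-ranked cores add in their right neighbor's value.
       Round 2: cores divisible by 4 repeat the process.
       Round 3: core 0 adds in the last remaining partial sum. */
    for (int stride = 1; stride < P; stride *= 2)
        for (int core = 0; core + stride < P; core += 2 * stride)
            partial[core] += partial[core + stride];  /* "receive" + add */

    printf("global sum = %d\n", partial[0]);          /* prints 95 */
    return 0;
}

n With P cores, core 0 performs only about log2(P) of these receive-and-add steps, which is where the "10 receives and 10 additions" for about 1,000 cores comes from, since 2^10 = 1024.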
2. Data parallelism
n Partition the data used in solving the problem among the cores.
n Each core carries out similar operations on its part of the data.
Example: 300 exams, each with 15 questions, to be graded by three TAs (TA#1, TA#2, TA#3).
Data-parallel split: each TA grades 100 of the 300 exams, handling all 15 questions on their share.
Task-parallel split: each TA grades a different block of questions (questions 1 - 5, 6 - 10, or 11 - 15) on all 300 exams.
A code sketch of the data-parallel split follows.
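n A sketch of the data-parallel split in code, using OpenMP as one possible tool (the grading work is only a placeholder): the 300 exams are partitioned among the threads ("TAs"), and every thread runs the same loop body on its own share. Compile with, e.g., gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

#define NUM_EXAMS     300
#define NUM_QUESTIONS 15

int main(void) {
    int score[NUM_EXAMS] = {0};

    /* The iterations (exams) are divided among the threads;
       with 3 threads, each one handles roughly 100 exams. */
    #pragma omp parallel for num_threads(3)
    for (int exam = 0; exam < NUM_EXAMS; exam++)
        for (int q = 0; q < NUM_QUESTIONS; q++)
            score[exam] += 1;             /* placeholder for grading question q */

    printf("exam 0 scored %d\n", score[0]);   /* 15 with this placeholder */
    return 0;
}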
Tasks in the global-sum example:
1) Receiving
2) Addition
n E.g., core 0 executes task 0, core 1 executes task 1, . . . , all cores synchronize, . . . , and so on.
n The programmer has to coordinate the cores explicitly, so such programs are often
extremely complex (see the sketch below).
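n A sketch of that explicit coordination, again with OpenMP (one of several possible tools): each thread runs the task matching its id, and a barrier makes all of them synchronize before continuing. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();

        /* core 0 executes task 0, core 1 executes task 1, ... */
        printf("core %d: executing task %d\n", id, id);

        #pragma omp barrier   /* ... all cores synchronize ... */

        printf("core %d: past the barrier\n", id);
    }
    return 0;
}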
n Implicit Parallelism: There are other options for writing parallel programs—for
example, higher level languages.
n They tend to sacrifice performance to make program development somewhat
easier.
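n As one illustration of a higher-level option (OpenMP's reduction clause; OpenMP is not a fully implicitly parallel language, but it shows the idea): the programmer states what should be combined, and the runtime handles the per-thread partial sums and their combination instead of the explicit receive-and-add coordination above.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    int a[N], sum = 0;
    for (int i = 0; i < N; i++)
        a[i] = 1;                          /* example data */

    /* The reduction clause hides the partial sums and the final
       combination step from the programmer. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);             /* prints 1000 */
    return 0;
}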
Shared-memory vs. distributed-memory architectures