Parallel2 PDF
LECTURE 2
First, we will show why parallel processing is used.
Then, we will discuss Amdahl's law.
Finally, we will solve some examples.
PAGE 2
Climate modeling, protein folding, drug discovery, energy research, data analysis
PAGE 3
$\text{Power} = C \times V^{2} \times F$
$\text{Performance} = \text{Cores} \times F$
PAGE 4
Amdahl's law
• Amdahl's Law states that potential program speedup is defined by the fraction of
code that can be parallelized
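A common way to write the law, using symbols that are assumptions here rather than the slide's own notation: if a fraction P of the running time can be parallelized over N processors, then speed-up = 1 / ((1 - P) + P / N). A minimal Python sketch of that relationship:

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speed-up predicted by Amdahl's law.

    parallel_fraction (P) and processors (N) are illustrative names,
    not notation taken from the lecture.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided by N.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        row = [round(amdahl_speedup(p, n), 2) for n in (2, 10, 100)]
        print(f"P = {p}: speed-up for N = 2, 10, 100 -> {row}")
```

Even with many processors, the speed-up stays below 1 / (1 - P), which is why the serial fraction dominates.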
PAGE 5
Example:
A simple design problem illustrates it well. Suppose a
program runs in 100 seconds on a computer, with
multiply operations responsible for 80 seconds of this
time. How much do I have to improve the speed of
multiplication if I want my program to run five times
faster?
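The example can be checked numerically. The sketch below assumes the usual form of the relation, time after = (time affected / improvement factor) + time unaffected; the function and parameter names are illustrative only.

```python
def required_improvement(total_time, affected_time, target_speedup):
    """Factor by which the affected part must be sped up to reach
    target_speedup, or None if no finite factor is enough.

    Assumes: time_after = affected_time / factor + unaffected_time.
    """
    unaffected_time = total_time - affected_time
    target_time = total_time / target_speedup
    # Time budget left for the affected part after the unaffected part runs.
    budget = target_time - unaffected_time
    if budget <= 0:
        return None
    return affected_time / budget


if __name__ == "__main__":
    # 100 s in total, 80 s of multiplication, target: five times faster.
    print(required_improvement(100, 80, 5))
    # A more modest target of four times faster, for comparison.
    print(required_improvement(100, 80, 4))
```

For the five-times target the function returns None: the target time is 20 seconds, and the 20 seconds unaffected by the improvement already use all of it, so no speed-up of multiplication alone can reach the goal. The four-times target needs multiplication to be 16 times faster.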
PAGE 6
Suppose you want to achieve a speed-up of 90 with 100 processors.
What percentage of the original computation can be sequential?
• The formula is usually rewritten assuming that the execution time before improvement is 1
for some unit of time, and that the execution time affected by the improvement is
expressed as a fraction of the original execution time
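With the time before improvement set to 1 and F denoting the fraction affected by the improvement (a symbol assumed here), the rewritten formula is speed-up = 1 / ((1 - F) + F / n) for n processors. Solving for F gives F = (1 - 1/S) / (1 - 1/n), and the sequential share is 1 - F. A short sketch of that calculation:

```python
def max_sequential_fraction(target_speedup: float, processors: int) -> float:
    """Largest sequential fraction that still allows target_speedup.

    Derived from speedup = 1 / ((1 - F) + F / n), solved for F.
    Names are illustrative, not taken from the lecture.
    """
    if processors < 2:
        raise ValueError("need at least 2 processors")
    if not 1.0 <= target_speedup <= processors:
        raise ValueError("target_speedup must lie between 1 and processors")
    parallel_fraction = (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)
    return 1.0 - parallel_fraction


if __name__ == "__main__":
    seq = max_sequential_fraction(90, 100)
    print(f"sequential fraction: {seq:.5f} ({seq * 100:.2f}%)")
```

For a speed-up of 90 on 100 processors this comes out to roughly 0.1% of the original computation, so almost all of the work must be parallelizable.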
PAGE 7
• Suppose you want to perform two sums: one is a sum
of 10 scalar variables, and one is a matrix sum of a
pair of two-dimensional arrays, with dimensions 10 by
10. For now, let’s assume only the matrix sum is
parallelizable. What speed-up do you get with 10
versus 40 processors?
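A minimal sketch of this calculation, assuming every addition takes the same time t (so times are counted in units of t) and that the matrix additions divide evenly among the processors. The function name is illustrative.

```python
def two_sums_speedup(scalar_adds: int, matrix_adds: int, processors: int) -> float:
    """Speed-up when only the matrix additions are parallelized.

    Times are in units of t, the time of one addition (an assumption).
    """
    time_before = scalar_adds + matrix_adds
    # Scalar additions stay sequential; matrix additions are split evenly.
    time_after = scalar_adds + matrix_adds / processors
    return time_before / time_after


if __name__ == "__main__":
    for n in (10, 40):
        s = two_sums_speedup(scalar_adds=10, matrix_adds=10 * 10, processors=n)
        print(f"10 x 10 matrix, {n} processors: speed-up = {s:.2f}")
```

With 10 processors the time falls from 110t to 20t (speed-up 5.5); with 40 processors it falls to 12.5t (speed-up 8.8). Quadrupling the processor count gives far less than four times the speed-up because the 10 scalar additions stay sequential.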
PAGE 8
To achieve the speed-up of 20.5 on the previous larger
problem with 40 processors, we assumed the load was
perfectly balanced. That is, each of the 40 processors
had 2.5% of the work to do. Instead, show the impact on
speed-up if one processor’s load is higher than all the
rest. Calculate at twice the load (5%) and five times the
load (12.5%) for that hardest working processor. How
well utilized are the rest of the processors?
PAGE 9
For a 10 × 10 array and 40 processors:
$t_{\text{total without improvement}} = t_{\text{parallel without improvement}} + t_{\text{scalar}} = 100t + 10t = 110t$
One processor's load is 5% of the load that can be improved (100 additions):
$\text{load for one processor} = 5\% \times \text{total load that could be improved}$
$\text{load for one processor} = 5\% \times 100 = 0.05 \times 100 = 5$
The remaining 39 processors share the remaining load:
$\text{remaining load of 39 processors} = 95\% \times \text{total load that could be improved}$
$\text{remaining load of 39 processors} = 95\% \times 100 = 0.95 \times 100 = 95$
Or, more simply, it is the total load that can be improved minus the load on the one processor:
$100 - 5 = 95$
PAGE 10
Now let us calculate the total time after improvement:
$t_{\text{after improvement}} = t_{\text{improved}} + t_{\text{scalar}}$
However, the improved time is the longest time required to compute the parallel array addition. It is either the time for the single processor or the time for the remaining 39 processors, whichever is higher. So
$t_{\text{improved}} = \max(t_{\text{one processor}},\ t_{\text{remaining processors}})$
Now $t_{\text{one processor}}$ is equal to its load divided by the number of processors, which here is 1:
$t_{\text{one processor}} = \frac{5t}{1} = 5t$
PAGE 11
$t_{r39}$ is equal to the load of the 39 processors divided by the number of processors:
$t_{r39} = t_{\text{remaining processors}}$
We have 95 additions over 39 processors, so
$t_{r39} = t_{\text{remaining processors}} = \frac{95t}{39}$
For simplicity we can say $100t/40$; however, in exams the exact value should be used, which is $95t/39$.
$t_{\text{after improvement}} = \max(t_{\text{one processor}},\ t_{\text{remaining processors}}) + t_{\text{scalar}}$
$t_{\text{after improvement}} = \max\left(\frac{5t}{1},\ \frac{95t}{39}\right) + 10t$
$t_{\text{after improvement}} = 5t + 10t = 15t$
The speed-up will be
$\text{speed-up} = \frac{t_{\text{total without improvement}}}{t_{\text{after improvement}}}$
$\text{speed-up} = \frac{110t}{15t} = 7.333$
Students should also solve for a 20 × 20 array and find the speed-up.
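The same steps can be sketched in Python to check the result and to try the other load value from the question (12.5%). The function and parameter names are illustrative. The "utilization" returned here is an assumption about how to read the question: the fraction of the parallel phase during which each of the other processors is busy.

```python
def imbalanced_speedup(matrix_adds, scalar_adds, processors, hot_share):
    """Speed-up when one processor takes hot_share of the parallel work.

    Times are in units of t, the time of one addition. Returns
    (speedup, utilization of each remaining processor).
    """
    hot_load = hot_share * matrix_adds                      # the busiest processor
    rest_load = (matrix_adds - hot_load) / (processors - 1)  # each other processor
    # The parallel phase ends only when the slowest processor finishes.
    parallel_time = max(hot_load, rest_load)
    time_after = parallel_time + scalar_adds
    time_before = matrix_adds + scalar_adds
    return time_before / time_after, rest_load / parallel_time


if __name__ == "__main__":
    for share in (0.05, 0.125):
        speedup, util = imbalanced_speedup(100, 10, 40, share)
        print(f"hot share {share:.1%}: speed-up {speedup:.3f}, "
              f"other processors busy {util:.1%} of the parallel phase")
```

For the 10 × 10 array, a 5% share reproduces the speed-up of about 7.33, and a 12.5% share lowers it to about 4.89. Passing matrix_adds=400 covers the 20 × 20 case.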
PAGE 12
• Architecture/Systems Continuum
• Flynn’s classification scheme
PAGE 13
In general, the term coupling refers to the act of joining things together, such as the
links of a chain. Here, it refers to the method of interconnecting
the components in a system or network and to how much those components,
also called elements, depend on each other.
Vector processors, clusters
PAGE 14
• The most popular taxonomy of computer architecture was
defined by Flynn in 1966. Flynn’s classification scheme is
based on the notion of a stream of information. Two types of
information flow into a processor: instructions and data. The
instruction stream is defined as the sequence of instructions
performed by the processing unit. The data stream is defined
as the data traffic exchanged between the memory and the
processing unit. According to Flynn’s classification, either of
the instruction or data streams can be single or multiple.
PAGE 15
Computer organizations are characterized by the multiplicity of the hardware provided to
service the instruction and data streams. Listed below are Flynn's four machine
organizations:
PAGE 16