
Parallel Architecture

LECTURE 2

PAGE 1
First, we will show why parallel processing is used.
Then, we will discuss Amdahl's law.
Finally, we will solve some examples.

PAGE 2
• Climate modeling
• Protein folding
• Drug discovery
• Energy research
• Data analysis

PAGE 3
Power = C × V² × F        Performance = Cores × F

Let's have two cores:

Power = 2 × C × V² × F        Performance = 2 × Cores × F

But decrease the frequency by 50% (with the voltage scaling down proportionally):

Power = 2 × C × (V²/4) × (F/2)        Performance = 2 × Cores × (F/2)

Power = (C × V² × F) / 4        Performance = Cores × F
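As a sanity check, the trade-off above can be computed directly. This is a minimal sketch (the helper names `power` and `performance` are mine, not from the lecture; it assumes voltage scales down with frequency, which is what produces the V²/4 term):

```python
def power(cores, C, V, F):
    """Total dynamic power: each core dissipates C * V^2 * F."""
    return cores * C * V**2 * F

def performance(cores, F):
    """Simple throughput proxy: cores times clock frequency."""
    return cores * F

# Normalized baseline: one core, unit capacitance, voltage, and frequency.
C, V, F = 1.0, 1.0, 1.0
base_power = power(1, C, V, F)
base_perf = performance(1, F)

# Two cores, with both voltage and frequency halved.
scaled_power = power(2, C, V / 2, F / 2)   # 2 * C * (V/2)^2 * (F/2) = C*V^2*F / 4
scaled_perf = performance(2, F / 2)        # 2 * (F/2) = F

print(scaled_power / base_power)  # 0.25: a quarter of the power
print(scaled_perf / base_perf)    # 1.0: the same performance
```

This is the core argument for multicore designs: the same throughput at a fraction of the power.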

PAGE 4
Amdahl's law

• Also known as Amdahl's argument, it is used to find the maximum expected improvement to an overall system when only part of the system is improved.

• Amdahl's Law states that potential program speedup is defined by the fraction of code that can be parallelized.

• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (95%) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence the speedup is limited to at most 20×.
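The 20-hour example can be checked numerically. A minimal sketch, with my own function name, applying Amdahl's law as stated:

```python
def speedup(serial, parallelizable, processors):
    """Overall speed-up when only the parallelizable part is divided
    among `processors`; times are in hours."""
    before = serial + parallelizable
    after = serial + parallelizable / processors
    return before / after

# 1 hour serial, 19 hours parallelizable (95%).
print(speedup(1, 19, 10))      # ~6.9
print(speedup(1, 19, 1000))    # ~19.63
# As the processor count grows, the speed-up approaches 20/1 = 20x
# but never reaches it: the serial hour always remains.
```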

PAGE 5
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Example:
A simple design problem illustrates it well. Suppose a
program runs in 100 seconds on a computer, with
multiply operations responsible for 80 seconds of this
time. How much do I have to improve the speed of
multiplication if I want my program to run five times
faster?
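A worked answer (the reasoning is mine, applying the formula above): the target time is 100/5 = 20 s, but the 20 s of non-multiply work is unaffected by the improvement, so no finite improvement factor can reach the target:

```python
def new_time(improvement):
    """Run time after speeding up the 80 s of multiplies by `improvement`;
    the other 20 s are unaffected."""
    return 80 / improvement + 20

for n in (2, 10, 100, 10**6):
    print(n, new_time(n))
# new_time(n) = 80/n + 20 stays strictly above the 20 s target,
# so a 5x overall speed-up is impossible no matter how fast multiplies get.
```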

PAGE 6
Suppose you want to achieve a speed-up of 90 times faster with 100 processors.
What percentage of the original computation can be sequential?

Speed-up = Execution time before / Execution time after

Speed-up = Execution time before / ((Execution time affected / Amount of improvement) + Execution time unaffected)

• This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by improvement is expressed as a fraction of the original execution time.
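The 90× question can be solved with this rewritten form (variable names are mine): with sequential fraction s and p processors, speed-up = 1 / (s + (1 − s)/p). Setting this to 90 with p = 100 and solving gives s = 1/891:

```python
def speedup(s, p):
    """Amdahl speed-up for sequential fraction s on p processors,
    taking the original execution time as 1."""
    return 1 / (s + (1 - s) / p)

# Solving 1 / (s + (1 - s)/100) = 90 algebraically:
# multiply through by 900:  900*s + 9*(1 - s) = 10  =>  891*s = 1
s = 1 / 891
print(round(speedup(s, 100), 6))  # 90.0
print(round(100 * s, 3))          # 0.112: about 0.11% of the work may be sequential
```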

PAGE 7
• Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10. For now let's assume only the matrix sum is parallelizable. What speed-up do you get with 10 versus 40 processors?

• Next, calculate the speed-ups assuming the matrices grow to 20 by 20.
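The question can be sketched as follows (my own helper; times are in units of one addition time t, with the 10 scalar additions treated as the serial part):

```python
def speedup(n, processors):
    """Speed-up for the 10 serial scalar adds plus an n x n matrix sum
    spread across `processors`."""
    serial = 10
    parallel = n * n
    before = serial + parallel
    after = serial + parallel / processors
    return before / after

print(speedup(10, 10))   # 110/20   = 5.5
print(speedup(10, 40))   # 110/12.5 = 8.8
print(speedup(20, 10))   # 410/50   = 8.2
print(speedup(20, 40))   # 410/20   = 20.5
```

Note how the larger 20 by 20 problem scales much better: the parallel fraction grows while the serial part stays fixed.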

PAGE 8
To achieve the speed-up of 20.5 on the previous larger problem with 40 processors, we assumed the load was perfectly balanced. That is, each of the 40 processors had 2.5% of the work to do. Instead, show the impact on speed-up if one processor's load is higher than all the rest. Calculate at twice the load (5%) and five times the load (12.5%) for that hardest-working processor. How well utilized are the rest of the processors?

PAGE 9
For a 10×10 array and 40 processors:

t_total without improvement = t_parallelizable without improvement + t_serial

t_total without improvement = 100t + 10t = 110t

The load is distributed as follows. One processor's load is 5% of the load that can be improved (100 additions). We have 100 additions, so 5% of this load will be on one processor:

load for one processor = 5% × total load that can be improved
load for one processor = 5% × 100 = 0.05 × 100 = 5

The remaining 39 processors will share the remaining load:

remaining load on 39 processors = 95% × total load that can be improved
remaining load on 39 processors = 95% × 100 = 0.95 × 100 = 95

Or simply, it is the total load which can be improved minus the load on the one processor:

= 100 − 5 = 95
PAGE 10
Now let us calculate the total time after improvement.

According to Amdahl's law:

t_after improvement = t_improved + t_serial

However, t_improved is the longest time required to calculate the parallel array addition: it could be either the time for the single processor or the time for the remaining 39 processors, whichever is highest.

So:

t_after improvement = max(t_one processor, t_remaining 39 processors) + t_serial

where the max operation returns one value, which is the highest value among its arguments.

Now, t_one processor is equal to the load over the number of processors. We have 5 additions on one processor, so:

t_one processor = 5t / 1 = 5t

We multiply the number of additions by t (the time for one addition), so we get 5t.

PAGE 11
t_remaining 39 processors is equal to the load of the 39 processors over the number of processors. We have 95 additions over 39 processors, so:

t_remaining 39 processors = 95t / 39

For simplicity we can say 100t / 40; however, in exams the exact value should be used, which is 95t / 39.

Returning to our equation:

t_after improvement = max(t_one processor, t_remaining 39 processors) + t_serial

t_after improvement = max(5t, 95t/39) + 10t

t_after improvement = 5t + 10t = 15t

The speed-up will be:

speed-up = t_without improvement / t_with improvement

speed-up = 110t / 15t = 7.333
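The whole calculation can be packaged into one function and checked numerically (a sketch with my own names; loads are in units of one addition time t):

```python
def speedup(frac, parallel=100, serial=10, procs=40):
    """Speed-up when one processor carries `frac` of the parallel load
    and the remaining procs - 1 processors share the rest evenly."""
    t_one = frac * parallel                        # e.g. 5% of 100 = 5t
    t_rest = (1 - frac) * parallel / (procs - 1)   # e.g. 95t / 39
    t_after = max(t_one, t_rest) + serial          # slowest path plus serial part
    return (parallel + serial) / t_after

print(round(speedup(0.05), 3))    # 7.333  (twice the balanced 2.5% load)
print(round(speedup(0.125), 3))   # 4.889  (five times the balanced load)
```

In both cases the other 39 processors finish early and sit idle waiting for the overloaded one, which is exactly the utilization question the slide asks about.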
The student should solve for a 20×20 array size as well and find the speed-up.

PAGE 12
• Architecture/Systems Continuum
• Flynn’s classification scheme

PAGE 13
The term coupling refers to the act of joining things together, such as the links of a chain. In computing, coupling refers to the method of interconnecting the components in a system or network and how much those components, also called elements, depend on each other.

Tightly Coupled:
• Vector processors
• Instruction-level parallelism (ILP)
• Memory-level parallelism (MLP)
• Multi-core processors

Loosely Coupled:
• Clusters
• Grid
• Client/server model

PAGE 14
• The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, either of the instruction or data streams can be single or multiple.

PAGE 15
Computer organizations are characterized by the multiplicity of the hardware provided to
service the instruction and data streams. Listed below are Flynn's four machine
organizations:

• Single instruction stream-single data stream (SISD)

• Single instruction stream-multiple data stream (SIMD)

• Multiple instruction stream-single data stream (MISD)

• Multiple instruction stream-multiple data stream (MIMD)

PAGE 16
